An efficient approach to clustering real-estate listings

Authors:
Maciej Grzenda;Deepak Thukral
Affiliations:
Warsaw University of Technology, Faculty of Mathematics and Information Science, Warszawa, Poland;TESOBE Music Pictures Ltd., Berlin, Germany
Venue:
IDEAL'10 Proceedings of the 11th international conference on Intelligent data engineering and automated learning
Year:
2010

Abstract

The World Wide Web (WWW) is a vast source of information, and the problem of information overload is more acute than ever. Because of noise on the Web, finding usable information is becoming increasingly hard. Real-estate listings are frequently offered through several real-estate agencies and published on different web sites. As a consequence, the price and description of the same property can differ between listings. At the same time, a potential buyer or renter may prefer to see a complete description of a property of interest, compiled from the data available on different portals, and, if possible, to track changes in its price. This problem illustrates a wider class of problems concerning the integration of data from numerous semi-structured web data sources. The paper investigates how clustering algorithms can be used to identify individual real-estate properties described on different portals. Clustering algorithms were used to group the records acquired from different web sources. Standard clustering methods were evaluated, and a method using a new distance function that combines the similarity of semi-structured and unstructured data was proposed. The latter approach yielded a substantial improvement in clustering results.
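The abstract does not define the proposed distance function, so the following is only a minimal sketch of the general idea: a distance between two listings that mixes a structured part (numeric fields) with an unstructured part (free-text description). Everything here is an assumption made for illustration and not the authors' method: the field names (`price`, `area`, `rooms`, `description`), the use of relative difference and Jaccard distance, the handling of missing fields, and the `text_weight` parameter.

```python
import re


def token_set(text):
    """Return the set of lowercase word tokens in a free-text description."""
    return set(re.findall(r"\w+", text.lower()))


def text_distance(a, b):
    """Jaccard distance between the token sets of two descriptions."""
    ta, tb = token_set(a), token_set(b)
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def field_distance(x, y):
    """Relative difference of two non-negative numeric fields, in [0, 1].

    Returns None when either value is missing, so the caller can skip it.
    """
    if x is None or y is None:
        return None
    largest = max(abs(x), abs(y))
    return 0.0 if largest == 0 else abs(x - y) / largest


def listing_distance(r1, r2, fields=("price", "area", "rooms"), text_weight=0.5):
    """Weighted mix of structured and unstructured distance (hypothetical)."""
    unstructured = text_distance(r1.get("description", ""), r2.get("description", ""))
    known = [
        d
        for d in (field_distance(r1.get(f), r2.get(f)) for f in fields)
        if d is not None
    ]
    if not known:
        # No comparable numeric field: fall back to the text part alone.
        return unstructured
    structured = sum(known) / len(known)
    return (1.0 - text_weight) * structured + text_weight * unstructured


# Hypothetical records: a and b describe the same flat on two portals.
a = {"price": 250000, "area": 60.0, "rooms": 3,
     "description": "Sunny flat near the park"}
b = {"price": 245000, "area": 60.0, "rooms": 3,
     "description": "Sunny flat near park, renovated"}
c = {"price": 900000, "area": 200.0, "rooms": 7,
     "description": "Detached house with large garden"}

print(listing_distance(a, b), listing_distance(a, c))
```

A pairwise distance of this kind could be passed to any clustering method that accepts a precomputed distance matrix. In this sketch, listings of the same property are expected to end up closer to each other than to unrelated listings, even when their prices and wording differ slightly.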