In this Internet era, the WWW is flooded with a vast amount of information, much of it replicated or irrelevant. Because unnecessary and duplicated web pages increase indexing space and time complexity, finding and removing them has become a significant issue for the information retrieval and web mining research communities, since most people rely on search engines to find the information they need. Web content outlier mining plays a decisive role in addressing this problem. Existing algorithms for web content outlier mining focus on applying weights only to structured documents. In this research work, a mathematical approach based on two-way rectangular representations, a signed approach to trust rating, and a correlation method is developed for retrieving the right information, free of duplicates, from both structured and unstructured web documents.
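To make the correlation idea concrete, the following is a minimal sketch, not the method of this work, whose details are not given here. It assumes that a "rectangular representation" can be read as a term-by-document count matrix and that duplicates are pages whose term-count columns correlate strongly. The function names, the tokenizer, and the 0.9 threshold are all hypothetical choices made for illustration, and the trust-rating step is not modeled.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count alphanumeric tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def build_matrix(docs):
    """Build a term-by-document count matrix: one row per term, one column per page."""
    counts = [term_counts(d) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in vocab]
    return vocab, matrix


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def find_duplicates(docs, threshold=0.9):
    """Return index pairs (i, j), i < j, whose term-count columns correlate
    at or above the threshold (an illustrative cutoff, not a published value)."""
    _, matrix = build_matrix(docs)
    columns = list(zip(*matrix))  # transpose: one tuple of counts per page
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if pearson(columns[i], columns[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Because the sketch works on plain token counts, it applies equally to text extracted from structured (tagged) and unstructured pages. A weighting scheme for structural elements would have to be added separately.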