Unsupervised web page classification refers to the problem of clustering the pages in a web site so that each cluster contains web pages that belong to a single class. Existing web page classification proposals do not fulfill a number of requirements that would make them suitable for enterprise web information integration, namely: to be based on lightweight crawling, so as to avoid interfering with the normal operation of the web site; to be unsupervised, which avoids the need for a training set of pre-classified pages; and to use features from outside the page to be classified, which avoids having to download it. In this article, we propose CALA, a new automated proposal to generate URL-based web page classifiers. Our proposal builds a number of URL patterns that represent the different classes of pages in a web site, so that further pages can be classified by matching their URLs against the patterns. Its salient features are that it fulfills all of the previous requirements and that it has been validated by a number of experiments on real-world, top-visited web sites. Our validation shows that CALA is effective and efficient in practice.
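To make the idea of classifying a page by matching its URL against patterns more concrete, the following is a minimal sketch in Python. It is not CALA's pattern-building algorithm, which the abstract does not describe. The pattern representation (a list of tokens with a `*` wildcard), the helper names `tokenize`, `matches`, and `classify`, the class labels, and the `shop.example.com` URLs are all assumptions introduced here for illustration. The sketch only shows the final matching step, under the assumption that a set of patterns is already available.

```python
from urllib.parse import urlsplit

WILDCARD = "*"  # hypothetical marker: matches any single URL token


def tokenize(url):
    """Split a URL into its host followed by its non-empty path segments."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return [parts.netloc] + segments


def matches(pattern, tokens):
    """A pattern matches when lengths agree and every token is equal or wildcarded."""
    if len(pattern) != len(tokens):
        return False
    return all(p == WILDCARD or p == t for p, t in zip(pattern, tokens))


def classify(url, patterns):
    """Return the label of the first matching pattern, or None if none matches."""
    tokens = tokenize(url)
    for label, pattern in patterns.items():
        if matches(pattern, tokens):
            return label
    return None


# Hypothetical patterns for an imaginary site; a real system would learn these.
PATTERNS = {
    "product": ["shop.example.com", "product", WILDCARD],
    "category": ["shop.example.com", "category", WILDCARD],
}

if __name__ == "__main__":
    for url in (
        "http://shop.example.com/product/1234",
        "http://shop.example.com/category/books",
        "http://shop.example.com/about",
    ):
        print(url, "->", classify(url, PATTERNS))
```

Because the classifier inspects only the URL string, a page can be assigned to a class without being downloaded, which is the property the abstract emphasizes.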