Term-weighting approaches in automatic text retrieval
Information Processing and Management: an International Journal
Noise reduction in a statistical approach to text categorization
SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
Training algorithms for linear text classifiers
SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7
Generating finite-state transducers for semi-structured data extraction from the Web
Information Systems - Special issue on semistructured data
A re-examination of text categorization methods
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
KEA: practical automatic keyphrase extraction
Proceedings of the fourth ACM conference on Digital libraries
Authoritative sources in a hyperlinked environment
Journal of the ACM (JACM)
Measuring Search Engine Quality
Information Retrieval
Innovating web page classification through reducing noise
Journal of Computer Science and Technology
A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Finding Near-Replicas of Documents and Servers on the Web
WebDB '98 Selected papers from the International Workshop on The World Wide Web and Databases
Discovering informative content blocks from Web documents
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Eliminating noisy information in Web pages for data mining
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Finding similar files in a large file system
WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference
Automatically Generating an E-textbook on the Web
World Wide Web
Near-replicas of web pages detection efficient algorithm based on single MD5 fingerprint
ICAI'07 Proceedings of the 8th WSEAS International Conference on Automation and Information - Volume 8
Concept hierarchy construction by combining spectral clustering and subsumption estimation
WISE'06 Proceedings of the 7th international conference on Web Information Systems Engineering
Enhancing duplicate collection detection through replica boundary discovery
PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Aiming to meet the common requirements of several typical Web applications, we propose a new preprocessing framework and a corresponding approach. The framework comprises three parts: Web page cleaning, replica removal, and Web page integration. After the preprocessing stage, Web pages are purified and transformed into a general model called DocView. The model consists of eight elements: identifier, type, content classification code, title, keywords, abstract, topic content, and relevant hyperlinks. The first six are metadata, while the last two are content data. The approach first partitions a page into content blocks according to selected tags in the markup tag tree. Based on a set of heuristics, it identifies the blocks that contain the topic content of the page, and a quantitative measure (a feature vector) of the blocks with respect to the topic is then obtained. From this topic feature vector, the elements of DocView are extracted by corresponding algorithms. The main advantage of our approach is that it needs no information beyond the raw page, whereas previous related work usually requires additional information. The preprocessing framework and approach have been applied to our search engine (Tianwang [15]) and to a Web page classification system. The clear improvement observed in these applications demonstrates the practicability of the framework and the validity of the approach. After such a preprocessing stage, a well-formed, purified, easily manipulated information layer can be built on top of any Web page collection (including the WWW) for Web applications.
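The abstract's tag-based partitioning and topic-block heuristics could be sketched roughly as follows. This is a minimal illustration only, not the paper's actual algorithm: the partition tag set, the length threshold, and the link-density ratio are all assumed parameters, and the paper's real heuristics and feature vector are richer than this.

```python
from html.parser import HTMLParser

# Assumed set of tags at which a page is partitioned into content blocks.
BLOCK_TAGS = {"table", "div", "p", "ul", "td"}


class BlockPartitioner(HTMLParser):
    """Split a page into content blocks at selected markup tags,
    tracking how much of each block's text sits inside anchors."""

    def __init__(self):
        super().__init__()
        self.blocks = [""]     # accumulated text per block
        self.link_chars = [0]  # characters inside <a>...</a> per block
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:          # start a new block at a partition tag
            self.blocks.append("")
            self.link_chars.append(0)
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        text = data.strip()
        self.blocks[-1] += text
        if self.in_link:
            self.link_chars[-1] += len(text)


def topic_blocks(html, min_len=20, max_link_ratio=0.5):
    """Toy heuristic: a topic-content block has enough text and a low
    ratio of anchor text (navigation bars are short and link-dense)."""
    parser = BlockPartitioner()
    parser.feed(html)
    selected = []
    for text, links in zip(parser.blocks, parser.link_chars):
        if len(text) >= min_len and links / max(len(text), 1) <= max_link_ratio:
            selected.append(text)
    return selected
```

For example, a page whose `<div>` holds only navigation anchors and whose `<p>` holds a long paragraph would have only the paragraph block selected; the paper then extracts the DocView elements (title, keywords, abstract, and so on) from such topic blocks.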