Foundations of statistical natural language processing
Foundations of statistical natural language processing
Machine Learning
A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Document clustering based on non-negative matrix factorization
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Named entity recognition with a maximum entropy approach
CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
Multiclass reduced-set support vector machines
ICML '06 Proceedings of the 23rd international conference on Machine learning
Relaxed online SVMs for spam filtering
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to Information Retrieval
Introduction to Information Retrieval
Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
The Journal of Machine Learning Research
Extracting data records from the web using tag path clustering
Proceedings of the 18th international conference on World wide web
Automatically Extracting Form Labels
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
An improved hierarchical Bayesian model of language for document classification
COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Creating relational data from unstructured and ungrammatical data sources
Journal of Artificial Intelligence Research
An empirical study on using hidden markov model for search interface segmentation
Proceedings of the 18th ACM conference on Information and knowledge management
ONDUX: on-demand unsupervised learning for information extraction
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Automatic extraction of web data records containing user-generated content
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Automatic wrappers for large scale web extraction
Proceedings of the VLDB Endowment
Leveraging spatial join for robust tuple extraction from web pages
Information Sciences: an International Journal
Hi-index | 0.00 |
Taking advantage of the popularity of the web, online marketplaces such as Ebay (.com), advertisements (ads for short) websites such as Craigslist(.org), and commercial websites such as Carmax(.com) (allow users to) post ads on a variety of products and services. Instead of browsing through numerous websites to locate ads of interest, web users would benefit from the existence of a single, fully integrated database (DB) with ads in multiple domains, such as Cars-for-Sale and Job-Postings, populated from various online sources so that ads of interest could be retrieved at a centralized site. Since existing ads websites impose their own structures and formats for storing and accessing ads, generating a uniform, integrated ads repository is not a trivial task. The challenges include (i) identifying ads domains, (ii) dealing with the diversity in structures of ads in various ads domains, and (iii) analyzing data with different meanings in each ads domain. To handle these problems, we introduce ADEx, a tool that relies on various machine learning approaches to automate the process of extracting (un-/semi-/fully- structured) data from online ads to create ads records archived in an underlying DB through domain classification, keyword tagging, and identification of valid attribute values. Experimental results generated using a dataset of 18,000 online ads originated from Craigslist, Ebay, and KSL(.com) show that ADEx is superior in performance compared with existing text classification, keyword labeling, and data extraction approaches. Further evaluations verify that ADEx either outperforms or performs at least as good as current state-of-the-art information extractors in mapping data from unstructured or (semi-)structured sources into DB records.