Web-based closed-domain data extraction on online advertisements

Authors:
Maria S. Pera;Rani Qumsiyeh;Yiu-Kai Ng
Affiliations:
Computer Science Department, Brigham Young University, Provo, UT 84602, United States;Computer Science Department, Brigham Young University, Provo, UT 84602, United States;Computer Science Department, Brigham Young University, Provo, UT 84602, United States
Venue:
Information Systems
Year:
2013

Abstract

Taking advantage of the popularity of the web, online marketplaces such as Ebay.com, advertisement (ad) websites such as Craigslist.org, and commercial websites such as Carmax.com allow users to post ads for a variety of products and services. Instead of browsing numerous websites to locate ads of interest, web users would benefit from a single, fully integrated database (DB) of ads in multiple domains, such as Cars-for-Sale and Job-Postings, populated from various online sources so that ads of interest can be retrieved at a centralized site. Because existing ads websites impose their own structures and formats for storing and accessing ads, generating a uniform, integrated ads repository is not a trivial task. The challenges include (i) identifying ads domains, (ii) dealing with the diversity in the structures of ads across ads domains, and (iii) analyzing data that carry different meanings in each ads domain. To handle these problems, we introduce ADEx, a tool that relies on various machine learning approaches to automate the extraction of unstructured, semi-structured, and fully structured data from online ads. ADEx creates ads records, archived in an underlying DB, through domain classification, keyword tagging, and identification of valid attribute values. Experimental results on a dataset of 18,000 online ads drawn from Craigslist, Ebay, and KSL.com show that ADEx outperforms existing text classification, keyword labeling, and data extraction approaches. Further evaluations verify that ADEx either outperforms or performs at least as well as current state-of-the-art information extractors in mapping data from unstructured or (semi-)structured sources into DB records.
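
The three stages named in the abstract (domain classification, keyword tagging, and identification of valid attribute values) can be illustrated with a toy sketch. This is not ADEx itself: the abstract says only that ADEx relies on machine learning approaches, so the sketch swaps in hand-written keyword sets and simple rules. Every name here (`DOMAINS`, `classify_domain`, `tag_keywords`, `build_record`, `extract`) and every vocabulary entry is a hypothetical chosen for illustration.

```python
import re

# Hypothetical, hand-written domain descriptions. ADEx learns its models
# from data; these tiny dictionaries only stand in for them.
DOMAINS = {
    "Cars-for-Sale": {
        "keywords": {"sedan", "miles", "mileage", "honda", "toyota"},
        "values": {"make": {"honda", "toyota", "ford"},
                   "color": {"red", "blue", "black"}},
        "money": "price",   # a dollar amount is the asking price here
    },
    "Job-Postings": {
        "keywords": {"hiring", "salary", "resume", "full-time"},
        "values": {"employment_type": {"full-time", "part-time"}},
        "money": "salary",  # the same kind of token means pay in this domain
    },
}

# Dollar amounts / numbers, or (possibly hyphenated) lowercase words.
TOKEN = re.compile(r"\$?\d[\d,]*|[a-z]+(?:-[a-z]+)*")


def tokenize(text):
    """Lowercase the ad and split it into word and number tokens."""
    return [t.rstrip(",") for t in TOKEN.findall(text.lower())]


def classify_domain(tokens):
    """Stage 1: pick the domain whose keyword set overlaps the ad most."""
    seen = set(tokens)
    return max(DOMAINS, key=lambda d: len(DOMAINS[d]["keywords"] & seen))


def tag_keywords(tokens, domain):
    """Stage 2: label each token with an attribute name, or "O" for none."""
    spec = DOMAINS[domain]
    tagged = []
    for tok in tokens:
        label = "O"
        for attr, valid in spec["values"].items():
            if tok in valid:
                label = attr
        if tok.startswith("$"):
            label = spec["money"]
        elif re.fullmatch(r"(19|20)\d\d", tok):
            label = "year"
        tagged.append((tok, label))
    return tagged


def build_record(tagged, domain):
    """Stage 3: keep the first value per attribute and normalize numbers."""
    record = {"domain": domain}
    for tok, label in tagged:
        if label == "O" or label in record:
            continue
        if tok.startswith("$"):
            record[label] = int(tok[1:].replace(",", ""))
        elif label == "year":
            record[label] = int(tok)
        else:
            record[label] = tok
    return record


def extract(ad_text):
    """Run the three stages on one ad and return a DB-style record."""
    tokens = tokenize(ad_text)
    domain = classify_domain(tokens)
    return build_record(tag_keywords(tokens, domain), domain)
```

Calling `extract("2005 Honda Civic sedan, blue, 98000 miles, $4,500")` classifies the ad as Cars-for-Sale and maps the dollar amount to `price`, whereas a dollar amount in a Job-Postings ad is mapped to `salary`. This is a small instance of challenge (iii), where the same kind of data carries different meanings in different ads domains.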