Self-supervised learning approach for extracting citation information on the web

Authors:
Dat T. Huynh;Wen Hua
Affiliations:
School of Information Technology and Electrical Engineering, The University of Queensland, Australia;School of Information Technology and Electrical Engineering, The University of Queensland, Australia
Venue:
APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
Year:
2012

Citing 12
Cited 1

Automatic segmentation of text into structured records

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Automatic document metadata extraction using support vector machines

Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Mining reference tables for automatic text segmentation

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Integrating Unstructured Data into Relational Databases

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Information extraction from research papers using conditional random fields

Information Processing and Management: an International Journal
Reference metadata extraction using a hierarchical knowledge representation framework

Decision Support Systems
Information Extraction

Foundations and Trends in Databases
A flexible approach for extracting metadata from bibliographic citations

Journal of the American Society for Information Science and Technology
Recovering semantics of tables on the web

Proceedings of the VLDB Endowment

Mining Publication Records on Personal Publication Web Pages Based on Conditional Random Fields

WI-IAT '12 Proceedings of the The 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we propose a framework for automatically training a model to extract citation information on the web. Constructing manually labeled training data to learn an extraction model is tedious, time consuming and difficult to be applied to several styles of citations with different types of entities. To eliminate the requirement of manually labeled training data, we exploit a knowledge base of citation domain and web search to derive labeled training data automatically. Our experiments show that the combination of knowledge base, heuristics and statistical methods can automate the extraction process and achieve good performance.