AUTOBIB: Automatic Extraction of Bibliographic Information on the Web

Authors:
Junfei Geng;Jun Yang
Affiliations:
Duke University;Duke University
Venue:
IDEAS '04 Proceedings of the International Database Engineering and Applications Symposium
Year:
2004

Citing 0
Cited 7

Reference metadata extraction using a hierarchical knowledge representation framework
Decision Support Systems

A simple method for citation metadata extraction using hidden markov models
Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries

Perception-oriented online news extraction
Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries

Automatic wrapper induction from hidden-web sources with domain knowledge
Proceedings of the 10th ACM workshop on Web information and data management

An adaptive bottom up clustering approach for web news extraction
WOCC'09 Proceedings of the 18th international conference on Wireless and Optical Communications Conference

A trigram hidden Markov model for metadata extraction from heterogeneous references
Information Sciences: an International Journal

Recognising document components in XML-based academic articles
Proceedings of the 2013 ACM symposium on Document engineering

Abstract

The Web has greatly facilitated access to information. However, information presented in HTML is mainly intended to be browsed by humans, and the problem of automatically extracting such information remains an important and challenging task. In this work, we focus on building a system called AUTOBIB to automate extraction of bibliographic information on the Web. We use a combination of bootstrapping, statistical, and heuristic methods to achieve a high degree of automation. To set up extraction from a new site, we only need to provide a few lines of code specifying how to download pages containing bibliographic information. We do not need to be concerned with each site's presentation format, and the system can cope with changes in the presentation format without human intervention. AUTOBIB bootstraps itself with a small seed database of structured bibliographic records. For each bibliographic Web site, we identify segments within its pages that represent bibliographic records, using state-of-the-art record-boundary discovery techniques. Next, we find matches for some of these "raw records" in the seed database using a set of heuristics. These matches serve as a training set for a parser based on the Hidden Markov Model (HMM), which is then used to parse the rest of the raw records into structured records. We have found an effective HMM structure with special states that correspond to delimiters and HTML tags in raw records. Experiments demonstrate that for our application, this HMM structure achieves high success rates without the complexity of previously proposed structures.
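
The bootstrapping loop described in the abstract (match some raw records against a seed database, train an HMM on the matches, parse the remaining records) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the toy seed records and raw strings are invented, the matching heuristic is a simple word lookup, the HMM has only four states (`author`, `title`, `year`, and a single `delim` state for delimiters), tokens are reduced to three coarse symbols, and add-one smoothing is used. HTML-tag states, which the abstract also mentions, are omitted.

```python
import math
import re
from collections import Counter, defaultdict

STATES = ["author", "title", "year", "delim"]  # "delim" is the special delimiter state
N_SYMBOLS = 3                                  # WORD, YEAR, PUNCT

# Hypothetical seed database and raw records (toy data, not from the paper).
SEED = [
    {"author": "Ann Lee", "title": "Parsing Web Pages", "year": "2001"},
    {"author": "Raj Patel", "title": "Wrapper Induction Basics", "year": "2003"},
]
RAW = [
    "Ann Lee. Parsing Web Pages. 2001",
    "Raj Patel, Wrapper Induction Basics, 2003",
    "Mia Chen. Learning Record Boundaries Quickly. 1999",
]

def tokenize(text):
    """Split into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def symbol(tok):
    """Map a token to a coarse observation symbol."""
    if re.fullmatch(r"(19|20)\d\d", tok):
        return "YEAR"
    return "WORD" if tok[0].isalnum() else "PUNCT"

def label_from_seed(raw, seed):
    """Assumed heuristic: label each token by the seed field containing it.
    Returns None if any non-punctuation token is absent from the seed record."""
    labels = []
    for tok in tokenize(raw):
        if symbol(tok) == "PUNCT":
            labels.append("delim")
            continue
        field = next((f for f in ("author", "title", "year")
                      if tok.lower() in seed[f].lower().split()), None)
        if field is None:
            return None
        labels.append(field)
    return labels

def train(labelled):
    """Count transitions and emissions from (tokens, labels) pairs."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for tokens, labels in labelled:
        prev = "START"
        for tok, lab in zip(tokens, labels):
            trans[prev][lab] += 1
            emit[lab][symbol(tok)] += 1
            prev = lab
    return trans, emit

def logp(counter, key, n_outcomes):
    """Add-one smoothed log probability."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(tokens, trans, emit):
    """Most likely state sequence for the tokens under the trained HMM."""
    syms = [symbol(t) for t in tokens]
    score = {s: logp(trans["START"], s, len(STATES)) + logp(emit[s], syms[0], N_SYMBOLS)
             for s in STATES}
    back = []
    for sym in syms[1:]:
        new, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: score[p] + logp(trans[p], s, len(STATES)))
            new[s] = (score[best] + logp(trans[best], s, len(STATES))
                      + logp(emit[s], sym, N_SYMBOLS))
            ptr[s] = best
        back.append(ptr)
        score = new
    state = max(STATES, key=score.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Bootstrap: seed matches become training data; the rest are parsed by the HMM.
labelled, unmatched = [], []
for raw in RAW:
    labels = next((l for l in (label_from_seed(raw, s) for s in SEED) if l), None)
    if labels:
        labelled.append((tokenize(raw), labels))
    else:
        unmatched.append(raw)

trans, emit = train(labelled)
parsed = [(raw, viterbi(tokenize(raw), trans, emit)) for raw in unmatched]
for raw, path in parsed:
    print(list(zip(tokenize(raw), path)))
```

In this toy run the first two raw records match the seed database and supply the training set, while the third is labelled by Viterbi decoding. The dedicated `delim` state lets the model learn field order from the training records even though the two sites use different delimiters (period versus comma).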