FLUX-CIM: flexible unsupervised extraction of citation metadata

Authors:
Eli Cortez;Altigran S. da Silva;Marcos André Gonçalves;Filipe Mesquita;Edleno S. de Moura
Affiliations:
Universidade Federal do Amazonas, Manaus, Brazil;Universidade Federal do Amazonas, Manaus, Brazil;Universidade Federal de Minas Gerais, Belo Horizonte, Brazil;Universidade Federal do Amazonas, Manaus, Brazil;Universidade Federal do Amazonas, Manaus, Brazil
Venue:
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Year:
2007

Citing 23
Cited 14

The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Generating finite-state transducers for semi-structured data extraction from the Web

Information Systems - Special issue on semistructured data
Learning Information Extraction Rules for Semi-Structured and Free Text

Machine Learning - Special issue on natural language learning
Conceptual-model-based data extraction from multiple-record Web pages

Data & Knowledge Engineering
Wrapper induction: efficiency and expressiveness

Artificial Intelligence - Special issue on Intelligent internet systems
A brief survey of web data extraction tools

ACM SIGMOD Record
DEByE - Data extraction by example

Data & Knowledge Engineering
Hierarchical Wrapper Induction for Semistructured Information Sources

Autonomous Agents and Multi-Agent Systems
Digital Libraries and Autonomous Citation Indexing

Computer
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Information Extraction with HMM Structures Learned by Stochastic Optimization

Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence
Automatic document metadata extraction using support vector machines

Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Mining data records in Web pages

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Automatic web news extraction using tree edit distance

Proceedings of the 13th international conference on World Wide Web
Metaextract: an NLP system to automatically assign metadata

Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries
Automatic extraction of titles from general documents using machine learning

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Developing practical automatic metadata assignment and evaluation tools for internet resources

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
Link-based similarity measures for the classification of Web documents

Journal of the American Society for Information Science and Technology
A comparative study of citations and links in document classification

Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
LABRADOR: Efficiently publishing relational databases on the web by using keyword-based query interfaces

Information Processing and Management: an International Journal
Are your citations clean?

Communications of the ACM
An analysis of research on information reuse and integration

IRI'09 Proceedings of the 10th IEEE international conference on Information Reuse & Integration

A simple method for citation metadata extraction using hidden markov models

Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries
CEBBIP: a parser of bibliographic information in chinese electronic books

Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries
An environment for building, exploring and querying academic social networks

Proceedings of the International Conference on Management of Emergent Digital EcoSystems
FireCite: lightweight real-time reference string extraction from webpages

NLPIR4DL '09 Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries
ONDUX: on-demand unsupervised learning for information extraction

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Unsupervised strategies for information extraction by text segmentation

Proceedings of the Fourth SIGMOD PhD Workshop on Innovative Database Research
Meta-metadata: a metadata semantics language for collection representation applications

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Local adaptive extraction of references

KI'10 Proceedings of the 33rd annual German conference on Advances in artificial intelligence
A trigram hidden Markov model for metadata extraction from heterogeneous references

Information Sciences: an International Journal
A hybrid two-stage approach for discipline-independent canonical representation extraction from references

Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries
Web-based citation parsing, correction and augmentation

Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries
Improved bibliographic reference parsing based on repeated patterns

TPDL'12 Proceedings of the Second international conference on Theory and Practice of Digital Libraries
Extracting and matching authors and affiliations in scholarly documents

Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Research endogamy as an indicator of conference quality

ACM SIGMOD Record

Abstract

In this paper we propose a knowledge-based approach to help extract the correct components of citations in any given format. Unlike related approaches that rely on manually built knowledge bases (KBs) to recognize the components of a citation, our approach constructs the KB automatically from an existing set of sample metadata records from a given area (e.g., computer science or health sciences). It does not rely on patterns that encode the specific delimiters of a particular citation style. It is also unsupervised, in the sense that it does not rely on a learning method that requires a training phase. These features give our technique a high degree of automation and flexibility. To demonstrate the effectiveness and applicability of the proposed approach, we ran experiments in which we applied it to extract information from citations in papers from two different domains. The results indicate precision and recall levels above 94% and perfect extraction for the large majority of the citations tested.
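The general idea of building a KB automatically from sample metadata records and using it to label the parts of a citation can be illustrated with a small sketch. This is not the FLUX-CIM algorithm itself: the function names (`build_kb`, `score`, `label_blocks`), the toy sample records, the term-overlap score, and the splitting of a citation on commas and periods are all assumptions made for illustration only.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a string and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_kb(records):
    """Map each field name to a Counter of the terms seen in that field."""
    kb = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            kb[field].update(tokenize(value))
    return kb


def score(block, field_terms):
    """Fraction of the block's terms that occur in the field's KB entry."""
    terms = tokenize(block)
    if not terms:
        return 0.0
    return sum(1 for t in terms if t in field_terms) / len(terms)


def label_blocks(citation, kb):
    """Split a citation into blocks and label each with its best-scoring field.

    Splitting on commas and periods is a simplification chosen for this
    sketch (hypothetical, not taken from the paper).
    """
    blocks = [b.strip() for b in re.split(r"[.,]", citation) if b.strip()]
    labeled = []
    for block in blocks:
        best_field, best_score = "unknown", 0.0
        for field, terms in kb.items():
            s = score(block, terms)
            if s > best_score:
                best_field, best_score = field, s
        labeled.append((block, best_field))
    return labeled


# Hypothetical sample metadata records standing in for an existing collection.
sample_records = [
    {"author": "Alice Smith", "title": "Indexing web pages",
     "venue": "Digital Libraries Conference"},
    {"author": "Bob Jones", "title": "Mining data records",
     "venue": "Conference on Data Mining"},
]

kb = build_kb(sample_records)
result = label_blocks(
    "Smith Alice, Mining web data, Conference on Digital Libraries", kb
)
for block, field in result:
    print(f"{field}: {block}")
```

No training phase is involved here: the KB is just term counts per field, and a block that shares no terms with any field is labeled `unknown`. A practical system would need a more robust way to segment citations and to resolve blocks that match several fields equally well.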