Extracting structured subject information from digital document archives

Authors:
Jyi-Shane Liu;Ching-Ying Lee
Affiliations:
Department of Computer Science, National Chengchi University, Taiwan, R.O.C.;Department of English, National Taiwan Normal University, Taiwan, R.O.C.
Venue:
ICADL'06 Proceedings of the 9th International Conference on Asian Digital Libraries: Achievements, Challenges and Opportunities
Year:
2006

Abstract

Information extraction (IE) techniques can decode targeted subject information in documents and reduce text data to a set of structured core information. For digital libraries, this means that IE can potentially serve as an enabling tool that extends the value of digital document archives. We present an approach, called the sandwich extraction pattern, to address closely coupled template relation tasks. The approach provides interactive capabilities for task specification, domain knowledge acquisition, and output evaluation. This gives users (e.g., librarians) direct control over the design of value-added content products and the performance of IE tools. We validated the approach empirically by implementing an IE system, called SEP, and field testing it on a practical document archive. Encouraged by successful test runs, the NCCU library has formally initiated a project to develop a value-added content product of government personnel gazettes, comprising document images, electronic texts, and a personnel changes database.