Information extraction with automatic knowledge expansion

  • Authors:
  • Hanmin Jung;Eunji Yi;Dongseok Kim;Gary Geunbae Lee

  • Affiliations:
  • Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-gu, Pohang, Kyungbuk 790-784, South Korea;Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-gu, Pohang, Kyungbuk 790-784, South Korea;Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-gu, Pohang, Kyungbuk 790-784, South Korea;Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-gu, Pohang, Kyungbuk 790-784, South Korea

  • Venue:
  • Information Processing and Management: an International Journal
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

POSIE (POSTECH Information Extraction System) is an information extraction system which uses multiple learning strategies, i.e., SmL, user-oriented learning, and separate-context learning, in a question answering framework. POSIE replaces laborious annotation with automatic instance extraction by the SmL from structured Web documents, and places the user at the end of the user-oriented learning cycle. Information extraction as question answering simplifies the extraction procedures for a set of slots. We introduce the techniques verified on the question answering framework, such as domain knowledge and instance rules, into an information extraction problem. To incrementally improve extraction performance, a sequence of the user-oriented learning and the separate-context learning produces context rules and generalizes them in both the learning and extraction phases. Experiments on the "continuing education" domain initially show that the F1-measure becomes 0.477 and recall 0.748 with no user training. However, as the size of the training documents grows, the F1-measure reaches beyond 0.75 with recall 0.772. We also obtain F-measure of about 0.9 for five out of seven slots on "job offering" domain.