Automatic metadata mining from multilingual enterprise content

Authors:
Melike Şah;Vincent Wade
Affiliations:
Knowledge and Data Engineering Group, Trinity College Dublin, Ireland;Knowledge and Data Engineering Group, Trinity College Dublin, Ireland
Venue:
Web Semantics: Science, Services and Agents on the World Wide Web
Year:
2012

Abstract

Personalization is increasingly vital, especially for enterprises that need to reach their customers. The key challenge in supporting personalization is the need for rich metadata, such as metadata about structural relationships, subject/concept relations between documents, and cognitive metadata about documents (e.g. the difficulty of a document). Manual annotation of large knowledge bases with such rich metadata is not scalable. Moreover, automatic mining of cognitive metadata is challenging, since it is very difficult to automatically understand the intellectual knowledge underlying a document. At the same time, Web content is increasingly becoming multilingual, since a growing amount of the data generated on the Web is non-English. Current metadata extraction systems are generally based on English content, and this needs to change in order to adapt to the changing dynamics of the Web. To alleviate these problems, we introduce a novel automatic metadata extraction framework, which is based on a novel fuzzy-based method for automatic cognitive metadata generation and uses different document parsing algorithms to extract rich metadata from multilingual enterprise content using the newly developed DocBook, Resource Type and Topic ontologies. Since the metadata generation process is based upon DocBook-structured enterprise content, our framework is focused on enterprise documents and content that is loosely based on DocBook-style formatting. DocBook is a common documentation format for formally producing corporate data, and it is adopted by many enterprises. The proposed framework is illustrated and evaluated on English, German and French versions of the Symantec Norton 360 knowledge bases. The user study showed that the proposed fuzzy-based method generates reasonably accurate values, with an average precision of 89.39% on the metadata values of document difficulty, document interactivity level and document interactivity type.
The proposed fuzzy inference system achieves improved results compared to a rule-based reasoner for difficulty metadata extraction (~11% enhancement). In addition, user-perceived metadata quality scores were found to be high (mean of 5.57 out of 6), and automated metadata analysis showed that the extracted metadata is of high quality and can be suitable for personalized information retrieval.
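
To make the idea of fuzzy inference for a cognitive metadata value such as document difficulty more concrete, the following is a minimal sketch in Python. It is not the authors' system: the abstract does not specify the input features, membership functions or rules, so the two inputs (term_density and length_ratio, both scaled to [0, 1]), the membership shapes, the three rules and the output labels are all hypothetical, chosen only to illustrate the general technique. Defuzzification here uses a weighted average of fixed output values (a zero-order Sugeno-style scheme), which may differ from what the paper uses.

```python
def clamp01(x: float) -> float:
    """Restrict a value to the interval [0, 1]."""
    return max(0.0, min(1.0, x))


# Hypothetical membership functions over a feature scaled to [0, 1].
def low(x: float) -> float:
    return clamp01((0.5 - x) / 0.5)          # 1 at x=0, falls to 0 at x=0.5


def medium(x: float) -> float:
    return max(0.0, 1.0 - abs(x - 0.5) / 0.5)  # triangle peaking at x=0.5


def high(x: float) -> float:
    return clamp01((x - 0.5) / 0.5)          # 0 at x=0.5, rises to 1 at x=1


def infer_difficulty(term_density: float, length_ratio: float) -> float:
    """Return a difficulty score in [0, 1] from two assumed features."""
    d = clamp01(term_density)
    n = clamp01(length_ratio)

    # Illustrative rule base (an assumption, not taken from the paper):
    #   IF density is low  AND length is low    THEN difficulty is easy
    #   IF density is medium OR length is medium THEN difficulty is medium
    #   IF density is high OR length is high    THEN difficulty is difficult
    # AND is modelled with min, OR with max.
    easy = min(low(d), low(n))
    moderate = max(medium(d), medium(n))
    hard = max(high(d), high(n))

    total = easy + moderate + hard
    if total == 0.0:
        return 0.5  # no rule fired; fall back to the midpoint

    # Weighted average of fixed rule outputs (easy=0.0, medium=0.5, hard=1.0).
    return (0.0 * easy + 0.5 * moderate + 1.0 * hard) / total


def to_label(score: float) -> str:
    """Map the crisp score back to one of three hypothetical labels."""
    if score < 1.0 / 3.0:
        return "easy"
    if score < 2.0 / 3.0:
        return "medium"
    return "difficult"


if __name__ == "__main__":
    score = infer_difficulty(term_density=0.9, length_ratio=0.2)
    print(round(score, 3), to_label(score))
```

Unlike a crisp rule-based reasoner, which would switch labels abruptly at a threshold, each rule here fires to a degree, so the score changes gradually as the input features change.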