Person attribute extraction from the textual parts of web pages

Authors:
T. István Nagy
Affiliations:
University of Szeged, Department of Informatics, Hungary
Venue:
Acta Cybernetica
Year:
2012

Citing 12

Wrapper induction: efficiency and expressiveness
Artificial Intelligence - Special issue on Intelligent internet systems

Web mining research: a survey
ACM SIGKDD Explorations Newsletter

Mining the Web's Link Structure
Computer

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning

Editorial: special issue on web content mining
ACM SIGKDD Explorations Newsletter

Gossip Galore: a self-learning agent for exchanging pop trivia
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Demonstrations Session

The SemEval-2007 WePS evaluation: establishing a benchmark for the web people search task
SemEval '07 Proceedings of the 4th International Workshop on Semantic Evaluations

CU-COMSEM: exploring rich features for unsupervised web personal name disambiguation
SemEval '07 Proceedings of the 4th International Workshop on Semantic Evaluations

PSNUS: web people name disambiguation by simple clustering with rich features
SemEval '07 Proceedings of the 4th International Workshop on Semantic Evaluations

Unsupervised named-entity extraction from the Web: An experimental study
Artificial Intelligence

The WEKA data mining software: an update
ACM SIGKDD Explorations Newsletter

Domain adaptation of rule-based annotators for named-entity recognition tasks
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Abstract

We present a web mining system that clusters persons sharing the same name and extracts biographical information about them. The input of our system is the result of web search engine queries in English or in Hungarian. To evaluate the system on English, our system (RGAI) participated in the third Web People Search Task challenge [1]. Compared with other approaches, ours has three chief characteristics: we focus on the raw textual parts of web pages rather than the structured parts, we group similar attribute classes together, and we explicitly handle their interdependencies. The RGAI system achieved top results on the person attribute extraction subtask and average results on the person clustering subtask. Following the shared task's annotation principles, we also manually constructed a Hungarian person disambiguation corpus and adapted our system from English to Hungarian. We present experimental results on this corpus as well.