Structured entity identification and document categorization: two tasks with one joint model

Authors:
Indrajit Bhattacharya;Shantanu Godbole;Sachindra Joshi
Affiliations:
IBM India Research Lab, New Delhi, India;IBM India Research Lab, New Delhi, India;IBM India Research Lab, New Delhi, India
Venue:
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2008

Abstract

Traditionally, research on identifying structured entities in documents has proceeded independently of research on document categorization. In this paper, we observe that the two tasks have much to gain from each other. Apart from direct references to entities in a database, such as the names of person entities, documents often contain words that are correlated with discriminative entity attributes, such as the age group and income level of persons. This happens naturally in many enterprise domains, such as CRM and banking. Entity identification, which is typically vulnerable to noise and incompleteness in direct references to entities in documents, can therefore benefit from document categorization with respect to such attributes. In return, entity identification enables documents to be categorized according to different label sets arising from entity attributes, without requiring any supervision. We propose a probabilistic generative model for joint entity identification and document categorization, and show how the parameters of the model can be estimated in an unsupervised fashion using an EM algorithm. Using extensive experiments over real and semi-synthetic data, we demonstrate that the two tasks can benefit immensely from each other when performed jointly using the proposed model.
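To make the interplay described in the abstract concrete, the following is a minimal sketch of an EM loop of this general kind. It is not the model from the paper: the abstract does not give the model's details, so everything here is an assumption made for illustration. Each entity is reduced to a name and a single attribute category, name matching uses a crude character-bigram similarity, and words are drawn from a per-category multinomial. The names `name_similarity` and `joint_em` and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def name_similarity(mention, name):
    """Character-bigram Jaccard similarity in [0, 1] (an illustrative choice)."""
    def bigrams(s):
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}
    a, b = bigrams(mention), bigrams(name)
    return len(a & b) / len(a | b) if a | b else 0.0


def joint_em(entities, docs, n_iter=20, smoothing=0.1):
    """Toy joint entity identification and categorization (assumed model).

    entities: list of (name, category) rows from a database.
    docs: list of (mention, words) pairs, with a non-empty overall vocabulary.
    Returns (posteriors, word_probs): for each document a distribution over
    entities, and for each category a distribution over words.
    """
    categories = sorted({cat for _, cat in entities})
    vocab = sorted({w for _, words in docs for w in words})
    # Uniform start: the first E-step is driven by the name evidence alone.
    word_probs = {c: {w: 1.0 / len(vocab) for w in vocab} for c in categories}
    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior over entities, combining name and word evidence.
        posteriors = []
        for mention, words in docs:
            log_scores = []
            for name, cat in entities:
                score = math.log(name_similarity(mention, name) + 1e-3)
                score += sum(math.log(word_probs[cat][w]) for w in words)
                log_scores.append(score)
            top = max(log_scores)
            weights = [math.exp(s - top) for s in log_scores]
            total = sum(weights)
            posteriors.append([w / total for w in weights])
        # M-step: re-estimate per-category word distributions from soft counts.
        counts = {c: defaultdict(float) for c in categories}
        for (_, words), post in zip(docs, posteriors):
            for (_, cat), p in zip(entities, post):
                for w in words:
                    counts[cat][w] += p
        for c in categories:
            norm = sum(counts[c].values()) + smoothing * len(vocab)
            word_probs[c] = {w: (counts[c][w] + smoothing) / norm for w in vocab}
    return posteriors, word_probs


if __name__ == "__main__":
    # Hypothetical data: two distinct customers share the name "John Smith".
    entities = [("John Smith", "young"), ("John Smith", "senior"),
                ("Mary Jones", "young"), ("Ann Lee", "senior")]
    docs = [("Mary Jones", ["student", "loan"]),
            ("Ann Lee", ["pension", "retirement"]),
            ("J. Smith", ["pension", "retirement"])]
    posteriors, _ = joint_em(entities, docs)
    for (mention, _), post in zip(docs, posteriors):
        best = max(range(len(entities)), key=post.__getitem__)
        print(mention, "->", entities[best])
```

In this toy setting the mention "J. Smith" matches two database rows equally well by name. The words the document shares with the unambiguous "Ann Lee" document shift its posterior toward the row in the same category, and the category of the identified entity in turn serves as an unsupervised label for the document.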