Term-weighting approaches in automatic text retrieval
Information Processing and Management: an International Journal
On the Surprising Behavior of Distance Metrics in High Dimensional Spaces
ICDT '01 Proceedings of the 8th International Conference on Database Theory
Introduction to Information Retrieval
Introduction to Information Retrieval
Detecting groups of anomalously similar objects in large data sets
IDA'05 Proceedings of the 6th international conference on Advances in Intelligent Data Analysis
Hi-index | 0.00 |
We present a novel approach to discovering small groups of anomalously similar pieces of free text. The UK's National Reporting and Learning System (NRLS) contains free text and categorical variables describing several million patient safety incidents that have occurred in the National Health Service. The groups of interest represent previously unknown incident types. The task is particularly challenging because the free text descriptions are of random lengths, from very short to quite extensive, and include arbitrary abbreviations and misspellings, as well as technical medical terms. Incidents of the same type may also be described in various different ways. The aim of the analysis is to produce a global, numerical model of the text, such that the relative positions of the incidents in the model space reflect their meanings. A high dimensional vector space model of the text passages is produced; TF-IDF term weighting is applied, reflecting the differing importance of particular words to a description's meaning. The dimensionality of the model space is reduced, using principal component and linear discriminant analysis. The supervised analysis uses categorical variables from the NRLS, and allows incidents of similar meaning to be positioned close to one another in the model space. Anomaly detection tools are then used to find small groups of descriptions that are more similar than one would expect. The results are evaluated by having the groups assessed qualitatively by domain experts to see whether they are of substantive interest.