Persistent homology: an introduction and a new text representation for natural language processing

Authors: Xiaojin Zhu
Affiliations: Department of Computer Sciences, University of Wisconsin-Madison, Madison, Wisconsin
Venue: IJCAI'13 Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence
Year: 2013

Abstract

Persistent homology is a mathematical tool from topological data analysis. It performs multi-scale analysis on a set of points and identifies the clusters, holes, and voids therein. These topological structures complement standard feature representations, making persistent homology an attractive feature extractor for artificial intelligence. Research on persistent homology for AI is in its infancy and is currently hindered by two issues: the lack of an introduction accessible to AI researchers, and the paucity of applications. In response, the first part of this paper presents a tutorial on persistent homology aimed at a broad audience, without sacrificing mathematical rigor. The second part contains one of the first applications of persistent homology to natural language processing. Specifically, our Similarity Filtration with Time Skeleton (SIFTS) algorithm identifies holes that can be interpreted as semantic "tie-backs" in a text document, providing a new representation of document structure. We illustrate our algorithm on documents ranging from nursery rhymes to novels, and on a corpus of child and adolescent writing.
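
To make the idea of multi-scale analysis concrete, the following minimal Python sketch tracks only the simplest topological structure mentioned above, clusters (connected components), as a distance threshold grows. It is not the SIFTS algorithm and does not detect holes or voids. The function name `zero_dim_persistence` and the convention that two points connect once the threshold reaches their Euclidean distance are assumptions made for illustration.

```python
import math
from itertools import combinations


def zero_dim_persistence(points):
    """Return (birth, death) pairs for the clusters of a point set.

    Assumption: every point is its own cluster at threshold 0, and two
    points become connected once the threshold reaches their Euclidean
    distance. A cluster "dies" when it merges into another one.
    """
    n = len(points)
    if n == 0:
        return []

    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, processed from shortest to longest.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    pairs = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj          # two clusters merge at threshold d
            pairs.append((0.0, d))   # one of them dies here

    pairs.append((0.0, math.inf))    # the last cluster never dies
    return pairs
```

For example, `zero_dim_persistence([(0, 0), (1, 0), (10, 0)])` gives one short-lived cluster, one that survives until the far point joins, and one that persists forever. Long-lived pairs correspond to well-separated clusters; short-lived ones can be read as noise at that scale.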