Legal document clustering with built-in topic segmentation

Authors:
Qiang Lu; Jack G. Conrad; Khalid Al-Kofahi; William Keenan
Affiliations:
Thomson Reuters, Rochester, NY, USA; Thomson Reuters, Saint Paul, MN, USA; Thomson Reuters, Saint Paul, MN, USA; Thomson Reuters, Rochester, NY, USA
Venue:
Proceedings of the 20th ACM International Conference on Information and Knowledge Management
Year:
2011

Abstract

Clustering is a useful tool for helping users navigate, summarize, and organize large quantities of textual documents available on the Internet, in news sources, and in digital libraries. A variety of clustering methods have also been applied to the legal domain, with varying degrees of success. Some unique characteristics of legal content, as well as the nature of the legal domain, present a number of challenges. For example, legal documents are often multi-topical, contain carefully crafted, professional, domain-specific language, and possess a broad and unevenly distributed coverage of legal issues. Moreover, unlike widely accessible documents on the Internet, where search and categorization services are generally free, the legal profession is still largely a fee-for-service field, which makes quality (e.g., in terms of both recall and precision) a key differentiator among the services provided. This paper introduces a classification-based recursive soft clustering algorithm with built-in topic segmentation. The algorithm integrates existing legal document metadata, such as topical classifications, document citations, and click-stream data from user behavior databases, into a comprehensive clustering framework. Techniques associated with the algorithm have been applied successfully to very large databases of legal documents, which include judicial opinions, statutes, regulations, administrative materials, and analytical documents. Extensive evaluations were conducted to determine the efficiency and effectiveness of the proposed algorithm. Subsequent evaluations conducted by legal domain experts have demonstrated that the quality of the clusters produced by this algorithm is similar to that of clusters created by domain experts.