Legal document clustering with built-in topic segmentation

  • Authors:
  • Qiang Lu;Jack G. Conrad;Khalid Al-Kofahi;William Keenan

  • Affiliations:
  • Thomson Reuters, Rochester, NY, USA;Thomson Reuters, Saint Paul, MN, USA;Thomson Reuters, Saint Paul, MN, USA;Thomson Reuters, Rochester, NY, USA

  • Venue:
  • Proceedings of the 20th ACM international conference on Information and knowledge management
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Clustering is a useful tool for helping users navigate, summarize, and organize large quantities of textual documents available on the Internet, in news sources, and in digital libraries. A variety of clustering methods have also been applied to the legal domain, with various degrees of success. Some unique characteristics of legal content as well as the nature of the legal domain present a number of challenges. For example, legal documents are often multi-topical, contain carefully crafted, professional, domain-specific language, and possess a broad and unevenly distributed coverage of legal issues. Moreover, unlike widely accessible documents on the Internet, where search and categorization services are generally free, the legal profession is still largely a fee-for-service field that makes the quality (e.g., in terms of both recall and precision) a key differentiator of provided services. This paper introduces a classification-based recursive soft clustering algorithm with built-in topic segmentation. The algorithm leverages existing legal document metadata such as topical classifications, document citations, and click stream data from user behavior databases, into a comprehensive clustering framework. Techniques associated with the algorithm have been applied successfully to very large databases of legal documents, which include judicial opinions, statutes, regulations, administrative materials and analytical documents. Extensive evaluations were conducted to determine the efficiency and effectiveness of the proposed algorithm. Subsequent evaluations conducted by legal domain experts have demonstrated that the quality of the resulting clusters based upon this algorithm is similar to those created by domain experts.