Clustering and categorization of Brazilian Portuguese legal documents

  • Authors:
  • Luis Otávio de Colla Furquim; Vera Lúcia Strube de Lima

  • Affiliations:
  • Pontifícia Universidade Católica do Rio Grande do Sul, Porto Alegre, Brazil; Pontifícia Universidade Católica do Rio Grande do Sul, Porto Alegre, Brazil

  • Venue:
  • PROPOR'12 Proceedings of the 10th international conference on Computational Processing of the Portuguese Language
  • Year:
  • 2012

Abstract

This study explores the use of machine learning for case law search in electronic trials. We clustered case law documents to automatically generate classes for a categorizer. These classes are used when a user uploads new documents to an electronic trial. We selected the TClus algorithm, created by Aggarwal, Gates and Yu, removed its document and group discarding features, and added a cluster division feature. We introduced a new paradigm, "bag of terms and law references", in place of "bag of words": attributes are generated by using a law domain thesaurus to detect legal terms and regular expressions to detect law references. The clustering results on a case law corpus were evaluated with the Relative Hardness Measure (RH) and the ρ-Measure (RHO), and their significance was tested with both Wilcoxon's signed-ranks test and the count of wins and losses test. The categorization results were evaluated by human specialists. We compared true and false positives against document similarity with the centroid, cluster size, the quantity and type of attributes in the centroids, and cluster cohesion. The article also discusses attribute generation and its implications for the classification results.
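
The "bag of terms and law references" idea can be illustrated with a minimal sketch. It is not the authors' implementation: the three-entry term list stands in for a real law domain thesaurus, and the regular expression covers only one hypothetical citation shape ("Lei nº 8.078/90" and close variants). The function name `extract_attributes` and the `term:` / `law:` prefixes are likewise assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for a law domain thesaurus (assumption, not from the paper).
LEGAL_TERMS = ["habeas corpus", "dano moral", "recurso especial"]

# Hypothetical pattern for citations such as "Lei nº 8.078/90" or "Lei 8078/1990".
LAW_REF = re.compile(r"\blei\s+(?:n[ºo°]\.?\s*)?(\d[\d.]*)/(\d{2,4})", re.IGNORECASE)


def extract_attributes(text):
    """Return counts of legal terms and law references found in text."""
    lowered = text.lower()
    counts = Counter()

    # Legal terms: count whole-phrase occurrences of each thesaurus entry.
    for term in LEGAL_TERMS:
        hits = len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        if hits:
            counts["term:" + term] = hits

    # Law references: normalize "8.078" and "8078" to the same attribute.
    for number, year in LAW_REF.findall(text):
        counts["law:" + number.replace(".", "") + "/" + year] += 1

    return counts


sample = ("O recurso especial discute dano moral com base na "
          "Lei nº 8.078/90 e na Lei 8078/90.")
attributes = extract_attributes(sample)
```

Unlike a plain bag of words, the resulting vector ignores ordinary vocabulary and keeps only domain attributes, and the two differently written citations in the sample collapse into a single `law:8078/90` attribute.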