Clustering and categorization of Brazilian Portuguese legal documents

Authors:
Luis Otávio de Colla Furquim; Vera Lúcia Strube de Lima
Affiliations:
Pontifícia Universidade Católica do Rio Grande do Sul, Porto Alegre, Brazil (both authors)
Venue:
PROPOR'12 Proceedings of the 10th international conference on Computational Processing of the Portuguese Language
Year:
2012

References:

Text classification in a hierarchical mixture model for small training sets. Proceedings of the tenth international conference on Information and knowledge management.
Learning Belief Networks in the Presence of Missing Values and Hidden Variables. ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning.
Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal — The International Journal on Very Large Data Bases.
Combining clustering and co-training to enhance text classification using unlabelled data. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining.
CBC: Clustering Based Text Classification Requiring Minimal Labeled Data. ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining.
On Using Partial Supervision for Text Categorization. IEEE Transactions on Knowledge and Data Engineering.
Stemming and lemmatization in the clustering of Finnish text documents. Proceedings of the thirteenth ACM international conference on Information and knowledge management.
Is linguistic information relevant for the classification of legal texts? ICAIL '05 Proceedings of the 10th international conference on Artificial intelligence and law.
Hierarchically SVM classification based on support vector clustering method and its application to document categorization. Expert Systems with Applications: An International Journal.
Support cluster machine. Proceedings of the 24th international conference on Machine learning.
On the relative hardness of clustering corpora. TSD'07 Proceedings of the 10th international conference on Text, speech and dialogue.
Evaluation of internal validity measures in short-text corpora. CICLing'08 Proceedings of the 9th international conference on Computational linguistics and intelligent text processing.
Learning the dimensionality of hidden variables. UAI'01 Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence.
The Bayesian structural EM algorithm. UAI'98 Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence.

Abstract

This study explores the use of machine learning for case law search in electronic trials. We clustered case law documents to automatically generate classes for a categorizer; these classes are used when a user uploads new documents to an electronic trial. We selected the TClus algorithm, created by Aggarwal, Gates and Yu, removing its document and group discarding features and adding a cluster division feature. Instead of the "bag of words" representation, we introduced a new paradigm, "bag of terms and law references", generating attributes by using a law domain thesaurus to detect legal terms and regular expressions to detect law references. We clustered a case law corpus and evaluated the results with the Relative Hardness Measure (RH) and the ρ-Measure (RHO). The significance of the results was tested with both Wilcoxon's Signed-ranks Test and the Count of Wins and Losses Test. The categorization results were evaluated by human specialists. We compared true and false positives against document similarity with the centroid, cluster size, the quantity and type of attributes in the centroids, and cluster cohesion. The article also discusses attribute generation and its implications for the classification results.
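
As an illustration of the "bag of terms and law references" idea described in the abstract, the following is a minimal Python sketch, not the authors' implementation. The three-entry term list stands in for the law domain thesaurus, and the regular expression for law references (matching forms such as "Lei nº 8.078/90") is an assumption made for this example; the function name `extract_attributes` and the `term:`/`law:` attribute prefixes are likewise hypothetical.

```python
import re
from collections import Counter

# Hypothetical stand-in for the law domain thesaurus (the real one is not reproduced here).
LEGAL_TERMS = ["habeas corpus", "dano moral", "recurso especial"]

# Assumed pattern for law references such as "Lei 8.078/90" or "Lei nº 9.099/1995".
LAW_REF = re.compile(r"\blei\s+(?:n[ºo°]\.?\s*)?(\d[\d.]*)/(\d{2,4})", re.IGNORECASE)


def extract_attributes(text):
    """Return a Counter of legal-term and law-reference attributes for one document."""
    lowered = text.lower()
    counts = Counter()

    # Count each thesaurus term that occurs as a whole phrase.
    for term in LEGAL_TERMS:
        n = len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        if n:
            counts["term:" + term] = n

    # Normalize law references so "8.078/90" and "8078/90" map to the same attribute.
    for number, year in LAW_REF.findall(lowered):
        counts["law:" + number.replace(".", "") + "/" + year] += 1

    return counts


doc = "Recurso especial. Dano moral configurado, nos termos da Lei nº 8.078/90 e da Lei 8.078/90."
attrs = extract_attributes(doc)
```

Under these assumptions, each document is represented only by the legal terms and law references it contains, so ordinary words that a "bag of words" model would count are ignored.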