A genetic algorithm for segmentation and information retrieval of SEC regulatory filings

  • Authors:
  • Joshua Carroll; Thomas Y. Lee

  • Affiliations:
  • University of Pennsylvania, Philadelphia, PA; University of Pennsylvania, Philadelphia, PA

  • Venue:
  • dg.o '08 Proceedings of the 2008 international conference on Digital government research
  • Year:
  • 2008

Abstract

A principal mechanism by which the SEC fulfills its missions of investor protection and market efficiency is the widespread dissemination of the information that publicly traded firms submit for disclosure. The continuing evolution of reporting standards like the International Financial Reporting Standards (IFRS) and the global convergence on XBRL as a syntax for sharing data address the quantitative dimension of reporting. This work complements the ongoing research on financial disclosure by helping investors learn from the textual, narrative portions of the filing. Our objective is to automatically segment SEC 10-K financial regulatory filings to facilitate structured retrieval and querying. In structured retrieval, terms are differentially weighted based upon the document segments in which a term appears. We leverage the regulatory instructions provided by the SEC to identify a set of semantic labels such as "Legal Proceedings" or "Management's Discussion and Analysis" that segment a 10-K annual report. We frame the problem of document segmentation as a search for semantic labels and use a genetic algorithm to segment each filing. We evaluate the genetic algorithm on a test set of 112 randomly selected regulatory filings and compare those results to a simple, greedy approach for information extraction and segmentation.
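
The notion of structured retrieval described in the abstract, in which a term's weight depends on the segment it appears in, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `segment_weighted_score`, the numeric weights, the whitespace tokenization, and the sample filing text are all assumptions made for illustration. Only the two segment labels are taken from the abstract.

```python
from collections import Counter

# Hypothetical per-segment weights. The labels appear in the abstract;
# the numeric values are illustrative assumptions, not from the paper.
SEGMENT_WEIGHTS = {
    "Legal Proceedings": 2.0,
    "Management's Discussion and Analysis": 3.0,
}
DEFAULT_WEIGHT = 1.0  # assumed weight for any segment not listed above


def segment_weighted_score(query_terms, segments, weights=SEGMENT_WEIGHTS):
    """Sum query-term frequencies, scaling each count by its segment's weight.

    `segments` maps a semantic label to the text of that segment.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    score = 0.0
    for label, text in segments.items():
        counts = Counter(text.lower().split())
        weight = weights.get(label, DEFAULT_WEIGHT)
        for term in query_terms:
            score += weight * counts[term.lower()]
    return score


# Made-up filing text, already split into labeled segments.
filing = {
    "Legal Proceedings": "the company is party to pending litigation",
    "Management's Discussion and Analysis": "litigation costs reduced operating income",
    "Other": "litigation",
}

print(segment_weighted_score(["litigation"], filing))  # prints 6.0
```

Here the same term counts three times as much in "Management's Discussion and Analysis" as in an unlabeled segment, which is why accurate segmentation has to come before this kind of weighting.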