A genetic algorithm for segmentation and information retrieval of SEC regulatory filings

Authors:
Joshua Carroll; Thomas Y. Lee
Affiliations:
University of Pennsylvania, Philadelphia, PA; University of Pennsylvania, Philadelphia, PA
Venue:
dg.o '08: Proceedings of the 2008 International Conference on Digital Government Research
Year:
2008

Abstract

A principal mechanism by which the U.S. Securities and Exchange Commission (SEC) fulfills its missions of investor protection and market efficiency is the widespread dissemination of the information that publicly traded firms submit for disclosure. Evolving reporting standards such as the International Financial Reporting Standards (IFRS), together with the global convergence on XBRL as a syntax for sharing data, address the quantitative dimension of reporting. This work complements ongoing research on financial disclosure by helping investors learn from the textual, narrative portions of filings. Our objective is to automatically segment SEC 10-K financial regulatory filings to facilitate structured retrieval and querying. In structured retrieval, each term is weighted according to the document segment in which it appears. We use the regulatory instructions provided by the SEC to identify a set of semantic labels, such as "Legal Proceedings" or "Management's Discussion and Analysis", that segment a 10-K annual report. We frame document segmentation as a search for these semantic labels and use a genetic algorithm to segment each filing. We evaluate the genetic algorithm on a test set of 112 randomly selected regulatory filings and compare its results to those of a simple, greedy approach to information extraction and segmentation.
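
To make the framing concrete, the following is a minimal sketch of segmentation as a search over label positions, with a first-match greedy baseline for comparison. It is not the method evaluated in the paper: the toy filing in `LINES`, the label list, the `fitness` scoring, and every GA parameter (population size, mutation rate, tournament size, elitism, seeding the population with the greedy answer) are assumptions chosen only for illustration.

```python
import random

# Hypothetical section labels, in the order they are expected to appear.
LABELS = ["Business", "Legal Proceedings", "Management's Discussion and Analysis"]

# Hypothetical toy filing, one string per line. The table of contents repeats
# every label, which is what misleads a first-match (greedy) segmenter.
LINES = [
    "Table of Contents",
    "Business",
    "Legal Proceedings",
    "Management's Discussion and Analysis",
    "Item 1. Business",
    "We manufacture industrial widgets.",
    "Our customers are mostly utilities.",
    "Item 3. Legal Proceedings",
    "We are party to routine litigation.",
    "Item 7. Management's Discussion and Analysis",
    "Revenue grew during the fiscal year.",
    "Margins were stable.",
]


def mentions_label(line):
    """True if the line contains any label text (case-insensitive)."""
    return any(label.lower() in line.lower() for label in LABELS)


def fitness(chrom):
    """Score one candidate: a tuple with one start line per label (assumed scoring)."""
    if any(a >= b for a, b in zip(chrom, chrom[1:])):
        return float("-inf")  # labels must appear in their expected order
    ends = list(chrom[1:]) + [len(LINES)]
    score = 0
    for label, start, end in zip(LABELS, chrom, ends):
        body = LINES[start + 1:end]
        if label.lower() in LINES[start].lower():
            score += 1  # the chosen line looks like this label's heading
        if body:
            score += 1  # the segment has some text under its heading
        score -= sum(mentions_label(line) for line in body)  # stray headings in body
    return score


def greedy():
    """Baseline: take the first line that mentions each label."""
    return tuple(
        next(i for i, line in enumerate(LINES) if label.lower() in line.lower())
        for label in LABELS
    )


def evolve(generations=60, pop_size=30, mutation_rate=0.2, seed=0):
    """Small genetic algorithm over tuples of line indices."""
    rng = random.Random(seed)
    n, k = len(LINES), len(LABELS)
    # Start from the greedy answer plus random candidates.
    pop = [greedy()] + [
        tuple(rng.randrange(n) for _ in range(k)) for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        children = [max(pop, key=fitness)]  # elitism: the best candidate survives
        while len(children) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, k)  # one-point crossover
            child = list(p1[:cut] + p2[cut:])
            for g in range(k):  # per-gene mutation
                if rng.random() < mutation_rate:
                    child[g] = rng.randrange(n)
            children.append(tuple(child))
        pop = children
    return max(pop, key=fitness)


if __name__ == "__main__":
    baseline, best = greedy(), evolve()
    print("greedy:", baseline, "fitness", fitness(baseline))
    print("GA:    ", best, "fitness", fitness(best))
```

On this toy input the greedy baseline locks onto the table-of-contents entries, `(1, 2, 3)`, which score 1 under the assumed fitness, while the body headings `(4, 7, 9)` score 6. Because the population is seeded with the greedy answer and the best candidate always survives, the search can never return a result scoring below the baseline. Whether it reaches the best-scoring segmentation depends on the random seed and parameters.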