A principal mechanism by which the SEC fulfills its missions of investor protection and market efficiency is the widespread dissemination of the information that publicly traded firms submit for disclosure. The continuing evolution of reporting standards such as the International Financial Reporting Standards (IFRS) and the global convergence on XBRL as a syntax for sharing data address the quantitative dimension of reporting. This work complements ongoing research on financial disclosure by helping investors learn from the textual, narrative portions of a filing. Our objective is to automatically segment SEC 10-K financial regulatory filings to facilitate structured retrieval and querying. In structured retrieval, a term is weighted differently depending on the document segment in which it appears. We leverage the regulatory instructions provided by the SEC to identify a set of semantic labels, such as "Legal Proceedings" or "Management's Discussion and Analysis", that segment a 10-K annual report. We frame document segmentation as a search for these semantic labels and use a genetic algorithm to segment each filing. We evaluate the genetic algorithm on a test set of 112 randomly selected regulatory filings and compare its results with those of a simple, greedy approach to information extraction and segmentation.
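To make the framing of segmentation as a search for semantic labels concrete, the following is a minimal sketch of how a genetic algorithm might look for the line at which each labeled segment begins. The abstract does not describe the encoding, fitness function, or genetic operators actually used, so everything here is an assumption for illustration: the chromosome (one start-line index per label), the scoring in `fitness`, the tournament selection, the one-point crossover, and the extra "Business" label are all hypothetical.

```python
import random

# Hypothetical label set; the abstract names only the last two as examples.
LABELS = ["Business", "Legal Proceedings", "Management's Discussion and Analysis"]


def fitness(starts, lines, labels):
    """Score a candidate segmentation (assumed scoring, not the paper's)."""
    score = 0
    for label, idx in zip(labels, starts):
        if label.lower() in lines[idx].lower():
            score += 2  # the chosen start line mentions the label text
    # Reward adjacent labels whose start lines appear in the expected order.
    score += sum(1 for a, b in zip(starts, starts[1:]) if a < b)
    return score


def segment(lines, labels, pop_size=40, generations=60, mutation_rate=0.2, seed=0):
    """Evolve a list of start-line indices, one per label."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(lines)

    def key(candidate):
        return fitness(candidate, lines, labels)

    # Initial population: random start lines for every label.
    population = [[rng.randrange(n) for _ in labels] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=key, reverse=True)
        next_gen = [population[0][:]]  # elitism: carry the best candidate over
        while len(next_gen) < pop_size:
            # Tournament selection: best of three random candidates, twice.
            p1 = max(rng.sample(population, 3), key=key)
            p2 = max(rng.sample(population, 3), key=key)
            # One-point crossover between the two parents.
            cut = rng.randrange(1, len(labels)) if len(labels) > 1 else 0
            child = p1[:cut] + p2[cut:]
            # Mutation: occasionally move one start to a random line.
            if rng.random() < mutation_rate:
                child[rng.randrange(len(child))] = rng.randrange(n)
            next_gen.append(child)
        population = next_gen
    return max(population, key=key)
```

Calling `segment(lines, LABELS)` on a filing split into lines returns one candidate start line per label. A greedy baseline in the same setting might simply take the first line that mentions each label; a real 10-K repeats label text in places such as the table of contents, which is one reason a search over whole segmentations could be preferable to matching each label independently.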