Test collection management and labeling system

Authors:
Eunyee Koh; Andruid Kerne; Sarah Berry
Affiliations:
Adobe Systems Inc., San Jose, CA, USA; Texas A&M University, College Station, TX, USA; Texas A&M University, College Station, TX, USA
Venue:
Proceedings of the 9th ACM symposium on Document engineering
Year:
2009

Abstract

Evaluating the performance of information retrieval and extraction algorithms requires test collections. A test collection consists of a set of documents, a clearly formulated problem that an algorithm is supposed to solve, and the answers that the algorithm should produce when executed on the documents. Defining the association between elements in the test collection and answers is known as labeling. For mainstream information retrieval problems, there are publicly available test collections that have been maintained for years. However, the scope of these problems, and thus of the associated test collections, is limited. In other cases, researchers need to build, label, and manage their own test collections, which can be a tedious and error-prone task. We were building test collections of HTML documents for problems in which the answer that the algorithm supplies is a sub-tree of the DOM (Document Object Model). To lighten the burden of this task, we developed a test collection management and labeling system (TCMLS) that eases the process of building test collections, applying them to validate algorithms, and potentially sharing them across the research community.
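
The structure described in the abstract (documents, a problem statement, and labeled answers that are DOM sub-trees) can be illustrated with a minimal Python sketch. This is not the TCMLS implementation, which the abstract does not detail. The names `LabeledCollection`, `resolve`, and `accuracy` are hypothetical, a label is assumed to be a path of child indexes from the document root to the answer sub-tree, and the markup is assumed to be well-formed so that the standard-library XML parser can read it (real-world HTML would need a more tolerant parser).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple
import xml.etree.ElementTree as ET

# Hypothetical label format: child indexes leading from the document
# root down to the root of the answer sub-tree.
Path = Tuple[int, ...]


@dataclass
class LabeledCollection:
    """A test collection: a problem, documents, and one labeled answer per document."""
    problem: str
    documents: Dict[str, str] = field(default_factory=dict)  # doc id -> markup
    labels: Dict[str, Path] = field(default_factory=dict)    # doc id -> answer path

    def add(self, doc_id: str, markup: str, answer_path: Path) -> None:
        """Store a document together with its label."""
        self.documents[doc_id] = markup
        self.labels[doc_id] = tuple(answer_path)


def resolve(root: ET.Element, path: Path) -> ET.Element:
    """Follow a path of child indexes to the labeled sub-tree root."""
    node = root
    for index in path:
        node = node[index]
    return node


def accuracy(collection: LabeledCollection,
             algorithm: Callable[[ET.Element], ET.Element]) -> float:
    """Fraction of documents for which the algorithm returns the labeled sub-tree."""
    if not collection.documents:
        return 0.0
    correct = 0
    for doc_id, markup in collection.documents.items():
        root = ET.fromstring(markup)
        expected = resolve(root, collection.labels[doc_id])
        if algorithm(root) is expected:
            correct += 1
    return correct / len(collection.documents)


# Usage: label the second <div> under <body> as the answer, then score a toy algorithm.
collection = LabeledCollection(problem="find the main content block")
collection.add(
    "d1",
    "<html><body><div id='nav'/><div id='main'><p>text</p></div></body></html>",
    (0, 1),
)


def pick_last_div(root: ET.Element) -> ET.Element:
    """Toy algorithm: guess that the last <div> in document order is the answer."""
    return root.findall(".//div")[-1]


print(accuracy(collection, pick_last_div))
```

Storing a label as an index path keeps it independent of any in-memory parse, so the same collection could in principle be saved, shared, and re-applied to other algorithms. That design choice is an assumption of this sketch, not a claim about how TCMLS stores labels.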