When big data leads to lost data

  • Authors:
  • V. M. Megler; David Maier

  • Affiliations:
  • Portland State University, Portland, OR, USA (both authors)

  • Venue:
  • Proceedings of the 5th Ph.D. workshop on Information and knowledge

  • Year:
  • 2012

Abstract

For decades, scientists bemoaned the scarcity of observational data to analyze and against which to test their models. Exponential growth in data volumes from ever-cheaper environmental sensors has provided scientists with the answer to their prayers: "big data". Now scientists face a new challenge: with terabytes, petabytes or exabytes of data at hand, stored in thousands of heterogeneous datasets, how can they find the datasets most relevant to their research interests? If they cannot find the data, they may as well never have collected it; that data is lost to them. Our research addresses this challenge, using an existing scientific archive as our test-bed. We approach the problem in a new way: by adapting Information Retrieval techniques, developed for searching text documents, to the world of (primarily numeric) scientific data. We propose an approach that uses a blend of automated and "semi-curated" methods to extract metadata from large archives of scientific data. We then perform searches over the extracted metadata, returning results ranked by similarity to the query terms. We briefly describe an implementation at an ocean observatory used to validate the proposed approach. Finally, we propose performance and scalability research to explore how interactive response can be maintained as the archive continues to grow.
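
The abstract outlines the core mechanism: extract compact metadata summaries from each dataset, then rank datasets by their similarity to the query terms, in the spirit of document ranking in Information Retrieval. The sketch below illustrates one way such ranked search over extracted metadata could work; the metadata fields (time range, bounding box, variable names), the weights, and the scoring formula are illustrative assumptions, not the ranking function described in the paper.

```python
"""Minimal sketch: rank dataset metadata summaries against a query.

The fields and scoring blend are assumptions for illustration only.
"""
from dataclasses import dataclass


@dataclass
class DatasetSummary:
    """Metadata extracted from one dataset (hypothetical schema)."""
    name: str
    t_start: float       # observation start (e.g., days since some epoch)
    t_end: float         # observation end
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float
    variables: frozenset  # variable names observed, e.g. {"salinity"}


@dataclass
class Query:
    """Region, time window, and variables a scientist is looking for."""
    t_start: float
    t_end: float
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float
    variables: frozenset


def _interval_overlap(a_lo, a_hi, b_lo, b_hi):
    """Fraction of the query interval [b_lo, b_hi] covered by [a_lo, a_hi]."""
    if b_hi <= b_lo:
        return 0.0
    overlap = max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))
    return overlap / (b_hi - b_lo)


def similarity(ds: DatasetSummary, q: Query, weights=(0.4, 0.4, 0.2)):
    """Blend time, space, and variable similarity into one score in [0, 1]."""
    time_sim = _interval_overlap(ds.t_start, ds.t_end, q.t_start, q.t_end)
    space_sim = (_interval_overlap(ds.lat_min, ds.lat_max, q.lat_min, q.lat_max)
                 * _interval_overlap(ds.lon_min, ds.lon_max, q.lon_min, q.lon_max))
    var_sim = (len(ds.variables & q.variables) / len(q.variables)
               if q.variables else 0.0)
    w_t, w_s, w_v = weights
    return w_t * time_sim + w_s * space_sim + w_v * var_sim


def rank(datasets, query, top_k=10):
    """Return the top_k (score, summary) pairs, best match first."""
    scored = [(similarity(ds, query), ds) for ds in datasets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    archive = [
        DatasetSummary("cruise_2009_salinity", 100, 130, 46.0, 46.3,
                       -124.1, -123.6, frozenset({"salinity", "temperature"})),
        DatasetSummary("mooring_2010_temp", 400, 760, 46.1, 46.2,
                       -123.9, -123.8, frozenset({"temperature"})),
    ]
    q = Query(90, 140, 46.0, 46.2, -124.0, -123.7, frozenset({"salinity"}))
    for score, ds in rank(archive, q):
        print(f"{score:.3f}  {ds.name}")
```

Run as a script, this prints the hypothetical archive's summaries ordered by score. In the approach the abstract describes, the summaries would come from the automated and "semi-curated" extraction step and be held in an index sized to keep query response interactive as the archive grows.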