Athena: text mining based discovery of scientific workflows in disperse repositories

Authors:
Flavio Costa;Daniel de Oliveira;Eduardo Ogasawara;Alexandre A. B. Lima;Marta Mattoso
Affiliations:
COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil;COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil;COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil;COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil;COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
Venue:
RED'10 Proceedings of the Third international conference on Resource Discovery
Year:
2010

Abstract

Scientific workflows are abstractions used to model and execute in silico scientific experiments. They are key resources for scientists and are enacted and managed by engines called Scientific Workflow Management Systems (SWfMS). Each SWfMS has its own workflow language. This heterogeneity of languages and formats makes it difficult for scientists to search for or discover workflows in distributed repositories for reuse. The existing workflows in these repositories can be used to identify and construct families of workflows (clusters) that aim at a particular goal. However, it is hard to compare the structure of these workflows since they are modeled in different formats. An alternative is to compare workflow metadata, such as the natural language descriptions usually found in workflow repositories, instead of comparing workflow structure. In this scenario, we expect that classical text mining techniques can cluster a set of workflows into families, giving scientists the possibility of finding and reusing existing workflows, which may decrease the complexity of modeling a new experiment. This paper presents Athena, a cloud-based approach to support workflow clustering from disperse repositories using their natural language descriptions, thus integrating these repositories and making it easier to search for and reuse workflows.
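
The abstract only states that "classical text mining techniques" are applied to natural language descriptions; it does not say which ones. As an illustration of the general idea, and not of Athena's actual pipeline, the following minimal sketch assumes TF-IDF weighting, cosine similarity, and a simple threshold-based single-link grouping. The stop-word list, the threshold value, and the four workflow descriptions are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "for", "in", "from", "with", "on"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many descriptions each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1  # guard against empty descriptions
        # Smoothed IDF keeps every weight positive.
        vectors.append({
            t: (c / total) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster(docs, threshold=0.2):
    """Group documents whose pairwise similarity reaches the threshold.

    Uses union-find, so families are the connected components of the
    similarity graph (single-link grouping). The threshold is an assumption.
    """
    vectors = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical workflow descriptions, as they might appear in two repositories.
descriptions = [
    "Align protein sequences and build a phylogenetic tree",
    "Build phylogenetic tree from aligned protein sequences",
    "Render volume visualization of ocean temperature data",
    "Visualization of ocean temperature simulation data",
]

families = cluster(descriptions)
print(families)  # [[0, 1], [2, 3]]
```

Because the grouping relies only on the descriptions, workflows written in different SWfMS languages can land in the same family without any comparison of their structure, which is the property the abstract argues for.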