Related Work

Data mining and data analysis libraries

  • The Massive Online Analysis (MOA) library is an extension of the popular WEKA library designed to work efficiently with large-scale data streams. MOA algorithms are particularly well-suited to ADMIRE's distributed data-intensive architecture.
  • WEKA is a collection of machine learning algorithms for data mining tasks. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
  • R is an extremely popular free software environment for statistical computing and graphics.

Data workflow engines

  • OGSA-DAI, the foundation of the ADMIRE software platform, is an extensible data workflow engine that supports streaming, distributed queries, and a wide range of ETL methods for managing large, complex datasets.
  • Meandre is the workflow engine at the heart of the Software Environment for the Advancement of Scholarly Research (SEASR) project, targeted at supporting data manipulation in the humanities, arts, and social science communities.

Other workflow engines

  • Taverna is an open source and domain independent workflow management system – a suite of tools used to design and execute scientific workflows and aid in silico experimentation.
  • Kepler is a free and open source scientific workflow application designed to help scientists, analysts, and computer programmers create, execute, and share models and analyses across a broad range of scientific and engineering disciplines.

Data analysis architectures

  • Falkon is a lightweight task execution framework designed to support data diffusion. Falkon acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data.
  • PerfExplorer is a framework for parallel performance data mining and knowledge discovery. PerfExplorer enables the development and integration of data mining operations that will be applied to large-scale parallel performance profiles.
  • Sector/Sphere supports distributed data storage, distribution, and processing over large clusters of commodity computers, either within a data center or across multiple data centers. Sector is a high performance, scalable, and secure distributed file system. Sphere is a high performance parallel data processing engine that can process Sector data files on the storage nodes with very simple programming interfaces.
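
The idea shared by Falkon and Sector/Sphere, scheduling computation close to the data and replicating data in response to demand, can be illustrated with a minimal sketch. This is not the API of either system: the function `schedule`, its parameters, and the node and file names are hypothetical, and the least-loaded placement policy is an assumption chosen only to keep the example short.

```python
from collections import defaultdict


def schedule(tasks, replicas, nodes):
    """Place each task on the least-loaded node that holds its input file.

    tasks:    list of (task_id, file_name) pairs (hypothetical format)
    replicas: dict mapping file_name -> set of node names holding a copy
    nodes:    list of all node names
    """
    load = defaultdict(int)   # number of tasks assigned to each node so far
    placement = {}
    for task_id, file_name in tasks:
        # Prefer nodes that already store the input file.
        local = [n for n in nodes if n in replicas.get(file_name, set())]
        # Fall back to every node when no replica exists yet.
        candidates = local or nodes
        # min() returns the first minimum, so ties follow the order of `nodes`.
        target = min(candidates, key=lambda n: load[n])
        if not local:
            # Demand-driven replication: the chosen node now holds a copy.
            replicas.setdefault(file_name, set()).add(target)
        load[target] += 1
        placement[task_id] = target
    return placement


nodes = ["n1", "n2", "n3"]
replicas = {"a.dat": {"n1"}, "b.dat": {"n2", "n3"}}
tasks = [("t1", "a.dat"), ("t2", "b.dat"), ("t3", "b.dat"), ("t4", "c.dat")]
placement = schedule(tasks, replicas, nodes)
```

In this sketch, tasks that read `b.dat` are spread across the two nodes holding it, while the task reading the previously unseen `c.dat` is placed on any node, which then records a replica for later tasks. Real systems also weigh transfer cost, cache eviction, and dynamic resource acquisition, none of which is modeled here.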