Data-Intensive Distributed Computing

Image courtesy The Planet/flickr

Data-intensive science is the fourth paradigm of scientific discovery. Data-intensive distributed computing (DIDC) is a marriage of data-intensive scientific methods with distributed computing in an attempt to define and build computational infrastructure that can support the increasing volumes of digital data innundating today's researchers.

The ADMIRE architecture

The ADMIRE architecture envisages a wide range of tools coupled to a powerful, extensible enactment framework through a controlled, canonical interaction point called a Gateway.

Domain experts and data analysis experts work together at the tools level to devise ways to extract business information from data. Data analysis experts use the DISPEL language to develop canonical workflows and the necessary algorithmic enactors or processing elements that can be chained together to implement the required solution. DIDC experts implement and support the machinery of the Gateway and its execution environment.

The Gateway

ADMIRE's architecture implements the separation of concerns using a conceptual hourglass.

The top bulb is a creative domain where concepts are dynamic and ever changing and therefore the humans in this space need support from tools.

In the neck of the hourglass, creativity should not be present. We need a stable means of discourse between the analysts, developers and the enactment platform. For data-intensive computing, the DISPEL language acts as the stable means of discourse.

In the bottom bulb we again need creativity. The complexities of mapping a DISPEL sentence onto distributed, physical computational resources an data sources requires many type of expertise and signi´Čücant levels of automation, such as optimization.

The DISPEL language

The DISPEL language provides the core power of the ADMIRE approach, allowing data analysis experts to describe complex data-intensive workflows in a stable, canonical way which allows for change both in the tools used to create them and in the platforms that enact them.

A DISPEL primer

DISPEL is a scripting language which is processed by a DISPEL parser to generate data-flow graphs in the form of executable workflows. The primary function of DISPEL is to express how a data-mining application uses several data-mining activities (computational processes, for instance, services that provide noise filtering algorithms), and how each of these activities communicate with one another. In other words, DISPEL is a language for expressing a directed graph, where processing elements represent the computational nodes and the flow of data between processing elements are represented as connections. Thus, DISPEL provides an abstraction technique for streaming-data execution model. At the lower level, DISPEL also handles validation, and provides the required model for carrying out workflow optimisations. We shall discuss this shortly, but first, an overview of the language components.

DISPEL uses a notation similar to Java. The following is a representative example:

package eu.admire {
  Type SQLtoTupleList is PE(
    <Connection: String:: SQLqueryStatement expression> =>
    <Connection: [<rest>]:: RelationalResult data>
  SQLtoTupleList lockSQLdataSource(String dataSource) {
    SQLquery sqlq = new SQLquery;
    |- repeat dataSource enough -|:String::RDBuri => sqlq.resource;
    return PE( <Connection expression = sqlq.expression> =>
      <Connection data => );
  register lockSQLdataSource;

package eu.admire {
  use eu.admire.lockSQLdataSource;
  PE(<Connection: String:: SQLqueryStatement expression> =>
    <Connection: [<rest>]:: RelationalResult data> )
  SQLonA = lockSQLdataSource("");
  PE(<Connection: String:: SQLqueryStatement expression> =>
    <Connection: [<rest>]:: RelationalResult data> )
  SQLonB = lockSQLdataSource("");
  register SQLonA, SQLonB;

In the example above, we are registering reusable components (e.g., lockSQLdataSource) with the registry, which are then reused to register derived reusable components (e.g., SQLonA and SQLonB). These components are finally used in executable applications, as shown in the following example:

package eu.admire {
  use eu.admire.Converter;
  use eu.admire.SQLonA;
  use eu.admire.TuplesToObservationTrack;
  use eu.admire.Results;   
  SQLonA sqlona = new SQLonA;
  String q1 = "SELECT * FROM AtlanticSurveys" +
    " WHERE before '2005'" +
    " AND after '2000'" +
    " AND AtlanticSurveys.latitude >= 0";
  |- q1 -| => sqlona.expression;
  Converter c = new TuplesToObservationTrack; => c.input;
  Results results = new Results;
  |-"North Atlantic surveys 2000 to 2005"-| =>;
  c.output => results.input;
  submit results;

In the above example, we reuse SQLonA by importing them from the registry. We then specialise this component to fit the survey processing requirement. This is then submitted to the resources for eventual execution.

A diagrammatic representation of the above example is shown here:

Type system

Compared to other data-flow expression languages, DISPEL introduces a sophisticated type system for validation and optimisation. Using this type system, we are not only capable of checking the validity of a DISPEL sentence (e.g., incorrect syntax), but also validate the connection between processing elements (e.g., incorrect connections where the type of output data produced by a source processing element do not match the type of input data expected by a destination processing element). Furthermore, this type system exposes the lower-level structure of the streaming-data that are being communicated through valid connections, so that workflow optimisation algorithms implemented at the enactment gateways can re-factor and reorganise processing elements and their interconnections to improve performance.

The DISPEL language uses three type systems to validate the abstraction, compilation, and enactment of DISPEL scripts. The nomenclature for these type systems are:

  • The language type system statically validates at compile-time if the operations in a DISPEL sentence are properly typed. For instance, the language type checker will check if the control variables in a loop iterator, or indexes of an array, are of the correct type, and that the parameters supplied to a function invocation match the type of the formal parameters as specified in the function's definition.
  • The structural type system describes the format and low-level (automated) interpretation of values that are being transmitted along a connection between two processing element instances. For example, the structural type system will check if the data flowing through a connection is a sequence of lists of tuples, etc.
  • The domain type system describes how application-domain experts interpret, or abstract, the data that is being transmitted along a connection in relation to the application at hand. For instance, the domain type system will describe if the data flowing through a connection is a sequence of arial or satellite image stripes, each stripe being represented as a list of, say satellite images.

A DISPEL manual

The DISPEL Manual covers the features and use of the the language in greater detail.