ADMIRE's Vision

Context

ADMIRE is pioneering architectures and models that will deliver a coherent, extensible and flexible framework to facilitate much better use of a wide range of heterogeneous distributed data resources. ADMIRE will deliver an integrated approach to enable fluent exploration and exploitation of data. That integration encompasses the full path of actions from accessing the source data to delivering derived results. That path includes gaining permission to access data, extracting selected subsets of data, cleaning and transforming data, performing analyses and preparing results in the form required by their destination. Because the multiple sources of data encode and organise information in different ways, a significant element is data integration, which includes conversions at both syntactic and semantic levels in order to enable data composition.
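The path described above can be illustrated with a minimal sketch. It is not ADMIRE code: the two in-memory sources, the field names and every function name are assumptions made up for illustration, and the "semantic" conversion is reduced to a unit change from Fahrenheit to Celsius.

```python
from statistics import mean

# Hypothetical sources that encode the same information in different ways.
SOURCE_A = [{"site": "s1", "temp_c": 21.5}, {"site": "s2", "temp_c": None}]
SOURCE_B = [{"station": "s3", "temp_f": 68.0}]


def access(source, permitted=True):
    """Gain permission to read a source, then return its raw records."""
    if not permitted:
        raise PermissionError("access to source denied")
    return list(source)


def extract(records, key):
    """Select the subset of records that carry the field of interest."""
    return [r for r in records if key in r]


def clean(records, key):
    """Drop records whose value for the field is missing."""
    return [r for r in records if r[key] is not None]


def integrate(records_a, records_b):
    """Convert both encodings to one schema: site name and Celsius value."""
    unified = [{"site": r["site"], "temp_c": r["temp_c"]} for r in records_a]
    unified += [
        {"site": r["station"], "temp_c": (r["temp_f"] - 32) * 5 / 9}
        for r in records_b
    ]
    return unified


def analyse(records):
    """A stand-in analysis: the mean of the integrated values."""
    return mean(r["temp_c"] for r in records)


def deliver(value):
    """Prepare the result in the form its destination requires."""
    return f"mean temperature: {value:.2f} C"


def run_dmi_process():
    """Traverse the full path from source access to delivered result."""
    a = clean(extract(access(SOURCE_A), "temp_c"), "temp_c")
    b = clean(extract(access(SOURCE_B), "temp_f"), "temp_f")
    return deliver(analyse(integrate(a, b)))


print(run_dmi_process())
```

Each stage is a separate function so that a stage can be replaced or relocated without touching the others, which is the property the integrated path relies on.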

ADMIRE uses the term Data Mining and Integration (DMI) to refer to all aspects of this full path of exploration and exploitation. It refers to traversals of such paths, by users, experts and computer systems, as DMI processes. ADMIRE will support, model and automate these DMI processes and present the newly coherent functionality through two views. Firstly, tailored portals accessing specialised DMI tools for domain experts will improve the accessibility and exploitation of data in specific domains. Secondly, an integrated set of tools presented as a workbench for data mining and integration experts (DMI experts) will improve their productivity and accelerate the rate at which they are able to deploy new methods and applications, and integrate these into tailored portals for the domain experts. ADMIRE will test the hypothesis that this separation of two coupled views of DMI processes will be feasible and effective. This in turn depends on improving the data-aware distributed computing platforms that support the work of the two communities of experts and the enactment of DMI processes. Therefore, ADMIRE engages with the requirements of three communities: the two communities of experts and the engineers who build data-aware computing platforms.


A data-rich environment

Today's data-rich environment, with a growing commitment to the effective exploitation of data, leads to ADMIRE's vision that future DMI architectures must simultaneously address a number of sources of scale and complexity. The following list is indicative of that multi-dimensional challenge and ADMIRE's strategic response:

  • The scale and complexity of each data source grows. ADMIRE addresses this with data-flow technology to reduce data handling and to move data reduction and transformation operations closer to data sources.
  • The number and variety of data sources grow. ADMIRE addresses this by proposing dynamic composition processes, because warehousing and static global schemas are infeasible at this scale.
  • The computational complexity of extracting information grows as a result of the above and of increasingly sophisticated requirements. ADMIRE addresses this by enabling the work of data-aware distributed computing engineers.
  • The number of application domains using DMI grows, becomes more diverse and engages more users. ADMIRE addresses this by recognising communities of users, by supporting them with their own environments and by delivering packaged production versions of DMI processes.
  • The number of experts involved in developing new DMI processes and supporting application domains grows. ADMIRE addresses this by separating support for DMI experts from that for data-aware distributed computing (DADC) engineers and application-domain users.
  • The number of providers of data and DMI services grows. ADMIRE separates the organisation of environments for DMI process development from the complexities of DMI service provision by interposing DMI gateways using a canonical language.
  • The growing sophistication of information extraction from large bodies of data requires ever more complex and refined workflows. ADMIRE addresses this by structuring the predefined components into libraries that correspond to a conceptual structure captured in ADMIRE ontologies, and by supporting the incremental refinement of those libraries and of the DMI processes that use them. It allows greater contemporaneous effort by supporting concurrent independent development by three separate categories of experts working both for providers and users.
  • The providers of data and services autonomously change their offered services and schema at a rate which defies manual adaptation when many resources are in use. ADMIRE proposes that this should be addressed by exploiting type systems, semantic description, community effort and light-weight composition to semi-automatically adapt to change and to pool the intelligence of human interventions.
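The first item in the list above, moving data reduction closer to the data sources, can be made concrete with a small sketch. It assumes a hypothetical source that yields rows one at a time and simply counts how many rows would cross to the consumer under each strategy; none of the names come from ADMIRE.

```python
def source_rows(n):
    """Hypothetical remote source: yields n readings one at a time."""
    for i in range(n):
        yield {"id": i, "value": i % 10}


def fetch_all_then_filter(n, threshold):
    """Baseline: ship every row to the consumer, then reduce there."""
    transferred = list(source_rows(n))
    kept = [r for r in transferred if r["value"] >= threshold]
    return kept, len(transferred)


def filter_at_source(n, threshold):
    """Data-flow style: the filter runs beside the source, so only
    matching rows are ever transferred."""
    stream = (r for r in source_rows(n) if r["value"] >= threshold)
    transferred = list(stream)
    return transferred, len(transferred)


baseline_rows, baseline_cost = fetch_all_then_filter(100, 8)
pushed_rows, pushed_cost = filter_at_source(100, 8)
print(baseline_cost, pushed_cost)
```

Both strategies return the same rows; only the volume of data moved differs, which is the handling cost the data-flow approach aims to reduce.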

ADMIRE's outputs

The ADMIRE project aims to significantly improve the exploitation of data by delivering three categories of output: a framework, an architecture and a set of use cases that illustrate how they can be used to improve DMI. These will be built on a consistent set of principles that will emerge and be validated by the project's research; for example:

  • The partitioning of concerns based on a stratification of interests, with a variety of application domains on top, an intermediate layer containing the various experts in data integration and data mining, and a foundation of distributed systems engineers who build, operate and optimise the computational services.
  • The eventual scale and diversity of DMI activity will mean that globally consistent services will be undesirable and unachievable due to concurrent autonomous change.
  • The scale and diversity of activity will benefit from increasing the independence between DMI-process developers and DMI-service developers.
  • Communities of interest within application domains will need to be supported, e.g. their domain-specific prevalent standards will need to be honoured and their independent vocabularies, ontologies and developed DMI-processes supported.
  • Support for controlled sharing of information is essential.

Within ADMIRE these high-level principles are explored in more detail as they are applied in the specific explorations of the project. Within the scope of the project it is not possible to explore their full generality. They are discussed as they arise within the sub-topics covered below.

The ADMIRE framework will include in prototypical form:

  • a coherent and consistent model in which data exploration, data integration and data mining are integrated and well supported;
  • mechanisms for extracting information from such integrated data by accommodating a wide variety of data-analysis methods and (legacy) services;
  • a robust and efficient underpinning distributed-computing framework that accelerates the integration and interpretation of data from multiple autonomous data resources;
  • tools that support data mining and knowledge engineering experts as they develop strategies, algorithms and workflow patterns to extract information or test hypotheses against integrated data;
  • an expansion of the range of domain experts who successfully exploit the data, achieved by delivering easily understood tools, so that domain experts may use the strategies and workflow patterns developed by DMI experts to discover new evidence, inform policies and analyse previous behaviour; and
  • a distributed computing environment that expects and handles the wide variety of changes that are encountered as data resources are re-organised and evolve their services in response to advances in technology or in their business and research activities.

The ADMIRE architecture will provide guidelines, validated through prototypes and preliminary evaluations, for those data-aware distributed-systems engineers who will build production versions of the tools, services and computing environments to deliver DMI for widespread professional use. The architecture is intended to guide the construction and operation of the full computing environment needed for DMI by supporting all aspects of the DMI framework. The ADMIRE architecture will include:

  • a separation, via a canonical form for DMI process definition, between a diverse and extensible domain of DMI tools and a range of DMI-enactment platforms;
  • a model for describing all the components participating in DMI processes that supports the range of DMI tools, DMI-enactment optimisation and automated adaptation to changes in data sources and services;
  • DMI gateways that mediate requests in the canonical form and hide the transformations, delegation and heterogeneity of the distributed underlying data resources, DMI processing elements and services; and
  • efficient direct data paths for delivering results and monitoring DMI-process enactment.
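The gateway idea in the list above can be sketched as follows. This is an illustration under stated assumptions, not the ADMIRE canonical language or gateway interface: `CanonicalRequest`, `DMIGateway`, the adapter functions and the sample data are all hypothetical names invented here.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CanonicalRequest:
    """Hypothetical canonical form: which resource, which field, what cut-off."""
    resource: str
    field: str
    minimum: float


class DMIGateway:
    """Accepts canonical requests and delegates to resource-specific adapters."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Callable[[CanonicalRequest], List[dict]]] = {}

    def register(self, resource: str, adapter) -> None:
        self._adapters[resource] = adapter

    def submit(self, request: CanonicalRequest) -> List[dict]:
        if request.resource not in self._adapters:
            raise LookupError(f"no adapter for resource {request.resource!r}")
        return self._adapters[request.resource](request)


# Two heterogeneous resources (assumed sample data): records and CSV text.
ROWS = [{"level": 2.5}, {"level": 0.4}]
CSV_TEXT = "level\n1.2\n3.1\n"


def rows_adapter(req: CanonicalRequest) -> List[dict]:
    """Adapter for a resource already held as a list of records."""
    return [{req.field: r[req.field]} for r in ROWS if r[req.field] >= req.minimum]


def csv_adapter(req: CanonicalRequest) -> List[dict]:
    """Adapter for a CSV resource: parse, convert to float, then filter."""
    reader = csv.DictReader(io.StringIO(CSV_TEXT))
    values = (float(row[req.field]) for row in reader)
    return [{req.field: v} for v in values if v >= req.minimum]


gateway = DMIGateway()
gateway.register("gauges", rows_adapter)
gateway.register("archive", csv_adapter)

print(gateway.submit(CanonicalRequest("gauges", "level", 1.0)))
print(gateway.submit(CanonicalRequest("archive", "level", 1.0)))
```

The caller writes the same request shape for both resources and never sees whether the data lives in records or in CSV text; that separation is what lets tools and enactment platforms evolve independently.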

In ADMIRE the use cases will be derived from usage scenarios where the project members have the relevant expertise and where the use cases provide significant challenges to test prototypes of ADMIRE's envisaged systems. The usage scenarios which generate these challenges are listed below; key ones are described in WP6.

  • Customer relationship management. This usage scenario is in the context of the mobile phone market, where it is vital to understand why customers change their subscription plans or move to another provider. It is led by Comarch, which has experience of providing customer relationship management (CRM) for several mobile phone providers.
  • Gene expression in the developing embryo. This usage scenario involves data acquired from high-throughput studies in the context of understanding the normal embryonic development of the house mouse (Mus musculus) in terms of spatial gene expression patterns. To make these patterns useful to researchers world-wide, they need to be annotated to allow searching for appropriate images. It is led by the National e-Science Centre (NeSC) team using data provided by colleagues in the Medical Research Council (MRC) Human Genetics Unit (HGU) in Edinburgh.
  • Non-invasive vital health parameter prediction. The context of this usage scenario is an investigation into the ability of methods from traditional Chinese medicine (TCM) to estimate vital indicators, such as blood pressure and blood sugar, and to support diagnosis and health monitoring. It is led by the team from Vienna, who are working in collaboration with medical researchers in Beijing.
  • Environmental factors affecting river management. This usage scenario integrates data relating to several environmental phenomena that influence the management of a river, e.g. precipitation, land use, land saturation and reservoir operation, and then uses data mining to predict several important parameters relating to potential flood risk. The results need to be presented in map form. This is led by the Institute of Informatics, Slovak Academy of Sciences (IISAS).
  • Common factors in data-centre operational incidents. This usage scenario takes operational logs from several data centres; its main goal is to avoid or mitigate future operational incidents. This is led by Fujitsu Laboratories Europe (FLE).

ADMIRE's impact

The principles, architecture, model, languages, implementation strategies, emerging best practice and a representative set of use cases will be developed as a book that should become available soon after the end of the ADMIRE project. This will build on the scientific papers published during the project.

Improving the use of data as a means of better informing decisions and planning is essential for the profitability of businesses, for response to emergencies, for strategic political decision formulation and for long-term health and wellbeing. ADMIRE expects to enable such improvement. The methodology pioneered in ADMIRE should prove widely applicable as an organisational strategy for using computational analysis of the many-faceted data that are collected about complex systems in order to advance our understanding of such systems.

With reference to the OECD's 2007 report OECD Principles and Guidelines for Access to Research Data from Public Funding, other strategies for accommodating the growing wealth of data in a common but flexible framework are based on data spaces (see Franklin and Howe), on dynamically integrating data and on distributed data mining algorithms (see Syed, Grossman or Dubitzky's Data Mining Techniques in Grid Computing Environments [Wiley-Blackwell]). These all show the increasing interest in addressing issues of scale, heterogeneity and diversity, and all contribute to aspects of ADMIRE's research.
