Data-Intensive Analytics

ADMIRE offers a strategy and distributed architecture to assist data analysis experts in making sense of the increasingly large and complex world of digital data.

Separation of concerns

ADMIRE identifies three types of data user:

  • Domain experts who hold the business perspective for a given data-intensive problem. They need answers to domain-specific questions without having to invest significant effort in development.
  • Data analysis experts who specialise in extracting information from data. They devise the strategies to answer the domain experts' questions.
  • DIDC experts (data-intensive distributed computing experts) understand the best ways of harnessing computing resources to build data-intensive processing platforms.

ADMIRE separates the concerns of these users, providing familiar tools to the domain experts, a standard canonical approach to describing data-intensive processes for the data analysis experts, and a comprehensive and efficient enactment framework for the DIDC experts.

In this way, ADMIRE divides the challenge of designing and implementing any large-scale data analysis process into three nearly independent parts:

  1. the design and specification of data-intensive processes in high-level graphical and DISPEL notations;
  2. the specification and description of processing elements as user-defined code; and
  3. the specification of patterns that compose processing elements and deliver fast ways of generating powerful and frequently used data-intensive processes.
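As a rough illustration of how these three parts relate, the sketch below models processing elements as plain Python functions over record streams and a pattern as a function that composes them. It is not DISPEL and does not use any ADMIRE API; the names select_fields, keep_if and pipeline, and the sample records, are assumptions made up for this example.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

# Assumption: a processing element (PE) is modelled as a function
# that turns one stream of records into another.
PE = Callable[[Iterable], Iterator]

def select_fields(*fields: str) -> PE:
    """PE that keeps only the named attributes of each record."""
    def run(records):
        for record in records:
            yield {name: record[name] for name in fields}
    return run

def keep_if(predicate: Callable[[dict], bool]) -> PE:
    """PE that passes on only the records satisfying the predicate."""
    def run(records):
        return (record for record in records if predicate(record))
    return run

def pipeline(*stages: PE) -> PE:
    """Pattern: chain PEs so each consumes the previous one's output."""
    def run(records):
        return reduce(lambda stream, stage: stage(stream), stages, iter(records))
    return run

# Hypothetical customer records, invented for illustration.
customers = [
    {"id": 1, "tariff": "basic", "monthly_minutes": 120},
    {"id": 2, "tariff": "premium", "monthly_minutes": 900},
]

# The process itself is specified by composing existing PEs.
heavy_users = pipeline(
    keep_if(lambda r: r["monthly_minutes"] > 500),
    select_fields("id", "tariff"),
)
result = list(heavy_users(customers))
```

In this sketch the process design (the pipeline call), the element code (select_fields, keep_if) and the composition pattern (pipeline) can each be changed without touching the other two, which is the kind of independence described above.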

An Example: Paul's Challenge

Paul has recently been promoted to customer manager at a mobile telecommunications company, and must now understand why the company's customers stay or leave.

How to do it?

Data access, data integration

Paul decides that he will base his decisions almost exclusively on the data that the company has about its customers. Consequently, he will have to understand the available data, filter it and process it for the purpose of understanding his customers' behaviours and preventing them from moving to another mobile phone operator.

Fortunately, the data he needs is available in the company's databases, including information about users, types of tariffs, call records, contracted services, etc. Some of these databases are quite large and in some of them the information is messy. There's also a large amount of information that isn't relevant to Paul's immediate concerns, so he will need to filter it, select the most appropriate attributes, generate useful summaries, and apply suitable data mining algorithms. This has to be done iteratively, with the filters and algorithms changed and adapted according to Paul's changing needs.

Adaptation to multiple sources

Several months later, Paul's company merges with another one, which holds similar types of information about its users, some of whom probably overlap with Paul's existing customers. The two companies' sets of databases are, of course, different in every possible way: organisation, detail, quality and size. The complexity of running Paul's data analysis algorithms suddenly increases and, given the larger number of records that these algorithms now have to deal with, their performance slows.

Integrating public data sources for business advantage

On top of this, Paul wants to go beyond the current state of the art in many mobile phone companies, and is considering the possibility of trying to connect mobile phone customer records available in the company databases with information from the social networks of each customer and the relationships between the calls that they make to each other. This would allow the company to offer special deals for groups of people (e.g., very cheap calls among all the members of a group).

Paul needs ADMIRE.

A Structured Approach to Data Analytics

By taking a structured and methodical approach, Paul can solve his business problems in a way that can grow and change easily as his requirements and environment change.

By thinking in terms of data processing elements connected together into data workflows, Paul can build a solution that is not only reusable but also easily adaptable to new data.

Beginning with the hypothesis that the business intelligence to identify customers likely to leave (to "churn") is contained somewhere within the Tariffs and Communication databases, Paul develops his first solution.
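A minimal sketch of what such a first workflow might look like, written in plain Python rather than DISPEL. The record layouts, the function names and the 30-minute threshold are all assumptions for illustration; in particular, flag_churn_risk is a placeholder rule standing in for whatever data mining model Paul would actually train.

```python
from collections import defaultdict

def summarise_calls(call_records):
    """Aggregate call count and total minutes per customer."""
    totals = defaultdict(lambda: {"calls": 0, "minutes": 0.0})
    for record in call_records:
        entry = totals[record["customer_id"]]
        entry["calls"] += 1
        entry["minutes"] += record["minutes"]
    return dict(totals)

def join_with_tariffs(call_summary, tariffs):
    """Attach each customer's tariff to their usage summary."""
    for customer_id, tariff in tariffs.items():
        usage = call_summary.get(customer_id, {"calls": 0, "minutes": 0.0})
        yield {"customer_id": customer_id, "tariff": tariff, **usage}

def flag_churn_risk(rows, min_minutes=30.0):
    """Placeholder rule (an assumption): low usage marks a customer as at risk."""
    for row in rows:
        yield {**row, "at_risk": row["minutes"] < min_minutes}

def churn_workflow(call_records, tariffs):
    """Connect the three elements into one workflow."""
    summary = summarise_calls(call_records)
    return list(flag_churn_risk(join_with_tariffs(summary, tariffs)))

# Hypothetical extracts from the Tariffs and Communication databases.
tariffs = {"c1": "basic", "c2": "premium"}
calls = [
    {"customer_id": "c1", "minutes": 5.0},
    {"customer_id": "c2", "minutes": 40.0},
    {"customer_id": "c2", "minutes": 12.5},
]
report = churn_workflow(calls, tariffs)
```

Because each step only depends on the shape of the records it receives, Paul can swap the placeholder rule for a real classifier, or change the summaries, without rewriting the rest.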

Now, to deal with the company merger, Paul must extend his solution. He must also manage the fact that the "new" company stored its call data in two sources, a "calls out" database and a "calls in" database.

However, his structured approach has paid off, and his new solution isn't so very far away from a simple splicing of two copies of his old one.
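The splicing idea can be sketched as follows, again in plain Python and under stated assumptions: the field names (caller, callee, seconds), the normalising functions and the "old"/"new" labels are hypothetical. The point is only that the new company's two call sources are first brought into the shape the existing element expects, after which the same element runs unchanged on both sides.

```python
from itertools import chain

def normalise_calls_out(rows):
    """Map the hypothetical "calls out" schema onto the common record shape."""
    for row in rows:
        yield {"customer_id": row["caller"], "minutes": row["seconds"] / 60}

def normalise_calls_in(rows):
    """Map the hypothetical "calls in" schema onto the common record shape."""
    for row in rows:
        yield {"customer_id": row["callee"], "minutes": row["seconds"] / 60}

def total_minutes(call_records):
    """The reused element: total minutes per customer."""
    totals = {}
    for record in call_records:
        key = record["customer_id"]
        totals[key] = totals.get(key, 0.0) + record["minutes"]
    return totals

def merged_totals(old_calls, new_calls_out, new_calls_in):
    """Two copies of the same element, one per company, spliced together."""
    new_calls = chain(
        normalise_calls_out(new_calls_out),
        normalise_calls_in(new_calls_in),
    )
    return {"old": total_minutes(old_calls), "new": total_minutes(new_calls)}

# Hypothetical sample data for both companies.
old_calls = [{"customer_id": "a1", "minutes": 10.0}]
calls_out = [{"caller": "b1", "seconds": 120}]
calls_in = [{"callee": "b1", "seconds": 60}]
combined = merged_totals(old_calls, calls_out, calls_in)
```

Keeping the two companies' results under separate keys is a deliberate choice in this sketch, since customer identifiers from the two sides may overlap.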

Through the application of structured design methods for data-intensive analytics, Paul has developed a powerful predictive engine that can be deployed in a live environment to predict patterns of churn for customers from both sides of the merged company.

The ADMIRE advantage

ADMIRE advocates just this approach to data analytics. It provides powerful language, description and enactment tools that implement the necessary framework, giving Paul and others like him a means to manage the complexity of modern data analysis.