We've been working with Cloud computing since before the term was popularized. For us there are five major elements of Cloud computing:

  • Distributed computing models to provide reliable services running across multiple locations with multiple failure scenarios.
  • Horizontal scalability of both application and infrastructure.
  • Agile Infrastructures to dynamically support the scale-up/scale-down requirements characteristic of Cloud computing.
  • Charge back and pricing models to determine whether to build and run your own private Cloud or utilize external service offerings such as caged or public Cloud.
  • Onboarding and, often overlooked but even more important to users, outboarding.

Cloud Programming

There are two important considerations when programming for the Cloud. The first is the massive scale; the second is the need to create reliable services using unreliable components. If you know you are designing and deploying for a million Users, it's a challenge but a solvable one. The problem with Cloud is that, in capacity terms, the user population is unconstrained. The financial reality is that few can afford the initial cost of a system capable of handling these huge numbers when they start up. The software design needs to be horizontally scalable so that additional infrastructure can be added as the User population grows and money starts to roll in. This scalability requirement creates a big challenge for the persistence layer, where traditional SQL database implementations just don't scale to these levels. We've now gained a lot of experience with NoSQL solutions. There are many to choose from and we can help you make the right choice for your requirements.
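
One common way to let a persistence layer grow by adding nodes is consistent hashing, as used by Dynamo-style NoSQL stores. The following is a minimal Python sketch of the idea, not a description of any particular product or of our own implementation; the `HashRing` class, the node names and the `vnodes` parameter are all hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (unlike hash(), stable across runs)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring that assigns keys to storage nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual points per node, smooths the key spread
        self._points = []      # sorted hash positions on the ring
        self._owner = {}       # hash position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place `vnodes` points for the new node on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring point at or after the key's hash."""
        if not self._points:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_left(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

When a node is added to the ring, only the keys that now fall to the new node change owner; everything else stays where it was, which is what makes adding infrastructure incrementally affordable.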

The other aspect of scalability is the ever-increasing number of cores on a modern server. It's getting increasingly difficult to utilize these with conventional programming languages such as Java, C# or C++. The threading and mutexes soon become a nightmare to code and to debug. We are finding that the concurrency model of Erlang is a good fit for Cloud and we now use it by choice. There are also an increasing number of open source components developed in Erlang which we are able to leverage. Erlang/OTP is a platform designed for highly reliable systems. Originally designed by Ericsson for its telephone switches, it has now become a popular choice for cloud applications.

Everybody wants their Cloud service to be a success with millions of Users. To make this affordable requires a lot of infrastructure. At these scales a small cost saving on each server can make a big difference. This is the opposite of the traditional Enterprise model, where they still gold-plate their servers rather than improve their applications' failure resistance. This application reliability and resilience has to be designed in from the beginning.
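
As an illustration of building a reliable call out of unreliable components, here is a minimal Python sketch that tries each replica in turn and backs off between rounds. It is a sketch under assumptions, not a prescribed design: `call_with_failover`, `AllReplicasFailed` and the idea of replicas as plain callables are hypothetical names introduced for the example.

```python
import time


class AllReplicasFailed(Exception):
    """Raised when every replica failed in every round."""


def call_with_failover(replicas, request, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Try each replica in turn; back off and retry the whole set on failure.

    `replicas` is a list of callables (hypothetical service clients) that
    either return a result or raise an exception.
    """
    last_error = None
    for attempt in range(attempts):
        for replica in replicas:
            try:
                return replica(request)
            except Exception as exc:  # any failure moves us on to the next replica
                last_error = exc
        if attempt < attempts - 1:
            # Exponential backoff between rounds: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
    raise AllReplicasFailed(f"no replica answered after {attempts} rounds") from last_error
```

The caller sees a single reliable operation, while any individual cheap server behind it is allowed to fail.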

Maintenance and Upgrades are also a challenge. There is never a good time to shut down your Cloud service to roll out a new upgrade. This has big implications for how you design both the infrastructure and the application. We've worked with some of the biggest names on solutions to this problem.
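
One widely used way to avoid a full shutdown is a rolling upgrade, where servers are upgraded a few at a time while the rest keep serving. The Python sketch below shows the control loop only and is an assumption-laden illustration rather than a description of any client's solution; `rolling_upgrade`, `upgrade` and `is_healthy` are hypothetical names, and the callbacks stand in for whatever drains, installs and checks a real server.

```python
def rolling_upgrade(servers, upgrade, is_healthy, batch_size=1):
    """Upgrade servers one batch at a time, halting at the first unhealthy batch.

    `upgrade` and `is_healthy` are hypothetical callbacks supplied by the caller.
    Returns the list of servers that were upgraded and passed their health check.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    done = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            upgrade(server)  # in practice: drain traffic, install, restart
        failed = [s for s in batch if not is_healthy(s)]
        if failed:
            # Stop here so the remaining servers keep running the old version.
            raise RuntimeError(f"upgrade halted, unhealthy servers: {failed}")
        done.extend(batch)
    return done
```

This only works if the application tolerates old and new versions running side by side, which is why the upgrade path has to be part of the design from the start.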

Cloud Infrastructure

Successful Cloud solutions rely on multiple components all communicating efficiently. As we came from a background of building large HPC grids, we have applied many of the same techniques to Cloud infrastructure. As computers get more and more powerful, the bottleneck on these Cloud clusters is increasingly the communication fabric that interconnects them, just as it is for the large HPC grids. We have had enormous success using InfiniBand for this. It now runs at 56Gb/s, has lower latency and has a lower cost per provisioned port than 10G Ethernet. It incorporates RDMA engines and provides packet error detection and recovery in hardware, avoiding the overhead of TCP/IP. RDMA allows it to achieve faster wire speeds and frees up the host CPUs to get on with running application code.

We have used InfiniBand as a Universal Interconnect, running intercommunication and storage protocols across it. It also supports IPoIB for backward compatibility with programs which do not have native VERB support, and gateways to Ethernet and FibreChannel. Native VERB support is already available for MySQL, memcached, RAMCloud, Hadoop, Qpid, Oracle Exadata, Oracle Coherence, Lustre, GPFS as well as VMware and is standard in the Linux kernel.

There are long-haul InfiniBand switches available that support distances of thousands of miles and transfer data at higher rates than TCP/IP. We've used these ourselves to achieve wire speed transfer between InfiniBand systems across hundreds of miles.

InfiniBand storage options include SCSI RDMA Protocol (SRP) and iSER to access block-level storage, and NFS/RDMA and Windows SMB for file-level access. FibreChannel and FCoE storage can be accessed through gateways, whilst direct access InfiniBand storage is available from DDN, EMC and IBM.

Cloud Management

Clouds need to be able to support on-demand scale-up and scale-down and User portals for instant configuration. Both require high degrees of automation. The larger the environment, the easier it is to justify the cost of developing this automation. This automation allows the cloud providers to fully leverage their scale, adding to the existing advantages of lower (per unit) procurement costs and Data centre facilities. As these Cloud providers improve this automation it will become increasingly difficult for an Enterprise to offer the same level of service for the same price. We anticipate more and more organizations will migrate to the Cloud, or leverage the same technologies to deploy their own private Clouds.

What is often overlooked though is that it is only possible to automate when the items are well described. High levels of automation require far richer levels of configuration information than exists in the typical Enterprise. Mostly, their configuration management today is used for Asset Management and reporting. This has to be looked at as the basis for automation if they are to compete with the Cloud providers.

We have a Data Centre model written in UML which describes the entire data centre infrastructure and a configuration description of each application. This has been designed to support the automation and replication of the data centre. The model itself can be realized within a CMDB.

Onboarding and Outboarding

There is now a healthy focus on onboarding from all the major Cloud providers. However, very little is being said about outboarding, that is, being able to migrate your data and applications back out of a particular Cloud provider. We would argue that this is even more important to Users, in order to prevent them becoming trapped with a single supplier or, worse, left with nothing if their supplier drops the service offering or goes bust. Whilst some Cloud providers may be considered too big to fail, these larger providers frequently switch direction and move in and out of ventures with little concern for their now dependent User base.

Virtualization and the Cloud

Related to Cloud computing is the area of Virtualization. This is often viewed as an enabler and many organizations are building private Cloud-like infrastructure by leveraging virtualization technologies. The first step is to add a self-service portal so that Users can create their own VMs. Remember to establish some life cycle management for these, otherwise you will find yourself overloaded by VMs. We recommend a lease-based approach, where the VM can be removed if the User fails to respond to renewal requests.
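
The lease-based approach can be sketched in a few lines of Python. This is an illustrative sketch only; the `VmLease` class, the `classify` function, the state names and the 30-day and 7-day periods are all assumptions chosen for the example, not values from a real deployment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class VmLease:
    """Hypothetical lease record kept for each self-service VM."""
    vm_id: str
    owner: str
    expires_at: datetime

    def renew(self, now: datetime, period: timedelta = timedelta(days=30)) -> None:
        """Extend the lease when the User responds to a renewal request."""
        self.expires_at = now + period


def classify(lease: VmLease, now: datetime,
             warn_before: timedelta = timedelta(days=7),
             grace: timedelta = timedelta(days=7)) -> str:
    """Return 'active', 'renewal_due', 'grace' or 'reclaim' for a lease."""
    if now < lease.expires_at - warn_before:
        return "active"
    if now < lease.expires_at:
        return "renewal_due"   # send the User a renewal request
    if now < lease.expires_at + grace:
        return "grace"         # expired, but not yet removed
    return "reclaim"           # no response: the VM can be removed
```

A scheduled job could run `classify` over every lease, send renewal requests for those in `renewal_due`, and hand those in `reclaim` to whatever process removes VMs.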

As organizations strive to virtualise an ever bigger slice of the server estate, we are finding that I/O is becoming the critical resource to manage. Until now the focus has been on CPU and memory resource management. Now that higher I/O applications are being targeted, and fatter servers provide plenty of CPU and memory, I/O has become the constraint. We carried out some analysis of the various technologies available to help address this. These include iSCSI and FCoE over 10G Ethernet, FibreChannel HBA management and Virtual I/O across InfiniBand. Take a look at the paper I presented on this topic at the 2009 Syscon conference in Prague.

Some of the projects we have completed are:

  • Interim CTO for a Cloud startup. Carried out the solution design, acted as engineering lead and helped create the VC portfolio for a large scale messaging system. We cannot say much about what it does until it comes out of stealth mode, but it incorporated Erlang, RabbitMQ, Riak, Couchbase, Mnesia, Yaws, JavaScript, Ext.js and HTML5.
  • Led the architecture and supported the implementation team for a large scale Cloud based Grid offering secure, dynamically sized and personalized Grid environments.
  • Developed a normalization approach for chargeback of a private Cloud offering in a major Bank enabling them to normalize across different hardware platforms for this x86 and SPARC based virtual hosting platform. This provided a standardized unit of charge and control independent of processor architecture, number of cores or clock speed.
  • Developed the Agile Infrastructure maturity model and assessment approach for Citihub which was used at a number of major Banks.
  • Developed a UML model of the entire Data Centre infrastructure and software build to enable automation for a major Cloud provider.
  • Worked extensively with Secure Enterprise and public Email services including X.400 and X.500 based systems. Developed gateways and custom User agents from X.400 to legacy mail and telephony systems.
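
To give a flavour of what a normalized unit of charge can look like, the Python sketch below converts allocated cores on different platforms into a common unit using per-core performance factors. It is a hypothetical illustration, not the approach developed for the Bank: the platform names, the factors and the `Allocation`, `normalized_units` and `charge` names are all made up for the example, and real factors would have to come from benchmarking.

```python
from dataclasses import dataclass

# Hypothetical per-core performance factors relative to a reference core.
# These numbers are invented for illustration; real ones come from benchmarks.
PLATFORM_FACTOR = {
    "x86-reference": 1.0,
    "sparc-example": 0.5,
}


@dataclass(frozen=True)
class Allocation:
    """Cores of a given platform allocated to a User for a number of hours."""
    platform: str
    cores: int
    hours: float


def normalized_units(alloc: Allocation) -> float:
    """Express an allocation in platform-independent compute units."""
    return alloc.cores * PLATFORM_FACTOR[alloc.platform]


def charge(alloc: Allocation, rate_per_unit_hour: float) -> float:
    """Charge for an allocation at a single rate per normalized unit-hour."""
    return normalized_units(alloc) * alloc.hours * rate_per_unit_hour
```

With these made-up factors, four reference x86 cores and eight of the example SPARC cores come out as the same number of units, so they are charged the same for the same number of hours.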

Technology watchlist

Conference Presentations

Richard is a frequently invited speaker at Cloud related conferences. These have included: