There seems to be no limit to our ability to create more data. Whilst storage densities have increased at a similar pace to Moore's law, with 8TB disk drives now readily available, access rates per device have only marginally improved over the last 20 years. The Big Data challenge is how best to design these systems to maximize performance, how to make this data useful, how best to find the needle in the haystack, and how to discover the new insights which become possible as a result of these new large collections.

We can bring to bear years of experience dealing with high performance storage solutions to optimize system performance. From what you see in the press you could be forgiven for thinking that FLASH storage is the answer to everything. We have found that whilst random I/O improves, sequential throughput and write latency are often worse due to the block-erase requirements of FLASH. The increased I/O rate also exposes bandwidth constraints inherent in most FLASH arrays. We have worked with a number of storage and server vendors to optimize the storage within servers, using PCIe and now NVMe based FLASH storage.

We are already heavily involved with Cloud, Grid and noSQL, which all have a role to play in Big Data.

Application Design for Big Data

Hadoop is an ecosystem rather than a single application. After Google published papers on MapReduce but kept the actual implementation private, Doug Cutting at Yahoo created what was to become the Apache Hadoop project. It is a Java implementation, optimized for big-science calculations that take hours to days to run on TB-PB sized data sets. Most Hadoop programs are written in Java, with programmers spending considerable time developing and optimizing them. Hadoop Streaming does allow scripts to replace Java, but does not permit the same degree of optimization as is achievable with Java. Pig Latin provides a more natural scripting language.
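
To illustrate the programming model, below is a minimal sketch of the canonical word-count job written against the Hadoop Java MapReduce API; the class, job and path names are illustrative only and assume the Hadoop client libraries are on the classpath:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token in its input split
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sums the counts emitted for each word
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // combine locally before the shuffle
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }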

A principle of Hadoop is to take the computing to the data. Data is stored in HDFS, which partitions and replicates it across the cluster. Hadoop itself has been optimized for jobs where the data set is much larger than the aggregate memory capacity of the cluster, requiring each step of the MapReduce to be written to disk.
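
As a small sketch of how that partitioning and replication is exposed to applications (assuming the Hadoop HDFS client libraries; the NameNode address and path are placeholders), the fragment below reads the block size and replication factor of a file, then raises its replication so more nodes hold a local copy:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder NameNode

            try (FileSystem fs = FileSystem.get(conf)) {
                Path data = new Path("/data/input/events"); // placeholder dataset
                FileStatus status = fs.getFileStatus(data);
                System.out.println("block size: " + status.getBlockSize()
                        + ", replication: " + status.getReplication());

                // Raise the replication factor for a hot dataset so that more
                // nodes can run tasks against a local copy of its blocks.
                fs.setReplication(data, (short) 5);
            }
        }
    }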

The focus with Hadoop is now very much on developing and using new query engines. Hadoop itself comes with Hive. An independent effort from Facebook has created Presto, which is ANSI SQL compatible; Cloudera has Impala; and Apache Drill is an open-source engine modelled on Google's Dremel, the technology behind BigQuery.
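
All of these engines present a SQL interface over data already in the cluster. As a minimal sketch, the fragment below queries HiveServer2 over JDBC, assuming the Hive JDBC driver is on the classpath; the host, credentials, table and column names are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // Register the Hive JDBC driver and point at a HiveServer2 instance
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://hive.example.com:10000/default"; // placeholder endpoint

            try (Connection conn = DriverManager.getConnection(url, "analyst", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT event_type, COUNT(*) AS n " +
                         "FROM web_events GROUP BY event_type ORDER BY n DESC LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }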

Our experience has been that there are actually very few problems in the enterprise environment with data sets large enough to demand traditional Hadoop (e.g. you need less than 200TB to store all the tweets posted on Twitter for an entire year!). Derivative projects such as Spark aim to optimize for smaller, more interactive problems, and we believe these are better options for most commercial organizations. They leverage the Hadoop cluster infrastructure but provide query and processing support better suited to the type of problems we encounter in the commercial world. This is a rapidly changing set of tools, and we help clients understand the choices and embark upon a Hadoop strategy which delivers results more quickly.
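
For comparison with the MapReduce sketch above, the same kind of aggregation expressed interactively through Spark SQL's Java API might look like the following (Spark 2.x+ assumed on the classpath; the input path and column names are placeholders). Intermediate results stay in memory rather than being written back to HDFS between steps:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkQueryExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("interactive-events-query")
                    .getOrCreate();

            // The data stays in HDFS; Spark reads it once and keeps working
            // sets in memory across queries.
            Dataset<Row> events = spark.read().parquet("hdfs:///data/input/events"); // placeholder path
            events.createOrReplaceTempView("web_events");

            Dataset<Row> top = spark.sql(
                    "SELECT event_type, COUNT(*) AS n " +
                    "FROM web_events GROUP BY event_type ORDER BY n DESC LIMIT 10");
            top.show();

            spark.stop();
        }
    }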

Hadoop Cluster Design

All Hadoop jobs require the correct balance between compute and the bandwidth of memory, I/O channels and storage. If these are not optimized, the Hadoop job may spend more time fetching and saving data than doing actual computational work. This is exacerbated in the commercial world, where run times tend to be shorter and data sets smaller.
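
A back-of-envelope calculation makes the point; all of the figures below are illustrative assumptions, not measurements. If the CPUs can get through their work faster than the aggregate disk bandwidth can deliver the data, the job is I/O bound and adding cores will not help:

    public class ScanTimeEstimate {
        public static void main(String[] args) {
            double datasetTB = 50.0;          // assumed dataset size
            int nodes = 10;                   // assumed cluster size
            int disksPerNode = 12;            // assumed JBOD configuration
            double mbPerSecPerDisk = 150.0;   // assumed sequential throughput per disk

            double aggregateMBps = nodes * disksPerNode * mbPerSecPerDisk; // 18,000 MB/s
            double scanSeconds = datasetTB * 1_000_000 / aggregateMBps;    // ~2,800 s

            System.out.printf("Aggregate disk bandwidth: %.0f MB/s%n", aggregateMBps);
            System.out.printf("Full scan time: %.0f s (~%.0f min)%n",
                    scanSeconds, scanSeconds / 60);
        }
    }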

We apply design principles that have been proven in deploying some of the largest compute grids. We are able to leverage this experience to create some of the fastest and most cost effective big data platforms available.

Hadoop HDFS, like noSQL databases, manages reliability and resiliency within the cluster configuration. This allows cheaper, faster internal disks and SAS/SATA JBOD arrays to be used instead of traditional enterprise storage solutions.

Some Big Data applications may also benefit from the increased bandwidth and lower latency of InfiniBand or RDMA over Ethernet, such as RoCE and iWARP. We are one of the few independent groups that can provide unbiased advice in this area, having had extensive experience with RDMA technologies over many years. Many are surprised to find that it is often cheaper to build an InfiniBand based cluster than to use Ethernet.

Low Cost Storage

As drive capacities continue to increase, traditional data protection schemes applied at the drive level, such as RAID, start to fall down due to the longer rebuild times and the increasing possibility of uncorrectable errors occurring during the rebuild of multi-TB drives. Applying RAID techniques to smaller units is still viable, but this requires integration with the block allocation schemes and must therefore be built into the operating system. As most storage arrays are now built from servers running Linux, Windows or FreeBSD, having this capability built in makes sense.
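
The arithmetic below shows why rebuild risk grows with capacity; the unrecoverable read error (URE) rate and the array geometry are assumptions chosen for illustration (1 error in 1e14 bits is a commonly quoted figure for desktop-class drives, with enterprise drives nearer 1 in 1e15):

    public class RebuildRiskEstimate {
        public static void main(String[] args) {
            double driveTB = 8.0;          // capacity of each surviving drive
            int survivingDrives = 5;       // e.g. a 6-drive RAID-5 set after one failure
            double ureRate = 1e-14;        // assumed unrecoverable errors per bit read

            // A rebuild must read every surviving bit in the array
            double bitsRead = driveTB * 1e12 * 8 * survivingDrives;

            // Probability of at least one URE somewhere during the rebuild
            double pError = 1 - Math.pow(1 - ureRate, bitsRead);

            System.out.printf("Bits read during rebuild: %.2e%n", bitsRead);
            System.out.printf("Chance of hitting a URE: %.0f%%%n", pError * 100);
        }
    }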

This is exactly what the Ceph project is seeking to achieve. It allows low cost disk to be managed as a pool across a cluster, with blocks being allocated and replicated across different servers in different racks. Linux kernel support is included in all major distributions, allowing these RADOS blocks to be accessed by the kernel, where they can be used to create local filesystems.

Other Ceph services use these blocks to present object storage through Swift or S3 APIs, and a shared filesystem with CephFS.
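
As a sketch of what the S3-compatible path looks like from an application, the fragment below talks to a Ceph RADOS Gateway using the AWS SDK for Java (v1 assumed on the classpath); the endpoint, region string, credentials, bucket and object names are all placeholders:

    import com.amazonaws.auth.AWSStaticCredentialsProvider;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    public class RgwS3Example {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                    // Point the client at the RADOS Gateway instead of AWS
                    .withEndpointConfiguration(
                            new EndpointConfiguration("http://rgw.example.com:7480", "us-east-1"))
                    .withCredentials(new AWSStaticCredentialsProvider(
                            new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")))
                    .withPathStyleAccessEnabled(true) // RGW is usually addressed path-style
                    .build();

            s3.createBucket("archive-test");
            s3.putObject("archive-test", "hello.txt", "stored via the S3 API on Ceph");
            System.out.println(s3.getObjectAsString("archive-test", "hello.txt"));
        }
    }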

After testing Ceph, we currently recommend RADOS block storage, but we found too many problems with CephFS to be able to recommend its use with the current release. Instead we provide file access using NFS gateways which create local filesystems on RADOS blocks. We are still testing object storage.

Ceph is limited when it comes to the fast data ingest required for use cases such as virtual tape. Writers will only map to a few OSDs; as these become full, Ceph will start to rebalance and migrate content to other OSDs, and this happens at the same time as OSD replication is taking place. The write throughput is therefore only going to be a fraction of the cluster network bandwidth, say 20%, so with a 10G cluster network only about 200MB/s of single-writer ingest can be expected. We plan to test with 25G, 40G and 100G cluster interconnects, and the long-awaited RDMA support in Ceph has the potential to improve this significantly.
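
The estimate below extends that reasoning to the faster interconnects we plan to test; the 20% usable fraction is carried over from our 10G observations as an assumption, not a measurement for the faster links, and protocol overhead will reduce the figures further:

    public class CephIngestEstimate {
        public static void main(String[] args) {
            int[] linkGbits = {10, 25, 40, 100};
            double usableFraction = 0.20; // assumed share of the cluster network left for ingest

            for (int gbit : linkGbits) {
                double lineRateMBps = gbit * 1000.0 / 8.0;         // 10G ~ 1,250 MB/s line rate
                double ingestMBps = lineRateMBps * usableFraction; // 10G ~ 250 MB/s before overheads
                System.out.printf("%3dG cluster network: ~%.0f MB/s single-writer ingest%n",
                        gbit, ingestMBps);
            }
        }
    }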

Some of the projects we have worked on are:

  • Ceph POC and evaluation
  • Filesystem performance evaluation and test of candidates for Big Data stores, including BTRFS and ZFS.
  • FLASH device and NVMe performance testing and selection
  • InfiniBand to FibreChannel gateway evaluation and test.
  • RDMA throughput tuning and testing.
  • Designed and delivered the first SAN boot environment for Sun Solaris into what was the largest commercially deployed FibreChannel SAN at the time.
  • Formerly a member of the IEEE Mass Storage community, which concentrated on high performance and hierarchical storage solutions for supercomputer sites.
  • UniTree (hierarchical storage manager) product management, consultancy and testing for a major hardware vendor.

Technology watchlist

Conference Presentations

Richard is a frequently invited speaker at Cloud-related conferences; these have included: