Historically, performance tuning was mostly about improving throughput, and that is how today's systems are delivered. Trading systems, however, are racing to lower their latency, and jitter is just as critical, so we re-tune systems to minimize both. In doing this we have to make trade-offs around CPU utilization, power consumption and memory footprint. We also collaborate with application development teams to design in low latency and low jitter from the beginning.

We specialise in the infrastructure layer, selecting and tuning each component and then working closely with your own application development team to minimize end-to-end latency.

We are one of the few organizations that can work independently across this entire infrastructure stack. We carry out operating system configuration, network design and storage design, each with low-latency tuning.

We've developed our own latency insertion model, which allows us to forecast the expected latency benefit of infrastructure changes before investing the effort to make them. We have refined this model based on observed results.

We've developed our own latency test tools which provide deep insight into latency, jitter and drop characteristics. They support both multicast and TCP and also allow us to stress-test networks under extremely high packet loads.
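
To illustrate the kind of measurement involved, the sketch below times UDP round trips and reports simple latency and jitter figures. It is only a minimal illustration, not our test suite: it assumes a UDP echo service at an example address, and real tooling adds per-sequence-number drop accounting, hardware timestamps, multicast support and full distribution analysis.

    /* Minimal UDP round-trip probe (illustrative sketch only).
     * Assumes a UDP echo service at 192.0.2.10:7 (placeholder address). */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <time.h>
    #include <unistd.h>

    #define SAMPLES 10000

    static double now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
    }

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in dst;
        memset(&dst, 0, sizeof dst);
        dst.sin_family = AF_INET;
        dst.sin_port = htons(7);                      /* echo service port */
        inet_pton(AF_INET, "192.0.2.10", &dst.sin_addr);
        connect(fd, (struct sockaddr *)&dst, sizeof dst);

        struct timeval tmo = { 0, 100000 };           /* 100 ms timeout: count as a drop */
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tmo, sizeof tmo);

        char buf[64] = { 0 };
        double min = 1e12, max = 0, sum = 0;
        int got = 0, dropped = 0;

        for (int i = 0; i < SAMPLES; i++) {
            double t0 = now_us();
            send(fd, buf, sizeof buf, 0);
            if (recv(fd, buf, sizeof buf, 0) < 0) { dropped++; continue; }
            double rtt = now_us() - t0;
            if (rtt < min) min = rtt;
            if (rtt > max) max = rtt;
            sum += rtt;
            got++;
        }

        if (got)
            printf("RTT us: min %.1f  mean %.1f  max %.1f (jitter ~ max-min), drops %d\n",
                   min, sum / got, max, dropped);
        close(fd);
        return 0;
    }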

We use a Latency Framework that we developed to help identify where best to invest. In simple terms it's an 80:20 rule: if the infrastructure accounts for more than 20% of the end-to-end latency (excluding any long-haul propagation time), you need to invest to improve it; if it is already below that, the better return is in application latency tuning. We define the infrastructure as everything below the network APIs on the servers (typically the sockets) and everything in between, including all network devices in the path and any storage devices in the critical code paths.
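
As an illustration, with hypothetical numbers: if a round trip measures 100 µs end to end, of which 10 µs is long-haul propagation, and the infrastructure as defined above contributes 30 µs, the infrastructure share is 30/90 ≈ 33%, so infrastructure tuning comes first; had it contributed 15 µs (≈ 17%), the better return would be in application latency tuning.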

There are five stages to improving the latency of the network stack:

  1. Select the carrier with the optimum links to minimize propagation delay. Avoid MPLS-based networks unless suppliers are prepared to commit to latency and jitter.
  2. Flatten the network, minimize router hops and tune or avoid firewalls.
  3. Speed up your links to reduce serialization delay: 1 Gbps minimum, 10 Gbps preferred.
  4. Tune your Operating Systems for low latency including the TCP/IP stack, network devices, interrupts (particularly SMI) and power management.
  5. Segregate your cores and statically allocate most of them to your critical application threads (see the sketch after this list).
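
As a minimal sketch of the core-segregation step (illustrative only; it assumes core 3 has already been isolated from the general scheduler, for example via the isolcpus kernel parameter), the snippet below pins the calling thread to that core so a critical application thread is never migrated:

    /* Pin the calling thread to an isolated core (illustrative sketch).
     * Assumes core 3 is reserved for the critical thread, e.g. isolcpus=3. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        /* pid 0 == the calling thread */
        return sched_setaffinity(0, sizeof set, &set);
    }

    int main(void)
    {
        if (pin_to_core(3) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* ... run the latency-critical send/receive loop here ... */
        return 0;
    }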

FPGAs are now providing some stellar results. They're being used for both packet processing and algorithmic work, but we think the former looks the most promising, since GPUs are generally more performant for highly parallelizable algorithms. The benefit of the FPGA is that it can start processing a packet as the bytes arrive, so as long as the algo is kept simple and can run completely on the FPGA, sub-microsecond latency can be achieved.

Kernel-based TCP/IP has its limitations; you can typically halve its contribution to latency by adopting a user-space implementation such as OpenOnload. This can be used in one of three modes, each with decreasing latency but increasing API complexity. The simplest is to take an existing TCP/IP socket-based program, which can even be an unmodified binary, and preload Onload before it is started. The next is to use TCPDirect, which uses a simplified socket API referred to as zsockets; this requires refactoring of existing Linux socket 'C' code. The best latency with Onload, however, comes from rewriting your application to use the EF_VI API. This is asynchronous, using completion events, so it usually requires a complete rewrite of the send/receive modules.
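
As an illustration of the first mode, here is an ordinary blocking TCP sender sketch with TCP_NODELAY set; the address, port and payload are placeholders. Nothing in it references Onload, which is the point: the same unmodified binary can then be launched under the Onload preload (for example via the onload wrapper script, with spinning enabled in the Onload configuration) to move the stack into user space.

    /* Plain kernel-socket TCP sender (illustrative; 192.0.2.20:9000 is a placeholder).
     * Unchanged, this runs on the kernel stack; preloading Onload accelerates it. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);  /* disable Nagle */

        struct sockaddr_in dst;
        memset(&dst, 0, sizeof dst);
        dst.sin_family = AF_INET;
        dst.sin_port = htons(9000);
        inet_pton(AF_INET, "192.0.2.20", &dst.sin_addr);

        if (connect(fd, (struct sockaddr *)&dst, sizeof dst) != 0) {
            perror("connect");
            return 1;
        }

        const char order[] = "NEW_ORDER";        /* placeholder payload */
        send(fd, order, sizeof order - 1, 0);

        char ack[64];
        recv(fd, ack, sizeof ack, 0);            /* wait for the reply */
        close(fd);
        return 0;
    }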

Where you have control of both ends of the wire, the lowest latency and jitter are obtained by bypassing TCP altogether. There are three main candidates for this - InfiniBand, RoCE and iWARP. These all share a common set of APIs known as verbs, so it is possible to develop applications that will run on any of them. As with OpenOnload, SDP and Mellanox's VMA can both be preloaded to accelerate an existing TCP/IP socket program. OpenOnload retains the TCP/IP protocol, so it can be used single-ended; SDP and VMA both map to verbs, so they must be deployed on both ends of the wire. The best latency with OpenOnload is achieved by receive polling: this sacrifices a core just to receive packets on the socket, but it avoids the kernel wake-up delay of the user thread.

Verbs-based programs can be run on Ethernet, InfiniBand or OmniPath. Verbs programs over Ethernet use either RoCE or iWARP. RoCE relies on PFC to limit senders and on non-drop queuing in the switches for RDMA-Ethertype packets to provide the underlying reliability, whilst iWARP uses TCP (implemented with offload engines). Unfortunately, whilst the DCB standard includes the mechanisms to enable this, the current generation of Ethernet switches typically only enable dropless behavior for the FCoE Ethertype. While RoCE programs may appear to work, any drops may go undetected and result in data corruption. Large-scale RDMA Ethernet deployments also need L2 mesh support to replace Spanning Tree. A number of proprietary approaches are appearing to solve this, whilst the DCB group is focusing on TRILL, which we should see emerge during 2012.
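
The common-API point can be seen with a minimal verbs sketch: the same enumeration code works whether the device underneath is InfiniBand, RoCE or iWARP (link with -libverbs; real applications go on to register memory, create queue pairs and post work requests):

    /* List RDMA-capable devices via the verbs API (illustrative sketch). */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **list = ibv_get_device_list(&num);
        if (!list) {
            perror("ibv_get_device_list");
            return 1;
        }
        if (num == 0) {
            fprintf(stderr, "no RDMA devices found\n");
            ibv_free_device_list(list);
            return 1;
        }
        for (int i = 0; i < num; i++)
            printf("device %d: %s\n", i, ibv_get_device_name(list[i]));
        ibv_free_device_list(list);
        return 0;
    }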

The options your Ethernet switch vendor offers to support RDMA are a major differentiator, so make sure you press them on this point. In our experience, InfiniBand causes the fewest deployment problems and is proven in large, critical environments, although some retraining is required for the network team.

Now that we are working to save microseconds, time synchronization between systems has become a bigger problem. Traditional approaches such as NTP were simply not intended to provide this level of accuracy. This is an area of ongoing research for us, and we have designs that we have started to implement.

Some of the projects we have completed are:

  • Testing of commercial overclocked servers and design of a custom overclocked server.
  • Design and accuracy testing of a market data capture and archive solution for backtesting purposes.
  • RDMA and verbs programming models for application development teams.
  • CoLo blueprint design covering network, servers, OS and management, used to cookie-cut multiple CoLo deployments.
  • FIX Engine latency and streaming feed tuning.
  • 10G Ethernet NIC testing and comparison to help a well-known bank define its standard.
  • We have designed InfiniBand networks for several clients, including a European trading exchange and a tier 1 investment bank. We pay close attention to observability and manageability, including start-of-day tests and packet capture capabilities for diagnostics.
  • Deploying the first trading system in Europe to run on a long-distance InfiniBand network. This bypassed all of the traditional layer 3 and TCP routing overhead and removed nearly 0.5 ms from a critical trading link. The management and observability of this new network was a key issue for the Cisco-skilled network team.
  • Worked with a trading application team on RDMA and low-latency enablement of their applications, including strategies for avoiding or minimizing garbage collection, and the evaluation and use of TCP bypass and the associated OFED APIs.
  • Low latency network design for a pan-European trading network.
  • Solaris and Linux network stack tuning for low latency, including characterization of various NICs and tuning of their drivers. Latency savings of 30% were achieved, along with significant jitter reduction.
  • Root cause analysis of latency and reliability issues. The client was unsure whether the network or the servers were causing the problem; during the investigation we found issues with both and provided a list of remediations to resolve them.
  • Ran a vendor selection process for an InfiniBand network, including requirements specification, facilitated vendor workshops and moderated scoring.
  • Developed our own multicast test suite of programs, running on both Ethernet and InfiniBand, which provides full distribution analysis of the packets to analyse both latency and jitter in a network.
  • Managed a program to deploy InfiniBand in a critical trading environment. This required a lot of attention to manageability and observability so that the Cisco-skilled network team could successfully manage the new network. It also entailed a detailed roll-out program to add InfiniBand connectivity to existing production servers. We developed our own test programs, which verified the correct operation of the InfiniBand connectivity, including multicast, as each server was upgraded.

Conference and workshop presentations

Technology watch list

  • Intel DPDK - Now that Intel has made this a set of user-space libraries, it is much more useful than in its original bare-metal form. It enables high-speed networking, unconstrained by the kernel.
  • Arista DANZ tap aggregation and timestamping, added to its 7150 range of switches, lowers the cost of creating packet capture solutions.
  • Solarflare's addition of kernel bypass packet capture now makes dropless packet capture at 10G line rates using commodity hardware a viable proposition.
  • IEISS FIX engine, the only FIX engine we are aware of that is written in Erlang. They also have an extensive set of test tools which can be used to test this and other FIX engines.
  • User-level TCP/IP stacks such as OpenOnload provide the lowest latency across Ethernet and are the preferred option where we cannot bypass TCP/IP altogether.

Master classes

We have developed a range of training master classes covering low latency, including design, TCP bypass and RDMA, InfiniBand network design and management, accurate time measurement, long-distance optimization and long-distance InfiniBand connectivity. Take a look at our training and workshops.