AWS RDS and EC2 CPU performance (in)consistency

After our company decided to join the public cloud bandwagon and move its databases to AWS, I got curious about what exactly we are getting in terms of CPU performance in RDS or EC2. I asked a friend who had already set up an Oracle instance in AWS RDS to run the same CPU-intensive SQL that I previously used to compare various database platforms, as described here.

The test SQL is very simple:

with t as ( SELECT rownum FROM dual CONNECT BY LEVEL <= 100 )
select /*+ ALL_ROWS */ count(*) as cnt from t,t,t,t ;

The SQL generates 100 rows and then joins the result set to itself, using four copies of it in total, which produces 100^4 = 100,000,000 rows. The beauty of this SQL is that it does not generate any I/O, so elapsed time depends only on CPU and RAM speed.
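To make the row-count arithmetic concrete, here is a minimal Python sketch in the spirit of the test SQL. It is not how Oracle executes the query; the function name `cross_join_count` is made up for illustration, and the brute-force count is only run on a small input.

```python
from itertools import product


def cross_join_count(rows: int, copies: int) -> int:
    """Count the rows of a `copies`-way Cartesian product by brute force."""
    base = range(rows)
    # Visiting every combination keeps the work CPU-bound with no I/O,
    # loosely like count(*) over the joined set.
    return sum(1 for _ in product(base, repeat=copies))


# On a small input the brute-force count matches the closed form rows ** copies.
small_count = cross_join_count(10, 4)

# The test SQL uses 100 rows and four copies of t.
expected_rows = 100 ** 4
```

Raising `rows` to 100 reproduces the full 100,000,000 combinations, at the cost of a much longer run in pure Python.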

What my friend observed when running this on RDS was that, in general, CPU performance was what one would expect from modern Intel Xeon E5-xxxx processors, keeping in mind that an Amazon vCPU is a hyper-thread, not a physical CPU core. One unusual behavior, however, was a jigsaw pattern: substantial performance drops interrupting the overall growth of the throughput chart, as in the chart on the right.

On closer examination, it turned out that the elapsed time of the test SQL varies noticeably once the number of parallel sessions goes above 1-2. As a result, the throughput chart does not reproduce itself from one run to the next.

This led me to investigate AWS vCPU speed consistency by continuously executing the test SQL in parallel sqlplus sessions over a period of time and measuring the variation in elapsed time.
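A harness for this kind of measurement could look roughly like the sketch below. It is not the script used for the tests in this post. The function names are hypothetical, and the demo command is a stand-in Python one-liner. A real run would substitute a sqlplus invocation, for example something like `["sqlplus", "-S", "user/password@db", "@test.sql"]`, where the credentials and script name are placeholders.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def time_command(cmd):
    """Run one external command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start


def run_rounds(cmd, sessions, rounds):
    """Launch `sessions` copies of cmd at once, `rounds` times over."""
    timings = []
    with ThreadPoolExecutor(max_workers=sessions) as pool:
        for _ in range(rounds):
            # Each thread only waits on a child process, so the Python GIL
            # does not serialize the actual work.
            timings.extend(pool.map(time_command, [cmd] * sessions))
    return timings


# Stand-in workload (assumption): a short CPU-bound Python one-liner.
DEMO_CMD = [sys.executable, "-c", "sum(range(200000))"]
```

Note that timing a whole client process this way includes process start-up and login time, not just the SQL itself.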

The results show not only that CPU performance is inconsistent, but also that the inconsistency follows a very curious pattern: the elapsed times are not just randomly scattered around an average value, but instead cluster around a second “preferred” elapsed time. This second preferred value was considerably longer than the “normal” elapsed time: 8 seconds vs. 3 seconds.
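One crude way to check a list of elapsed times for a second "preferred" value is to bucket them and look for more than one peak. The sketch below assumes one-second buckets and a minimum share of 20% per peak, both arbitrary choices. The `SAMPLE` values are synthetic numbers made up to mimic the 3 s vs. 8 s shape, not measurements from these tests.

```python
from collections import Counter


def histogram(elapsed, bucket=1.0):
    """Count samples per bucket; keys are integer bucket indices."""
    counts = Counter(int(t // bucket) for t in elapsed)
    return dict(sorted(counts.items()))


def modes(hist, bucket=1.0, min_share=0.2):
    """Return lower bounds of buckets that are local peaks with enough samples."""
    total = sum(hist.values())
    peaks = []
    for idx, count in hist.items():
        left = hist.get(idx - 1, 0)    # empty neighbouring buckets count as zero
        right = hist.get(idx + 1, 0)
        if count >= left and count >= right and count / total >= min_share:
            peaks.append(idx * bucket)
    return peaks


# Synthetic elapsed times in seconds (illustrative only, not real results).
SAMPLE = [3.1, 2.9, 3.0, 3.2, 8.1, 7.9, 3.1, 8.0, 3.0, 8.2]
```

Two peaks far apart suggest a bimodal distribution rather than random scatter around one average. Adjacent buckets with equal counts would both be reported, so this is a quick look, not a statistical test.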

I tested this on two different EC2 instances (m4.4xlarge and c4.4xlarge), where the sqlplus client was co-located with the database server, and on two different RDS instances (both db.m4.4xlarge), with the sqlplus client residing either on premises or on AWS EC2, with Oracle 11g and 12c. All tests showed a similar pattern.

What is the underlying reason for this pattern?

It is a very interesting question, and one I do not have the answer to.

There are several potential forces at play here:

  1. “Noisy neighbors”. These can obviously take away CPU cycles because AWS hardware is shared.
  2. The Xen hypervisor, which AWS uses to manage virtual machines. If the hypervisor decides that a specific vCPU has used too much of the physical CPU, it may deschedule it.
  3. The ability of Intel Xeon CPUs to change frequency and enter sleep states: the so-called C-states, P-states and Turbo Boost.
  4. Intel Xeon Hyper-Threading. We know that AWS provides a hyper-thread as a vCPU rather than a core. If two vCPUs (threads) that want CPU cycles happen to land on the same core, they have to share that core.
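The first two candidates can sometimes be observed from inside a Linux guest, such as an EC2 instance with OS access, through the "steal" counter in `/proc/stat`. The sketch below assumes a kernel that exposes at least eight counters on the aggregate `cpu` line and a hypervisor that reports steal time meaningfully, which I have not verified for these instance types. The function names are made up for illustration.

```python
def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from /proc/stat text as a list of ints."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(x) for x in line.split()[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Percentage of CPU ticks between two samples that were stolen."""
    deltas = [b - a for a, b in zip(before, after)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so only the
    # first eight fields are summed.
    total = sum(deltas[:8])
    return 100.0 * deltas[7] / total if total else 0.0


def read_proc_stat(path="/proc/stat"):
    """Read the current aggregate counters (Linux only)."""
    with open(path) as f:
        return parse_cpu_line(f.read())
```

Taking one `read_proc_stat()` sample before a test run and one after, then passing both to `steal_percent`, gives a rough idea of how much CPU time the guest wanted but did not get. It says nothing about frequency scaling or Hyper-Threading contention.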

In practical terms, how much of an impact can this CPU inconsistency have on a real-world application? Well, as always, it depends on the database workload pattern. If sessions spend most of their time waiting on System or User I/O, then CPU speed variations may be drowned out by the longer I/O waits, so the application may not notice. There may, however, be applications whose databases have a substantial CPU time component, and for those the CPU inconsistency may become uncomfortable.

From a practical point of view, before moving to the AWS cloud one can use this approach: look at the OEM Active Sessions page while the database runs in the existing on-premises environment. If the CPU portion (green) is much lower than the System or User I/O portion (blue), then this database workload is probably a good candidate to move to AWS. If there is a lot of green, then AWS may be a risky move.
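The same eyeball check can be written down as simple arithmetic if the active-session time per wait class is available as numbers. In the sketch below the dictionary keys, the function names and the 0.5 threshold are all assumptions for illustration. The post itself only says "much lower" and gives no cut-off.

```python
def cpu_share(active_session_seconds):
    """Fraction of active-session time spent on CPU, given seconds per class."""
    total = sum(active_session_seconds.values())
    return active_session_seconds.get("CPU", 0.0) / total if total else 0.0


def looks_cpu_bound(active_session_seconds, threshold=0.5):
    """True when the CPU share reaches the (hypothetical) threshold."""
    return cpu_share(active_session_seconds) >= threshold
```

For example, a workload with 20 seconds on CPU, 70 on User I/O and 10 on System I/O has a CPU share of 0.2 and would not be flagged.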

Update Jan 10, 2017: I have gathered more timing details and discovered a significant masking factor. The sqlplus session creation time was counted as part of the test elapsed time. As it turned out, in my AWS tests the sqlplus session creation time was either a fraction of a second or it jumped to 5 seconds. The actual SQL elapsed time was more stable, at least when the number of parallel sessions did not exceed half the number of vCPUs. In my next blog post I modified the test script to exclude connection time variations. The question remains, however, of why there are such big variations in connect time. I will leave this to investigate at a later time.
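The general idea of keeping the two timings apart can be sketched as follows. This is not the modified script from the follow-up post. `connect` and `run_query` are hypothetical callables standing in for whatever opens a session and runs the test SQL.

```python
import time


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def run_session(connect, run_query):
    """Time session creation and SQL execution as two separate numbers."""
    conn, connect_s = timed(connect)
    _, query_s = timed(run_query, conn)
    return {"connect_s": connect_s, "query_s": query_s}
```

Recording `connect_s` and `query_s` separately means a slow login shows up in its own column instead of inflating the SQL elapsed time.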

Categories: AWS, CPU speed, EC2, RDS, Virtualization
