Archive for the ‘Virtualization’ Category

AWS EC2 CPU (in)consistency – Part 4. A fix : disable half vCPU ?

January 27, 2017 Leave a comment

In prior post in this series ( older posts: part1, part2, part3 ) I used simple integer increment shell test demonstrating that Amazon EC2 Linux instances exhibit inconsistent vCPU speeds whenever number of processes actively running on CPU becomes greater than half of available vCPU. The performance differences were very big – some processes run two times slower than others. I came to conclusion that this inconsistency is explained by observed fact that OS scheduler is not starting to rebalance running processes until after number of processes exceeds number of vCPU. Only after this point the scheduler rebalancing kicks in and process speeds become more or less close.

As a next step I tried to influence scheduler behavior by changing scheduler policy. Current day Linux uses Completely Fair Scheduler (CFS) which, without going into real-time options, leaves pretty much only 3 user controlled options : SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE policies, controllable via chrt command (sched-design-CFS)

I tried all three and without going into extra details the end result was that only SCHED_IDLE demonstrated slight rebalancing in processes CPU assignments. Apparently when current shell was using SCHED_IDLE policy, other processes present on the system received higher priority and were able to preempt test processes, thus triggering rebalance. Obviously using SCHED_IDLE for normal workloads is not a good idea, so this can not be considered a viable option.

So back to square one after scheduling policy detour.

At this point I started to believe that Linux OS scheduler rebalancing strategy has a fundamental flaw where it leaves process running on same hyperthread where it started, regardless of whether its “sibling” hyperthread later become idling or busy. If all vCPU were equal than yes, there would be no point in rebalancing processes as long as number of vCPU is bigger than number of processes willing to run on CPU.

If this is true then what happens when we disable one subset of vCPU which share same core leaving the other subset online? If you think about it – if one Hyperthread based vCPU1 is already trying to completely use its core to run Process 1 – why keep insisting on squeezing in a second Hyperthread vCPU2 to run Process 2 on the same core ? Why not instead disable vCPU2 and have remaining vCPU1 serve both Process 1 and Process 2 ? There is no CPU cache sharing expected here so any theoretical benefit of hyperthreading is doubtful while drawbacks of not rebalancing are obvious.

With this reasoning I proceeded to disable half of vCPU :

-- determine vCPUs on same core
lscpu -a -e
	0   0    0      0    0:0:0:0       yes
	1   0    0      1    1:1:1:0       yes
	2   0    0      2    2:2:2:0       yes
	3   0    0      3    3:3:3:0       yes
	4   0    0      4    4:4:4:0       yes
	5   0    0      5    5:5:5:0       yes
	6   0    0      6    6:6:6:0       yes
	7   0    0      7    7:7:7:0       yes
	8   0    0      0    0:0:0:0       yes
	9   0    0      1    1:1:1:0       yes
	10  0    0      2    2:2:2:0       yes
	11  0    0      3    3:3:3:0       yes
	12  0    0      4    4:4:4:0       yes
	13  0    0      5    5:5:5:0       yes
	14  0    0      6    6:6:6:0       yes
	15  0    0      7    7:7:7:0       yes

-- disable one set of vCPUs
-- become root
cd /sys/devices/system/cpu
lscpu -a -e
for i in {8..15} ; do
echo 0 > cpu$i/online

lscpu -a -e
	0   0    0      0    0:0:0:0       yes
	1   0    0      1    1:1:1:0       yes
	2   0    0      2    2:2:2:0       yes
	3   0    0      3    3:3:3:0       yes
	4   0    0      4    4:4:4:0       yes
	5   0    0      5    5:5:5:0       yes
	6   0    0      6    6:6:6:0       yes
	7   0    0      7    7:7:7:0       yes
	8   -    -      -    :::           no
	9   -    -      -    :::           no
	10  -    -      -    :::           no
	11  -    -      -    :::           no
	12  -    -      -    :::           no
	13  -    -      -    :::           no
	14  -    -      -    :::           no
	15  -    -      -    :::           no

-- rerun the test again
for i in {1..32} ; do
lc $i 10
sleep 3
done | tee parallel_shell_disabled_vCPUs.log

-- while the test running monitor in another shell:
watch "ps -e -o user,pid,psr,comm,s | grep bash| grep R | sort -n -k3"

Here are the results, side by side :


As we can see after second set of vCPU was disabled the consistency became much better in vCPU/2-vCPU range.

While the test was running, with ps I could also see that PSR field started changing immediately after number of processes became more than vCPU/2, meaning that OS scheduler started to rebalance.

So what does this all mean for average AWS customer ? Are we suggesting as a matter of best practice to disable half of vCPU ?

Here we come to difficult question.

On one hand, the test results are so obvious that the answer seems to be a nobrainer – yes, better to disable for response time consistency sake.

On other hand, this recommendation makes whole billing situation awkward – Amazon charges by hour based on instance class where instance class is tied to number of vCPU, so effectively it charges by vCPU. Then why would I pay for disabled vCPU ? If this situation is true – why is not everybody complaining ?

We must discuss here under what circumstances this effect will NOT be observed.

If we watch closely the ps PSR field when running the test, we notice that scheduler does very good job of INITIAL balancing of processes between available cores. For example, in N<vCPU/2 range you will never see more than one process per core, and in vCPU/2

To verify this effect I decided to run three more tests where CPU-intensive load is intermixed with some kind of waits.

pipe gzip shell test

-- to build graph
-- run in bash:

( for i in {1..32} ; do
   echo "Running $i parallel gzip";
   for ((k=0;k<$i;k++)); do         ( ( dd if=/dev/zero bs=1M count=2048 | gzip -c > /dev/null ) 2>&1 | grep bytes ) &
   wait ;
   sleep 1;
        Running 1 parallel gzip
        2147483648 bytes (2.1 GB) copied, 15.7105 s, 137 MB/s
        Running 2 parallel gzip
        2147483648 bytes (2.1 GB) copied, 15.9782 s, 134 MB/s
        2147483648 bytes (2.1 GB) copied, 16.0045 s, 134 MB/s

This runs N background gzip commands having input via pipe from dd if=/dev/null. While most of the elapsed time is spent on CPU-intensive gzip, the pipe presence adds element of waits.

Here is plotted test result :


As we can see, when CPU-intensive process has numerous (even brief) waits, the OS scheduler has a chance to rebalance and results are more consistent than in integer increment tests. (But while consistency is better, it is still far from uniform ).

Database CPU-intensive workload – long sql

This test starts N sqlplus sessions in background, waits for connections to establish then runs CPU-intensive sql and measures elapsed times. (See script …)

Left chart below shows how sql elapsed times fluctuated when number of active sqlplus sessions N was in vCPU/2

Middle chart shows that disabling half of vCPU considerably reduced fluctuations by letting scheduler to rebalance, while at the same time maintaining overall throughput..

Right chart shows comparison for non-virtualized on-premise environment.


Database CPU-intensive workload – short sql

The reason to run this test was that during long-running sql testing I observed that rebalancing if happened it happened every couple of seconds. Therefore, for short-running sqls there may not be enough time for rebalance to kick in and make a difference.
To test this I changed sql to run less time. As shown below, for couple seconds duration sql workloads the disable half vCPU fix made only marginal improvement:



As we can see, vCPU  speed inconsistency is workload dependent:

1) when the workload is CPU only, the observed variability is 100% whenever number of active processes exceeds vCPU/2 and below vCPU. 100% variability means that some processes run twice as slow as others.

2) when the workload is a mixture of CPU and waits ( IO, network, IPC communications, etc) then the effect is less, but still noticeable.

Potential application impacts may be : jumpy response times, timeouts.
Potential database situations may be: bugs where sessions starts spinning on CPU waiting on a latch; sql optimizer choosing CPU-intensive plan like hash joins.

The behavior is explained by Linux OS scheduler not willing to rebalance running processes between vCPUs after initial balancing. Since AWS provisions vCPU as Intel Xeon HyperThread and two HyperThreads are sharing single physical core, this means that in the absence of rebalancing the process speed depends on whether its vCPU shares a core with another CPU-intensive process or not.

Preventative measure 1: size AWS instance big enough that number of active processes never exceeds vCPU/2 (this in effect means never letting aggregated CPU utilization go above 50%)

Preventative measure 2: if you have to run more parallel active processes than number of cores, then consider disabling half of vCPU to improve system predictability.


AWS EC2 CPU (in)consistency – Part 3. Simple Shell test.

January 26, 2017 1 comment


In this blog post I will describe a simple shell script for measuring CPU consistency.

In my prior tests I used sqlplus sessions running a CPU-intensive SQL in parallel background processes. That approach revealed that there was something strange in the way vCPU behaved in AWS EC2 environment. I noticed that whenever number of parallel sqlplus sessions became one more than vCPU/2 there were always two unlucky sessions which ran substantially slower than the rest of sessions – up to 50% slower. Obviously this was not good because SQL elapsed time consistency is very important.  I also noticed that while these parallel processes were running – their per CPU assignment in “top” never changed. Since we know that AWS EC2 presents Hyperthread as vCPU and that Intel Xeon has two Hyperthreads per physical Core – it is clear that if two sessions share one core they will run slower than a session running on dedicated core. Question is – why processes do not move between cores ? Is not this a job of OS scheduler to give all processes fair share of CPU usage ? This observation meant that there was something fundamental related to OS scheduling which lead to inconsistent SQL elapsed time.

If this was true, then this behavior must reveal itself in other situations outside of SQL and RDBMS.

I decided to try and use only shell script to generate CPU load.

Read more…

AWS RDS and EC2 CPU performance (in)consistency – Part 2

January 10, 2017 1 comment

After discovering that in my prior tests there was significant variable factor of sqlplus session connect time (which still needs to be researched but this is for a later time), I decided to try and isolate this factor so that the test would be focused more on sql elapsed time as a measure of CPU performance.

I modified the script to have large 15 sec delay to setup connections and also to separately report session creation times and sql execution times. I also printed “ps” PSR field for corresponding server processes in a hope to spot any dependency. The test sql was also modified in order to increase elapsed time :

with t as ( SELECT rownum FROM dual CONNECT BY LEVEL &lt;= 200 )
select /*+ ALL_ROWS */ count(*) as cnt from t,t,t,t ;

Full script is listed in the end of this post.

With that on c4.4xlarge EC2 instance (i.e. 16 vCPUs / 8 cores / 16 Hyperthreads ) I am observing following:

When number of parallel sessions is less or equal to 8, all sessions sql elapsed time is consistent at about 40 seconds. For example with 8 sessions:


However when I add one more session to make it 9, the sql elapsed time on two sessions out of 9 increases by 50%  :


I interpret this as additional session starting to share a core with one of the other 8 sessions, thus making the two sessions run slower.

Read more…

AWS RDS and EC2 CPU performance (in)consistency

January 5, 2017 1 comment

After our company decided to join public cloud bandwagon and move its databases to AWS, I got curious of what exactly we are getting in terms of CPU performance in RDS or EC2. I asked my friend who had already established Oracle instance in AWS RDS to run same CPU-intensive SQL which I previously used to compare various database platforms as described here.

The test SQL is very simple:

with t as ( SELECT rownum FROM dual CONNECT BY LEVEL <= 100 )
select /*+ ALL_ROWS */ count(*) as cnt from t,t,t,t ;

rds-jigsawThe sql generates 100 rows and then joins the resultset to itself 4 times producing 100,000,000 records. The beauty of this sql is that it does not generate any IO so elapsed time depends only on CPU and RAM speed.

What my friend observed when running this on RDS was that in general CPU performance was what is expected from modern Intel Xeon E5-xxxx processors, with understanding that Amazon vCPU count is a hyper thread count and not real CPU core count. One unusual behavior however was a jigsaw pattern where there was a substantial performance drop in the overall growth chart, like in the chart on the right.

On closer examination it turned out that there is noticeable variation in test sql execution elapsed time when parallel sessions go above 1-2. So the throughput chart would not reproduce itself on different runs.

This led me to investigate AWS vCPU speed consistency by continuously executing test sql in parallel sqlplus sessions over period of time and measuring elapsed time variations.

Read more…

Categories: AWS, CPU speed, EC2, RDS, Virtualization