
AWS EC2 CPU (in)consistency – Part 4. A fix : disable half vCPU ?

In the prior posts in this series (older posts: part1, part2, part3) I used a simple integer increment shell test to demonstrate that Amazon EC2 Linux instances exhibit inconsistent vCPU speeds whenever the number of processes actively running on CPU becomes greater than half of the available vCPUs. The performance differences were very big: some processes ran two times slower than others. I came to the conclusion that this inconsistency is explained by the observed fact that the OS scheduler does not start to rebalance running processes until the number of processes exceeds the number of vCPUs. Only after this point does the scheduler rebalancing kick in, and process speeds become more or less close.

As a next step I tried to influence scheduler behavior by changing the scheduling policy. Present-day Linux uses the Completely Fair Scheduler (CFS), which, leaving aside the real-time options, offers pretty much only three user-controllable policies: SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE, settable via the chrt command (sched-design-CFS).
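
For reference, the policies can be tried out with chrt as shown below (./cpu_test here is just a placeholder for the actual test command):

-- run the test under each policy; static priority must be 0 for non-realtime policies
chrt -o 0 ./cpu_test     # SCHED_OTHER / SCHED_NORMAL (the default)
chrt -b 0 ./cpu_test     # SCHED_BATCH
chrt -i 0 ./cpu_test     # SCHED_IDLE

-- check or change the policy of an already running process by pid
chrt -p <pid>
chrt -i -p 0 <pid>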

I tried all three and, without going into extra detail, the end result was that only SCHED_IDLE demonstrated slight rebalancing of the processes' CPU assignments. Apparently, when the current shell was using the SCHED_IDLE policy, other processes present on the system received higher priority and were able to preempt the test processes, thus triggering a rebalance. Obviously, using SCHED_IDLE for normal workloads is not a good idea, so this cannot be considered a viable option.

So, back to square one after the scheduling policy detour.

At this point I started to believe that the Linux OS scheduler's rebalancing strategy has a fundamental flaw: it leaves a process running on the same hyperthread where it started, regardless of whether its "sibling" hyperthread later becomes idle or busy. If all vCPUs were equal, then yes, there would be no point in rebalancing processes as long as the number of vCPUs is bigger than the number of processes willing to run on CPU.

If this is true, then what happens when we disable the subset of vCPUs that share cores with the other subset, leaving only one vCPU per core online? Think about it: if hyperthread-based vCPU1 is already trying to use its core fully to run Process 1, why keep insisting on squeezing in a second hyperthread, vCPU2, to run Process 2 on the same core? Why not instead disable vCPU2 and have the remaining vCPU1 serve both Process 1 and Process 2? There is no CPU cache sharing expected here, so any theoretical benefit of hyperthreading is doubtful, while the drawbacks of not rebalancing are obvious.

With this reasoning I proceeded to disable half of the vCPUs:

-- determine vCPUs on same core
lscpu -a -e
	CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
	0   0    0      0    0:0:0:0       yes
	1   0    0      1    1:1:1:0       yes
	2   0    0      2    2:2:2:0       yes
	3   0    0      3    3:3:3:0       yes
	4   0    0      4    4:4:4:0       yes
	5   0    0      5    5:5:5:0       yes
	6   0    0      6    6:6:6:0       yes
	7   0    0      7    7:7:7:0       yes
	8   0    0      0    0:0:0:0       yes
	9   0    0      1    1:1:1:0       yes
	10  0    0      2    2:2:2:0       yes
	11  0    0      3    3:3:3:0       yes
	12  0    0      4    4:4:4:0       yes
	13  0    0      5    5:5:5:0       yes
	14  0    0      6    6:6:6:0       yes
	15  0    0      7    7:7:7:0       yes
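
The same pairing can also be read directly from sysfs, which is convenient for scripting (a small sketch, assuming the standard topology files are exposed; on this instance each pair should come out as N,N+8):

-- list hyperthread siblings per vCPU
for c in /sys/devices/system/cpu/cpu[0-9]*; do
echo "$c: $(cat $c/topology/thread_siblings_list)"
done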

-- disable one set of vCPUs
-- become root
cd /sys/devices/system/cpu
lscpu -a -e
for i in {8..15} ; do
echo 0 > cpu$i/online
done

lscpu -a -e
	CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
	0   0    0      0    0:0:0:0       yes
	1   0    0      1    1:1:1:0       yes
	2   0    0      2    2:2:2:0       yes
	3   0    0      3    3:3:3:0       yes
	4   0    0      4    4:4:4:0       yes
	5   0    0      5    5:5:5:0       yes
	6   0    0      6    6:6:6:0       yes
	7   0    0      7    7:7:7:0       yes
	8   -    -      -    :::           no
	9   -    -      -    :::           no
	10  -    -      -    :::           no
	11  -    -      -    :::           no
	12  -    -      -    :::           no
	13  -    -      -    :::           no
	14  -    -      -    :::           no
	15  -    -      -    :::           no
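
The change is not persistent across reboots and can be reverted at any time by writing 1 back to the same files:

-- re-enable the second set of vCPUs (as root)
cd /sys/devices/system/cpu
for i in {8..15} ; do
echo 1 > cpu$i/online
done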

-- rerun the test
for i in {1..32} ; do
lc $i 10
sleep 3
done | tee parallel_shell_disabled_vCPUs.log

-- while the test is running, monitor in another shell:
watch "ps -e -o user,pid,psr,comm,s | grep bash| grep R | sort -n -k3"

Here are the results, side by side:

[Figure: side-by-side comparison before and after disabling half of the vCPUs]

As we can see, after the second set of vCPUs was disabled, consistency became much better in the range between vCPU/2 and vCPU active processes.

While the test was running, with ps I could also see that the PSR field started changing immediately after the number of processes became greater than vCPU/2, meaning that the OS scheduler started to rebalance.

So what does all this mean for the average AWS customer? Are we suggesting, as a matter of best practice, to disable half of the vCPUs?

Here we come to a difficult question.

On one hand, the test results are so obvious that the answer seems to be a no-brainer: yes, it is better to disable them for the sake of response time consistency.

On the other hand, this recommendation makes the whole billing situation awkward: Amazon charges by the hour based on instance class, and the instance class is tied to the number of vCPUs, so effectively it charges by vCPU. Then why would I pay for disabled vCPUs? And if this situation is real, why isn't everybody complaining?

We must discuss here under what circumstances this effect will NOT be observed.

If we watch the ps PSR field closely while running the test, we notice that the scheduler does a very good job of INITIAL balancing of processes between the available cores. For example, in the N < vCPU/2 range you will never see more than one process per core, and in the vCPU/2 to vCPU range never more than two. So a process that regularly goes off CPU and comes back gets placed anew each time it wakes up, which means this good initial placement is effectively reapplied over and over.

To verify this effect I decided to run three more tests where CPU-intensive load is intermixed with some kind of waits.

Pipe gzip shell test

-- to build graph
-- run in bash:

( for i in {1..32} ; do
   echo "Running $i parallel gzip";
   for ((k=0;k<$i;k++)); do
      ( ( dd if=/dev/zero bs=1M count=2048 | gzip -c > /dev/null ) 2>&1 | grep bytes ) &
   done;
   wait ;
   sleep 1;
done)
        Running 1 parallel gzip
        2147483648 bytes (2.1 GB) copied, 15.7105 s, 137 MB/s
        Running 2 parallel gzip
        2147483648 bytes (2.1 GB) copied, 15.9782 s, 134 MB/s
        2147483648 bytes (2.1 GB) copied, 16.0045 s, 134 MB/s
        ...

This runs N background gzip commands that take their input via a pipe from dd if=/dev/zero. While most of the elapsed time is spent in CPU-intensive gzip, the presence of the pipe adds an element of waits.

Here is the plotted test result:

[Figure: gzip/dd pipe test results]

As we can see, when a CPU-intensive process has numerous (even brief) waits, the OS scheduler has a chance to rebalance, and the results are more consistent than in the integer increment tests (though, while consistency is better, it is still far from uniform).

Database CPU-intensive workload – long sql

This test starts N sqlplus sessions in the background, waits for the connections to establish, then runs a CPU-intensive sql and measures elapsed times. (See script …)
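
The actual script is not reproduced here; purely as an illustration of its shape, a driver could look roughly like the sketch below (the connect string scott/tiger@testdb and the connect-by query are my placeholders, not the sql used in the test):

# launch N parallel sqlplus sessions, each running a CPU-burning sql
# and printing its elapsed time via "set timing on"
N=8
for ((s=1; s<=N; s++)); do
sqlplus -s scott/tiger@testdb <<'EOF' &
set timing on
select count(*) from dual connect by level <= 50000000;
exit
EOF
done
wait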

The left chart below shows how sql elapsed times fluctuated when the number of active sqlplus sessions N was in the vCPU/2 to vCPU range.

The middle chart shows that disabling half of the vCPUs considerably reduced the fluctuations by letting the scheduler rebalance, while at the same time maintaining overall throughput.

The right chart shows the comparison for a non-virtualized on-premise environment.

[Figure: long sql elapsed times – all vCPUs online, half of the vCPUs disabled, and on-premise]

Database CPU-intensive workload – short sql

The reason to run this test was that during the long-running sql testing I observed that rebalancing, when it happened, happened only every couple of seconds. Therefore, for short-running sqls there may not be enough time for a rebalance to kick in and make a difference.
To test this I changed the sql to run for less time. As shown below, for sql workloads lasting only a couple of seconds, the disable-half-vCPU fix made only a marginal improvement:

[Figure: short sql test with half of the vCPUs disabled]

Conclusion

As we can see, vCPU speed inconsistency is workload dependent:

1) when the workload is CPU only, the observed variability is 100% whenever the number of active processes exceeds vCPU/2 but stays below vCPU. 100% variability means that some processes run twice as slowly as others.

2) when the workload is a mixture of CPU and waits (IO, network, IPC communications, etc.), the effect is smaller, but still noticeable.

Potential application impacts may be: jumpy response times, timeouts.
Potential database situations may be: bugs where sessions start spinning on CPU waiting on a latch; the sql optimizer choosing a CPU-intensive plan such as a hash join.

The behavior is explained by the Linux OS scheduler's unwillingness to rebalance running processes between vCPUs after the initial balancing. Since AWS provisions a vCPU as an Intel Xeon hyperthread, and two hyperthreads share a single physical core, this means that in the absence of rebalancing a process's speed depends on whether its vCPU shares a core with another CPU-intensive process or not.

Preventative measure 1: size the AWS instance big enough that the number of active processes never exceeds vCPU/2 (in effect, this means never letting aggregate CPU utilization go above 50%).

Preventative measure 2: if you have to run more parallel active processes than the number of cores, then consider disabling half of the vCPUs to improve system predictability.
