
Archive for the ‘CPU speed’ Category

AWS EC2 CPU (in)consistency – Part 4. A fix: disable half of vCPUs?

January 27, 2017

In the prior posts in this series (part1, part2, part3) I used a simple integer-increment shell test to demonstrate that Amazon EC2 Linux instances exhibit inconsistent vCPU speeds whenever the number of processes actively running on CPU becomes greater than half the number of available vCPUs. The performance differences were very big – some processes ran two times slower than others. I came to the conclusion that this inconsistency is explained by the observed fact that the OS scheduler does not start to rebalance running processes until the number of processes exceeds the number of vCPUs. Only after this point does the scheduler rebalancing kick in, and the process speeds become more or less close.

As a next step I tried to influence the scheduler behavior by changing the scheduling policy. Present-day Linux uses the Completely Fair Scheduler (CFS), which, without going into real-time options, leaves pretty much only 3 user-controlled options: the SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE policies, controllable via the chrt command (sched-design-CFS).
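
For reference, here is roughly how the policy of a running shell can be changed with chrt (a quick sketch using the standard util-linux chrt flags; $$ is the current shell's pid):

chrt -p $$            # show the current scheduling policy and priority of this shell
chrt -b -p 0 $$       # switch this shell to SCHED_BATCH
chrt -i -p 0 $$       # switch this shell to SCHED_IDLE
chrt -o -p 0 $$       # back to the default SCHED_NORMAL (a.k.a. SCHED_OTHER)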

I tried all three and, without going into extra detail, the end result was that only SCHED_IDLE demonstrated slight rebalancing in the processes' CPU assignments. Apparently, when the current shell was using the SCHED_IDLE policy, other processes present on the system received higher priority and were able to preempt the test processes, thus triggering a rebalance. Obviously, using SCHED_IDLE for normal workloads is not a good idea, so this cannot be considered a viable option.

So back to square one after scheduling policy detour.

At this point I started to believe that the Linux OS scheduler's rebalancing strategy has a fundamental flaw: it leaves a process running on the same hyperthread where it started, regardless of whether its “sibling” hyperthread later becomes idle or busy. If all vCPUs were equal then yes, there would be no point in rebalancing processes as long as the number of vCPUs is bigger than the number of processes willing to run on CPU.

If this is true, then what happens when we disable one subset of vCPUs that share cores with the other subset, leaving that other subset online? If you think about it – if one hyperthread-based vCPU1 is already trying to completely use its core to run Process 1, why keep insisting on squeezing in a second hyperthread vCPU2 to run Process 2 on the same core? Why not instead disable vCPU2 and have the remaining vCPU1 serve both Process 1 and Process 2? There is no CPU cache sharing expected here, so any theoretical benefit of hyperthreading is doubtful, while the drawbacks of not rebalancing are obvious.

With this reasoning I proceeded to disable half of the vCPUs:

-- determine which vCPUs share the same core
lscpu -a -e
	CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
	0   0    0      0    0:0:0:0       yes
	1   0    0      1    1:1:1:0       yes
	2   0    0      2    2:2:2:0       yes
	3   0    0      3    3:3:3:0       yes
	4   0    0      4    4:4:4:0       yes
	5   0    0      5    5:5:5:0       yes
	6   0    0      6    6:6:6:0       yes
	7   0    0      7    7:7:7:0       yes
	8   0    0      0    0:0:0:0       yes
	9   0    0      1    1:1:1:0       yes
	10  0    0      2    2:2:2:0       yes
	11  0    0      3    3:3:3:0       yes
	12  0    0      4    4:4:4:0       yes
	13  0    0      5    5:5:5:0       yes
	14  0    0      6    6:6:6:0       yes
	15  0    0      7    7:7:7:0       yes

-- disable one set of vCPUs
-- become root
cd /sys/devices/system/cpu
lscpu -a -e
for i in {8..15} ; do
echo 0 > cpu$i/online
done

lscpu -a -e
	CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
	0   0    0      0    0:0:0:0       yes
	1   0    0      1    1:1:1:0       yes
	2   0    0      2    2:2:2:0       yes
	3   0    0      3    3:3:3:0       yes
	4   0    0      4    4:4:4:0       yes
	5   0    0      5    5:5:5:0       yes
	6   0    0      6    6:6:6:0       yes
	7   0    0      7    7:7:7:0       yes
	8   -    -      -    :::           no
	9   -    -      -    :::           no
	10  -    -      -    :::           no
	11  -    -      -    :::           no
	12  -    -      -    :::           no
	13  -    -      -    :::           no
	14  -    -      -    :::           no
	15  -    -      -    :::           no
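
-- note: the change is not persistent across reboots; to re-enable the vCPUs
-- later, write 1 back to the same sysfs files
for i in {8..15} ; do
echo 1 > cpu$i/online
done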

-- rerun the test
for i in {1..32} ; do
lc $i 10
sleep 3
done | tee parallel_shell_disabled_vCPUs.log

-- while the test is running, monitor in another shell:
watch "ps -e -o user,pid,psr,comm,s | grep bash| grep R | sort -n -k3"

Here are the results, side by side:

[Figure: disabling-vcpu-side-by-side – test results before and after disabling half of the vCPUs]

As we can see, after the second set of vCPUs was disabled, consistency became much better in the vCPU/2 to vCPU range.

While the test was running, with ps I could also see that the PSR field started changing immediately after the number of processes became more than vCPU/2, meaning that the OS scheduler started to rebalance.

So what does this all mean for the average AWS customer? Are we suggesting, as a matter of best practice, to disable half of the vCPUs?

Here we come to a difficult question.

On one hand, the test results are so obvious that the answer seems to be a no-brainer – yes, it is better to disable them for the sake of response time consistency.

On the other hand, this recommendation makes the whole billing situation awkward – Amazon charges by the hour based on instance class, where instance class is tied to the number of vCPUs, so effectively it charges by vCPU. Then why would I pay for disabled vCPUs? And if this situation is real – why is not everybody complaining?

We must discuss here under what circumstances this effect will NOT be observed.

If we watch the ps PSR field closely while running the test, we notice that the scheduler does a very good job of INITIAL balancing of processes between the available cores. For example, in the N<vCPU/2 range you will never see more than one process per core, and in the vCPU/2<N<vCPU range the number of doubled-up cores is kept to the minimum possible. This suggests that every time a process goes off CPU and later wakes up, its placement is effectively balanced anew. In other words, the effect should be much less pronounced for workloads whose processes frequently alternate between running on CPU and waiting.

To verify this effect I decided to run three more tests where CPU-intensive load is intermixed with some kind of waits.

pipe gzip shell test

-- to build graph
-- run in bash:

( for i in {1..32} ; do
   echo "Running $i parallel gzip";
   for ((k=0;k<$i;k++)); do
      ( ( dd if=/dev/zero bs=1M count=2048 | gzip -c > /dev/null ) 2>&1 | grep bytes ) &
   done;
   wait ;
   sleep 1;
done )
        Running 1 parallel gzip
        2147483648 bytes (2.1 GB) copied, 15.7105 s, 137 MB/s
        Running 2 parallel gzip
        2147483648 bytes (2.1 GB) copied, 15.9782 s, 134 MB/s
        2147483648 bytes (2.1 GB) copied, 16.0045 s, 134 MB/s
        ...

This runs N background gzip commands taking their input via a pipe from dd if=/dev/zero. While most of the elapsed time is spent in the CPU-intensive gzip, the presence of the pipe adds an element of waiting.

Here is the plotted test result:

[Figure: gzip_dd_pipe_test – elapsed time consistency for the pipe gzip test]

As we can see, when a CPU-intensive process has numerous (even brief) waits, the OS scheduler has a chance to rebalance, and the results are more consistent than in the integer-increment tests. (But while consistency is better, it is still far from uniform.)

Database CPU-intensive workload – long sql

This test starts N sqlplus sessions in the background, waits for the connections to be established, then runs a CPU-intensive sql and measures elapsed times. (See script …)
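
A minimal sketch of such a harness, for illustration (the real script is the one referenced above; the connect string, credentials, and the fixed 15-second connection barrier are placeholders, the latter carried over from the Part 2 description):

#!/bin/bash
# start N sqlplus sessions in the background; each session sleeps 15 seconds
# after connecting, so that all connections get established before the sql starts
N=$1
for ((s=1; s<=N; s++)); do
  sqlplus -s scott/tiger@mydb <<'EOF' &
set timing on
host sleep 15
with t as ( SELECT rownum FROM dual CONNECT BY LEVEL <= 200 )
select /*+ ALL_ROWS */ count(*) as cnt from t,t,t,t ;
exit
EOF
done
wait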

The left chart below shows how the sql elapsed times fluctuated when the number of active sqlplus sessions N was in the vCPU/2<N<vCPU range.

The middle chart shows that disabling half of the vCPUs considerably reduced the fluctuations by letting the scheduler rebalance, while at the same time maintaining overall throughput.

The right chart shows a comparison with a non-virtualized, on-premises environment.

[Figure: disable_vcpu_side_by_side – long sql elapsed times: all vCPUs vs. half of vCPUs disabled vs. on-premises]

Database CPU-intensive workload – short sql

The reason to run this test was that, during the long-running sql testing, I observed that rebalancing, when it happened, happened only every couple of seconds. Therefore, for short-running sqls there may not be enough time for the rebalance to kick in and make a difference.
To test this I changed the sql to run for less time. As shown below, for sql workloads a couple of seconds in duration, the disable-half-of-vCPUs fix made only a marginal improvement:

[Figure: disable_vcpu_short_sql – short sql results with half of the vCPUs disabled]

Conclusion

As we can see, vCPU speed inconsistency is workload dependent:

1) when the workload is CPU-only, the observed variability is 100% whenever the number of active processes is above vCPU/2 and below vCPU (“100% variability” meaning that some processes take twice as long as others).

2) when the workload is a mixture of CPU and waits (IO, network, IPC communications, etc.), the effect is smaller, but still noticeable.

Potential application impacts may be: jumpy response times, timeouts.
Potential database situations may be: bugs where a session starts spinning on CPU while waiting on a latch; the sql optimizer choosing a CPU-intensive plan such as hash joins.

The behavior is explained by the Linux OS scheduler being unwilling to rebalance running processes between vCPUs after the initial balancing. Since AWS provisions a vCPU as an Intel Xeon hyperthread, and two hyperthreads share a single physical core, this means that in the absence of rebalancing a process's speed depends on whether its vCPU shares a core with another CPU-intensive process or not.
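
As a quick check of which vCPUs share a physical core, the standard sysfs topology files can be read directly; the output below matches the lscpu listing shown earlier:

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
	0,8
-- i.e. vCPU 0 and vCPU 8 are the two hyperthreads of the same core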

Preventative measure 1: size the AWS instance big enough that the number of active processes never exceeds vCPU/2 (this in effect means never letting the aggregate CPU utilization go above 50%).

Preventative measure 2: if you have to run more parallel active processes than the number of cores, then consider disabling half of the vCPUs to improve system predictability.
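
If the sibling vCPUs are not numbered as conveniently as in the lscpu listing above, a more generic sketch can offline one hyperthread per physical core using the sysfs topology files (run as root; this is an illustration – verify it against your instance's topology before relying on it):

#!/bin/bash
# keep the first vCPU seen on each physical core online and offline the rest
declare -A seen
for d in /sys/devices/system/cpu/cpu[0-9]*; do
  core=$(cat $d/topology/core_id 2>/dev/null) || continue
  pkg=$(cat $d/topology/physical_package_id 2>/dev/null) || continue
  if [ -n "${seen[$pkg:$core]}" ]; then
    echo 0 > $d/online            # second hyperthread on this core: offline it
  else
    seen[$pkg:$core]=1            # first hyperthread on this core stays online
  fi
done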


AWS EC2 CPU (in)consistency – Part 3. Simple Shell test.

January 26, 2017

In this blog post I will describe a simple shell script for measuring CPU consistency.

In my prior tests I used sqlplus sessions running a CPU-intensive SQL in parallel background processes. That approach revealed that there was something strange in the way vCPUs behaved in the AWS EC2 environment. I noticed that whenever the number of parallel sqlplus sessions became one more than vCPU/2, there were always two unlucky sessions which ran substantially slower than the rest – up to 50% slower. Obviously this was not good, because SQL elapsed time consistency is very important. I also noticed that while these parallel processes were running, their CPU assignment in “top” never changed. Since we know that AWS EC2 presents a hyperthread as a vCPU, and that an Intel Xeon has two hyperthreads per physical core, it is clear that if two sessions share one core they will run slower than a session running on a dedicated core. The question is – why do the processes not move between cores? Is it not the job of the OS scheduler to give all processes a fair share of CPU usage? This observation meant that there was something fundamental related to OS scheduling which led to inconsistent SQL elapsed times.

If this was true, then this behavior should reveal itself in other situations outside of SQL and RDBMS.

I decided to try to use only a shell script to generate the CPU load.

Read more…

AWS RDS and EC2 CPU performance (in)consistency – Part 2

January 10, 2017

After discovering that in my prior tests there was a significant variable factor – sqlplus session connect time (which still needs to be researched, but that is for a later time) – I decided to try to isolate this factor so that the test would focus more on sql elapsed time as a measure of CPU performance.

I modified the script to have a large 15-second delay for setting up connections, and also to separately report session creation times and sql execution times. I also printed the ps PSR field for the corresponding server processes in the hope of spotting a dependency. The test sql was also modified in order to increase its elapsed time:

with t as ( SELECT rownum FROM dual CONNECT BY LEVEL <= 200 )
select /*+ ALL_ROWS */ count(*) as cnt from t,t,t,t ;

The full script is listed at the end of this post.

With that, on a c4.4xlarge EC2 instance (i.e. 16 vCPUs / 8 cores / 16 hyperthreads) I observed the following:

When the number of parallel sessions is less than or equal to 8, the sql elapsed time is consistent across all sessions at about 40 seconds. For example, with 8 sessions:

[Figure: 8-sessions – consistent sql elapsed times with 8 parallel sessions]

However, when I add one more session to make it 9, the sql elapsed time of two sessions out of 9 increases by 50%:

[Figure: 9-sessions – two out of 9 sessions running ~50% slower]

I interpret this as the additional session starting to share a core with one of the other 8 sessions, thus making those two sessions run slower.

Read more…

AWS RDS and EC2 CPU performance (in)consistency

January 5, 2017

After our company decided to join the public cloud bandwagon and move its databases to AWS, I got curious about what exactly we are getting in terms of CPU performance in RDS or EC2. I asked my friend, who had already established an Oracle instance in AWS RDS, to run the same CPU-intensive SQL which I had previously used to compare various database platforms, as described here.

The test SQL is very simple:

with t as ( SELECT rownum FROM dual CONNECT BY LEVEL <= 100 )
select /*+ ALL_ROWS */ count(*) as cnt from t,t,t,t ;

[Figure: rds-jigsaw – throughput chart showing the jigsaw pattern]

The sql generates 100 rows and then joins the resultset to itself 4 times, producing 100,000,000 records. The beauty of this sql is that it does not generate any IO, so its elapsed time depends only on CPU and RAM speed.

What my friend observed when running this on RDS was that, in general, CPU performance was what is expected from modern Intel Xeon E5-xxxx processors, with the understanding that the Amazon vCPU count is a hyperthread count and not a real CPU core count. One unusual behavior, however, was a jigsaw pattern, where there was a substantial performance drop in the overall growth chart, as in the chart above.

On closer examination it turned out that there is a noticeable variation in the test sql execution elapsed time when the number of parallel sessions goes above 1-2, so the throughput chart would not reproduce itself on different runs.

This led me to investigate AWS vCPU speed consistency by continuously executing the test sql in parallel sqlplus sessions over a period of time and measuring the elapsed time variations.

Read more…

Categories: AWS, CPU speed, EC2, RDS, Virtualization

System Throughput Comparison

April 1, 2012

CPU Throughput Comparison

Recently I compared the single-process speeds of systems with different CPUs by running a simple test SQL on an Oracle database. The SQL was simple and universal in that it produced the same execution plans and execution statistics on both 10g and 11g and on all the architectures I was able to get my hands on – Intel Xeon, IBM Power7, HP Itanium2, Sun UltraSPARC. This makes it possible to quickly compare relative CPU speed for a single database process. The result of the comparison was a surprisingly slow UltraSPARC T2 performance. These processors implement Chip Multi Threading, where a few physical cores run a massive number of threads, with each thread presented as a virtual CPU to the OS. The Sun assumption was that, with CPU frequencies in the gigahertz range, the CPU spends most of its time waiting on RAM access, and that this wait could be better utilized by running virtual threads. This sounds good in theory, but it seems like the UltraSPARC T2 went overboard with virtualization. For example, one of my T5240 boxes has 2 physical CPUs, 12 cores and 96 virtual CPUs (threads). When I make a connection to the database on this server, run something continuously and observe CPU utilization, the utilization never goes above 1-2%. It takes extraordinary effort to get all the virtual CPUs working. For example, you have to run an RMAN backup with a parallel degree in the hundreds to get into the 30-50% CPU utilization range. The whole system acts as one with a huge number of slow CPUs.

When I brought these observations up to our hardware planners, their counter-argument was – OK, a single process may be somewhat slower, but with hundreds of database connections the total system throughput should be great.

This got me thinking about how I could run the same SQL test with multiple parallel processes.

Read more…

Relative CPU speed

April 1, 2012

In my job I often have to move a database to newer hardware.
Sometimes the company decides to do a tech refresh, sometimes it is a new company spin-off or a data center consolidation. When you get to work on new hardware, it is always interesting to see what it is capable of and how it compares to the old system.

Back when I worked for Oracle as an instructor, we would frequently go to a customer location and install the database software for the class on the customer's hardware. We would use a “quick and dirty” method to get a feel for just how fast the system's CPU was. The method was to run the same simple sql and measure its execution time.
The sql was

select count(*) from emp,emp,emp,emp,emp

The beauty of this sql was that it required almost no IO.
There were only 14 records in the EMP table, and it took only 1 IO to read the block into the buffer cache.
After that, the 5-way cartesian join execution time would only depend on the CPU speed.
Every database we installed had the SCOTT schema with the EMP table, which made this simple sql a universally available CPU speed test tool.

These days I work for a company which does not install the sample SCOTT schema.
So I decided to do something similar without the EMP table.

The sql I came up with was this:

with t as ( SELECT rownum FROM dual CONNECT BY LEVEL <= 100 )
select /*+ ALL_ROWS */ count(*) from t,t,t,t
/
COUNT(*)
----------
100000000

This sql does not need any table and therefore will run on any Oracle database.
It generates 100 records by selecting from dual and then joins the resultset to itself 4 times, producing a count of 100 million. Being a select from dual, this sql needs minuscule IO, and its elapsed time mostly depends on CPU speed. Since the sql does not use parallel queries, the execution time can be a measure of single-process CPU speed.
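
A usage sketch with sqlplus's built-in timing (the elapsed value below is a placeholder – it is exactly what varies from system to system):

set timing on
with t as ( SELECT rownum FROM dual CONNECT BY LEVEL <= 100 )
select /*+ ALL_ROWS */ count(*) from t,t,t,t
/
COUNT(*)
----------
100000000

Elapsed: 00:00:NN.NN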

Read more…