Testing Client TCP Timeouts with Multi-AZ Oracle RDS Failover

While preparing for a recent “RAC Benefits” presentation, I did a quick test to see how a client behaves when a Multi-AZ Oracle RDS instance fails over in the middle of a SQL execution or between consecutive SQL executions.

For the client I took a standard AMI provided by my company, created an EC2 instance and installed Oracle Client 11.2.0.3.

For the database server I created an m4.large RDS instance with the Multi-AZ option.

Then I connected sqlplus to the RDS instance and ran the following PL/SQL block to simulate a long-running SQL execution:
execute DBMS_LOCK.sleep(30)

Then I triggered an RDS failover to another AZ and re-executed the PL/SQL block in the same session.

The RDS failover took 2 min 49 sec, as can be seen in the AWS Console:

[Screenshot: RDS failover events in the AWS Console]

The client-side situation was not as good, though.

The next call submitted in the same sqlplus session hung for more than 15 minutes:

[Screenshot: sqlplus session hanging on the next call]

While sqlplus was hanging, netstat on the client showed that the TCP socket it was using had gone into TCP retries:

[Screenshot: netstat output showing the socket in retransmission]
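For reference, one way to watch this state from the client side (a sketch; port 1800 is the RDS listener port used later in this post, adjust to yours):

# poll the client-side socket to the RDS endpoint once a second;
# the "timer" column shows the retransmission timer and the growing retry counter
watch -n 1 "ss -nto state established '( dport = :1800 )'"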

The client got out of the TCP wait only after 15 min 35 sec, even though RDS finished the failover in just 2 min 49 sec.

Another test was to trigger an RDS failover in the middle of a running PL/SQL call.

In this case the client went into an indefinite hang.

This is something that definitely requires attention in application HA planning.

The reason I did this test was to compare client behavior with Multi-AZ RDS to a RAC failover situation.

With RAC we have the Virtual IP (VIP), Fast Connection Failover (FCF) and Fast Application Notification (FAN) to help client sessions avoid getting into TCP timeouts and TCP hangs.

There are 3 places where TCP retries and hangs can happen when the other side goes down:

  1. new connection request
  2. new sql call in an existing connection (like the 15+ minute hang demonstrated above)
  3. already executing call

The first case can be dealt with by using various connection retry and connection timeout parameters, set either in tnsnames.ora or in the connection URL.
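For illustration only (a sketch, not part of the original test; the alias name is made up and parameter availability depends on the client version), such settings can be expressed in a tnsnames.ora entry:

# tnsnames.ora entry with connection retry/timeout controls
FAILOVER_TEST =
  (DESCRIPTION =
    (CONNECT_TIMEOUT = 10)
    (TRANSPORT_CONNECT_TIMEOUT = 3)
    (RETRY_COUNT = 3)
    (ADDRESS = (PROTOCOL = TCP)(HOST = failover-test.c8ilntf5kdon.us-east-1.rds.amazonaws.com)(PORT = 1800))
    (CONNECT_DATA = (SERVICE_NAME = ORCL))
  )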

The second case in RAC is taken care of by the fact that connections are made to a VIP and not to a physical IP. When the VIP is transferred to another node it becomes associated with a new MAC address, so on its next call the client immediately gets a “connection reset” due to the MAC mismatch instead of going into TCP retries.

The third case requires an event-driven mechanism like FCF with FAN. The reason is that once the client has submitted a SQL request to the database, it is waiting to be contacted by the server with the results, so TCP on the client is not even trying to help with retries. This is why the RAC solution is to send a FAN event to the client telling it that the database instance is no longer there. This is done with the help of ONS daemons on the server side and a special FCF thread running in the client UCP connection pool, which listens for FAN events coming from the server and in response marks the client connection as invalid.

This instrumentation is lacking in an RDS database, and it is the reason for the observed client hangs.

What is the solution, then, if we use RDS?

A proper solution would be an event-driven approach where RDS would send some kind of message in case of failover. In the absence of such a message we have to rely on the client to detect the failure.

Traditionally people have tried to either manipulate various TCP socket timeouts and retries or use TCP KeepAlive. Neither is a good solution, for the following reasons:

TCP KeepAlive: this tries to detect failure by periodically sending a test packet on the socket. When the test packet does not go through, this must mean the remote database is no longer alive. The problem with this approach is that it cannot distinguish between a temporary network glitch and the database service truly going down. In fact, TCP KeepAlive goes against the TCP retry philosophy, where TCP tries to hide temporary network problems by way of retransmits. If you set TCP KeepAlive too low, you may end up destroying perfectly good database connections. Equally important, setting KeepAlive effectively places a limit on how long your SQL can run: if your SQL runs too long, the KeepAlive will terminate the connection. EDIT 05/27/2017: more testing shows that TCP keepalive does not place a limit on long-running SQL. The server TCP stack acknowledges the keepalive packet without involving the database process.

TCP Read and Write Timeouts and Retries: The idea is to reduce the number of retries and the delays between retries to some reasonably lower values so that the retry process finishes sooner. The first problem: you cannot go too low, for the same reason as with KeepAlive – you run the risk of mistaking a temporary network slowdown for the database service being down and tearing down perfectly good connections. Secondly, the retries/delays manipulations do nothing for sessions already in a SQL call, because on those sessions there is nothing for TCP to retransmit – they are waiting to be contacted by the server.

TCP Connect Timeouts and Retries: these are helpful to set, but they help only new connections.

Conclusion – there is no really good way to completely avoid client TCP hangs and timeouts when working with RDS, despite the fact that RDS did a pretty good job of failing the database over in just under 3 minutes. You can only somewhat limit the hangs for new connections and for new calls on existing connections. There is little that can be done to help sessions with in-flight calls without hurting your application with a too-low KeepAlive. EDIT 05/27/2017: You can only limit the hangs to some reasonable intervals.


EDIT 05/27/2017: After more testing with various TCP kernel parameters and various connection strings I have arrived at the following settings, which limit TCP hangs to some reasonable values.

Case 1: New connection request to an RDS instance in the process of failover.

When using an EZConnect connection string, sqlplus hangs for ~1m10s before getting ORA-12170: TNS:Connect timeout occurred.

One way to control the new connection timeout is to use a tnsnames-style connection string with (CONNECT_TIMEOUT=10). I will give an example below, after we discuss Case 3.

Case 2: Existing connection, new SQL call to an RDS instance in the process of failover.

In this situation TCP attempts to resend and goes into a retry cycle. The default Linux kernel configuration has tcp_retries2=15. This setting results in sqlplus hanging for ~15m30s before getting ORA-03113: end-of-file on communication channel. If it is set to 5, the retransmit timeout drops to about 13 sec. To change it, as root:

sysctl -w net.ipv4.tcp_retries2=5
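The ~15m30s hang is simply the exponentially backed-off retransmission timeouts adding up, which is why lowering the retry count shortens it so dramatically. To check the value currently in effect before changing it:

# check the current value
sysctl net.ipv4.tcp_retries2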

For persistent changes edit /etc/sysctl.conf

Case 3: Existing connection already in a SQL call when the RDS instance fails over

Without TCP keepalive, in this situation sqlplus hangs forever. Because KeepAlives were considered a controversial subject in TCP design, the application needs to explicitly request keepalive on a socket in order to enable it. Next, even after the application explicitly requests keepalive, the default kernel settings result in the first probe packet being sent only after 2 hours. This means the kernel parameters also have to be changed to send probes more frequently.

Kernel changes:

sysctl -w net.ipv4.tcp_keepalive_time=30  # reduce from default 7200 seconds
sysctl -w net.ipv4.tcp_keepalive_probes=5 # reduce from 9 probes
sysctl -w net.ipv4.tcp_keepalive_intvl=3  # reduce from default 30 seconds

For persistent changes edit /etc/sysctl.conf
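For reference, a sketch of the persistent entries covering both Case 2 and Case 3 with the values used above:

# /etc/sysctl.conf (or a drop-in file under /etc/sysctl.d/)
net.ipv4.tcp_retries2 = 5
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 3

# apply without a reboot
sysctl -p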

To request keepalive we need to use a tnsnames-style connection with the (ENABLE=broken) keyword (the example also uses the CONNECT_TIMEOUT keyword to cover Case 1):

sqlplus 'username/psw@(DESCRIPTION=(CONNECT_TIMEOUT=10)(ENABLE=broken)
(ADDRESS = (PROTOCOL=TCP)(HOST=failover-test.c8ilntf5kdon.us-east-1.rds.amazonaws.com)(PORT = 1800))
(CONNECT_DATA = (SERVICE_NAME=ORCL)))'

With this change the Case 3 forever-hang turns into a 45 sec timeout. This matches the kernel settings above: tcp_keepalive_time (30 sec) plus tcp_keepalive_probes × tcp_keepalive_intvl (5 × 3 = 15 sec) gives 45 seconds to declare the peer dead.
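A quick way to confirm that keepalive is actually armed on the session's socket (a sketch, using the same port as in the connect string above):

# with (ENABLE=broken) the "timer" column shows a keepalive countdown;
# without it an idle connection has no timer at all
ss -nto state established '( dport = :1800 )'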

Categories: AWS, EC2, Oracle RAC, RDS, TCP Timeouts

AWS EC2 CPU (in)consistency – Part 4. A fix: disable half of vCPU?

January 27, 2017

In the prior posts in this series (older posts: part1, part2, part3) I used a simple integer-increment shell test to demonstrate that Amazon EC2 Linux instances exhibit inconsistent vCPU speeds whenever the number of processes actively running on CPU becomes greater than half of the available vCPUs. The performance differences were very big – some processes ran twice as slow as others. I came to the conclusion that this inconsistency is explained by the observed fact that the OS scheduler does not start to rebalance running processes until the number of processes exceeds the number of vCPUs. Only after this point does the scheduler rebalancing kick in, and process speeds become more or less close.

As a next step I tried to influence scheduler behavior by changing the scheduling policy. Current-day Linux uses the Completely Fair Scheduler (CFS) which, without going into real-time options, leaves pretty much only 3 user-controlled options: the SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE policies, controllable via the chrt command (sched-design-CFS).
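For illustration, a sketch of how a policy can be applied with chrt (lc is the test script used later in this post; the invocations are illustrative):

# run the test under the SCHED_BATCH policy (priority must be 0 for non-realtime policies)
chrt --batch 0 lc 8 10

# or switch the current shell (and its future children) to SCHED_IDLE
chrt --idle -p 0 $$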

I tried all three and, without going into extra detail, the end result was that only SCHED_IDLE demonstrated slight rebalancing in process CPU assignments. Apparently, when the current shell was using the SCHED_IDLE policy, other processes present on the system received higher priority and were able to preempt the test processes, thus triggering a rebalance. Obviously, using SCHED_IDLE for normal workloads is not a good idea, so this cannot be considered a viable option.

So, back to square one after the scheduling policy detour.

At this point I started to believe that the Linux OS scheduler's rebalancing strategy has a fundamental flaw: it leaves a process running on the same hyperthread where it started, regardless of whether its “sibling” hyperthread later becomes idle or busy. If all vCPUs were equal then yes, there would be no point in rebalancing processes as long as the number of vCPUs is bigger than the number of processes willing to run on CPU.

If this is true, then what happens when we disable one subset of vCPUs that share the same cores, leaving the other subset online? If you think about it: if one hyperthread-based vCPU1 is already trying to use its core completely to run Process 1, why keep insisting on squeezing in a second hyperthread vCPU2 to run Process 2 on the same core? Why not instead disable vCPU2 and have the remaining vCPU1 serve both Process 1 and Process 2? There is no CPU cache sharing expected here, so any theoretical benefit of hyperthreading is doubtful, while the drawbacks of not rebalancing are obvious.

With this reasoning I proceeded to disable half of the vCPUs:

# determine which vCPUs share the same core
lscpu -a -e
	CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
	0   0    0      0    0:0:0:0       yes
	1   0    0      1    1:1:1:0       yes
	2   0    0      2    2:2:2:0       yes
	3   0    0      3    3:3:3:0       yes
	4   0    0      4    4:4:4:0       yes
	5   0    0      5    5:5:5:0       yes
	6   0    0      6    6:6:6:0       yes
	7   0    0      7    7:7:7:0       yes
	8   0    0      0    0:0:0:0       yes
	9   0    0      1    1:1:1:0       yes
	10  0    0      2    2:2:2:0       yes
	11  0    0      3    3:3:3:0       yes
	12  0    0      4    4:4:4:0       yes
	13  0    0      5    5:5:5:0       yes
	14  0    0      6    6:6:6:0       yes
	15  0    0      7    7:7:7:0       yes

# disable one set of vCPUs (as root)
cd /sys/devices/system/cpu
lscpu -a -e
for i in {8..15} ; do
echo 0 > cpu$i/online
done

lscpu -a -e
	CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
	0   0    0      0    0:0:0:0       yes
	1   0    0      1    1:1:1:0       yes
	2   0    0      2    2:2:2:0       yes
	3   0    0      3    3:3:3:0       yes
	4   0    0      4    4:4:4:0       yes
	5   0    0      5    5:5:5:0       yes
	6   0    0      6    6:6:6:0       yes
	7   0    0      7    7:7:7:0       yes
	8   -    -      -    :::           no
	9   -    -      -    :::           no
	10  -    -      -    :::           no
	11  -    -      -    :::           no
	12  -    -      -    :::           no
	13  -    -      -    :::           no
	14  -    -      -    :::           no
	15  -    -      -    :::           no

# rerun the test
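# lc is the integer-increment shell test script described in Part 3 of this series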
for i in {1..32} ; do
lc $i 10
sleep 3
done | tee parallel_shell_disabled_vCPUs.log

# while the test is running, monitor in another shell:
watch "ps -e -o user,pid,psr,comm,s | grep bash| grep R | sort -n -k3"
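To revert later, the same vCPUs can be brought back online (still as root, in /sys/devices/system/cpu):

for i in {8..15} ; do
echo 1 > cpu$i/online
done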

Here are the results, side by side:

[Chart: process speeds with all vCPUs vs. half of vCPUs disabled, side by side]

As we can see, after the second set of vCPUs was disabled, consistency became much better in the vCPU/2–vCPU range.

While the test was running, with ps I could also see that the PSR field started changing immediately after the number of processes became more than vCPU/2, meaning that the OS scheduler started to rebalance.

So what does all this mean for the average AWS customer? Are we suggesting, as a matter of best practice, to disable half of the vCPUs?

Here we come to a difficult question.

On one hand, the test results are so obvious that the answer seems to be a no-brainer – yes, better to disable them for response time consistency's sake.

On the other hand, this recommendation makes the whole billing situation awkward – Amazon charges by the hour based on instance class, where instance class is tied to the number of vCPUs, so effectively it charges by vCPU. Then why would I pay for disabled vCPUs? And if this situation is real – why isn't everybody complaining?

We must discuss here under what circumstances this effect will NOT be observed.

If we watch the ps PSR field closely while running the test, we notice that the scheduler does a very good job of INITIAL balancing of processes between the available cores. For example, in the N < vCPU/2 range you will never see more than one process per core, and in the vCPU/2 < N < vCPU range the initial placement is still as even as possible. This suggests that if processes regularly go on and off CPU, each new placement is balanced, and the inconsistency should mostly show up for processes that stay on CPU continuously.

To verify this effect I decided to run three more tests where CPU-intensive load is intermixed with some kind of waits.

pipe gzip shell test

# run in bash to build the graph:

( for i in {1..32} ; do
   echo "Running $i parallel gzip";
   for ((k=0;k<$i;k++)); do ( ( dd if=/dev/zero bs=1M count=2048 | gzip -c > /dev/null ) 2>&1 | grep bytes ) &
   done;
   wait ;
   sleep 1;
done)
        Running 1 parallel gzip
        2147483648 bytes (2.1 GB) copied, 15.7105 s, 137 MB/s
        Running 2 parallel gzip
        2147483648 bytes (2.1 GB) copied, 15.9782 s, 134 MB/s
        2147483648 bytes (2.1 GB) copied, 16.0045 s, 134 MB/s
        ...

This runs N background gzip commands taking input via a pipe from dd if=/dev/zero. While most of the elapsed time is spent on the CPU-intensive gzip, the presence of the pipe adds an element of waits.

Here is plotted test result :

[Chart: dd | gzip pipe test results]

As we can see, when a CPU-intensive process has numerous (even brief) waits, the OS scheduler has a chance to rebalance and the results are more consistent than in the integer-increment tests. (But while consistency is better, it is still far from uniform.)

Database CPU-intensive workload – long sql

This test starts N sqlplus sessions in the background, waits for the connections to establish, then runs a CPU-intensive SQL and measures elapsed times. (See script …)

The left chart below shows how SQL elapsed times fluctuated when the number of active sqlplus sessions N was in the vCPU/2 < N < vCPU range.

The middle chart shows that disabling half of the vCPUs considerably reduced the fluctuations by letting the scheduler rebalance, while at the same time maintaining overall throughput.

The right chart shows the comparison for a non-virtualized on-premises environment.

[Charts: long-SQL elapsed times – default vCPUs vs. half of vCPUs disabled vs. on-premises]

Database CPU-intensive workload – short sql

The reason to run this test was that during the long-running SQL testing I observed that rebalancing, when it happened, happened only every couple of seconds. Therefore, for short-running SQLs there may not be enough time for the rebalance to kick in and make a difference.
To test this I changed the SQL to run for less time. As shown below, for SQL workloads of a couple of seconds' duration the disable-half-of-vCPU fix made only a marginal improvement:

[Chart: short-SQL elapsed times with half of vCPUs disabled]

Conclusion

As we can see, vCPU speed inconsistency is workload dependent:

1) when the workload is CPU-only, the observed variability is 100% whenever the number of active processes exceeds vCPU/2 and stays below vCPU. 100% variability means that some processes run twice as slow as others.

2) when the workload is a mixture of CPU and waits (IO, network, IPC communications, etc.) the effect is smaller, but still noticeable.

Potential application impacts: jumpy response times and timeouts.
Potential database situations: bugs where a session starts spinning on CPU while waiting on a latch; the SQL optimizer choosing a CPU-intensive plan such as hash joins.

The behavior is explained by the Linux OS scheduler not being willing to rebalance running processes between vCPUs after the initial balancing. Since AWS provisions a vCPU as an Intel Xeon hyperthread and two hyperthreads share a single physical core, in the absence of rebalancing a process's speed depends on whether or not its vCPU shares a core with another CPU-intensive process.

Preventative measure 1: size the AWS instance big enough that the number of active processes never exceeds vCPU/2 (this in effect means never letting aggregate CPU utilization go above 50%).

Preventative measure 2: if you have to run more parallel active processes than the number of cores, then consider disabling half of the vCPUs to improve system predictability.

AWS EC2 CPU (in)consistency – Part 3. Simple Shell test.

January 26, 2017

 

In this blog post I will describe a simple shell script for measuring CPU consistency.

In my prior tests I used sqlplus sessions running a CPU-intensive SQL in parallel background processes. That approach revealed that there was something strange in the way vCPUs behaved in the AWS EC2 environment. I noticed that whenever the number of parallel sqlplus sessions became one more than vCPU/2, there were always two unlucky sessions which ran substantially slower than the rest – up to 50% slower. Obviously this was not good, because SQL elapsed time consistency is very important. I also noticed that while these parallel processes were running, their per-CPU assignment in “top” never changed. Since we know that AWS EC2 presents a hyperthread as a vCPU, and that an Intel Xeon has two hyperthreads per physical core, it is clear that if two sessions share one core they will run slower than a session running on a dedicated core. The question is: why don't processes move between cores? Isn't it the job of the OS scheduler to give all processes a fair share of CPU usage? This observation meant that there was something fundamental related to OS scheduling which led to inconsistent SQL elapsed times.

If this were true, then this behavior should reveal itself in other situations outside of SQL and the RDBMS.

I decided to try using only a shell script to generate CPU load.
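The full script is behind the link below; here is a minimal sketch of the idea (an illustration, not the author's actual script): each background shell spins on integer increments for a fixed number of seconds and reports how many it managed, plus the CPU it last ran on.

burn() {
  local end=$((SECONDS + $1)) n=0
  while (( SECONDS < end )); do (( n++ )); done
  echo "$n increments in $1 sec, last on CPU $(ps -o psr= -p $BASHPID)"
}
# four concurrent CPU burners, 10 seconds each
for p in 1 2 3 4; do burn 10 & done; wait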

Read more…

AWS RDS and EC2 CPU performance (in)consistency – Part 2

January 10, 2017

After discovering that in my prior tests there was a significant variable factor of sqlplus session connect time (which still needs to be researched, but that is for a later time), I decided to try to isolate this factor so that the test would focus more on SQL elapsed time as a measure of CPU performance.

I modified the script to use a large 15 sec delay for setting up connections, and also to separately report session creation times and SQL execution times. I also printed the “ps” PSR field for the corresponding server processes in the hope of spotting a dependency. The test SQL was also modified in order to increase the elapsed time:

with t as ( SELECT rownum FROM dual CONNECT BY LEVEL <= 200 )
select /*+ ALL_ROWS */ count(*) as cnt from t,t,t,t ;

The full script is listed at the end of this post.
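A minimal sketch of the approach (an illustration only, not the author's full script; the connect string is a placeholder):

#!/bin/bash
# start N sqlplus sessions in the background; each connects, sleeps so that
# all sessions are established, then runs the CPU-intensive SQL with timing on
N=${1:-8}
CONN='username/psw@ORCL'   # placeholder connect string
for ((i = 1; i <= N; i++)); do
  sqlplus -s "$CONN" <<'EOF' &
set timing on
exec dbms_lock.sleep(15)
with t as ( SELECT rownum FROM dual CONNECT BY LEVEL <= 200 )
select /*+ ALL_ROWS */ count(*) as cnt from t,t,t,t ;
exit
EOF
done
wait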

With that, on a c4.4xlarge EC2 instance (i.e. 16 vCPUs / 8 cores / 16 hyperthreads) I observed the following:

When the number of parallel sessions is less than or equal to 8, the SQL elapsed time in all sessions is consistent at about 40 seconds. For example, with 8 sessions:

[Screenshot: 8 parallel sessions – consistent ~40 sec elapsed times]

However, when I add one more session to make it 9, the SQL elapsed time in two sessions out of the 9 increases by 50%:

[Screenshot: 9 parallel sessions – two sessions run ~50% slower]

I interpret this as the additional session starting to share a core with one of the other 8 sessions, thus making those two sessions run slower.

Read more…

AWS RDS and EC2 CPU performance (in)consistency

January 5, 2017

After our company decided to join the public cloud bandwagon and move its databases to AWS, I got curious about what exactly we are getting in terms of CPU performance in RDS or EC2. I asked a friend who had already established an Oracle instance in AWS RDS to run the same CPU-intensive SQL which I had previously used to compare various database platforms, as described here.

The test SQL is very simple:

with t as ( SELECT rownum FROM dual CONNECT BY LEVEL <= 100 )
select /*+ ALL_ROWS */ count(*) as cnt from t,t,t,t ;

[Chart: jigsaw pattern in the RDS throughput chart]

The SQL generates 100 rows and then joins the result set to itself 4 times, producing 100,000,000 records. The beauty of this SQL is that it does not generate any IO, so the elapsed time depends only on CPU and RAM speed.

What my friend observed when running this on RDS was that, in general, CPU performance was what is expected from modern Intel Xeon E5-xxxx processors, with the understanding that the Amazon vCPU count is a hyperthread count and not a real CPU core count. One unusual behavior, however, was a jigsaw pattern where there was a substantial performance drop in the overall growth chart, like in the chart on the right.

On closer examination it turned out that there is a noticeable variation in the test SQL execution elapsed time when the number of parallel sessions goes above 1-2, so the throughput chart would not reproduce itself on different runs.

This led me to investigate AWS vCPU speed consistency by continuously executing the test SQL in parallel sqlplus sessions over a period of time and measuring the elapsed time variations.

Read more…

Categories: AWS, CPU speed, EC2, RDS, Virtualization

Moving 8TB database to Flash Storage

December 7, 2015

In recent years we have all become accustomed to all kinds of virtualization – VMware, Solaris Zones, Solaris Domains, etc. While the mechanics may differ, one common trait remains – try to pile more applications and databases onto a common set of resources.

One flavor of virtualization is Storage Arrays. Multiple databases and applications get their slices of a big storage array in the form of LUNs. As happens when big companies in pursuit of efficiency turn every process into a conveyor, storage provisioning planning is frequently reduced to LUN sizes. In the end you end up with mysterious storage slowdowns when one or more of the “other” apps quietly ramp up IO utilization to the point where the storage array cache becomes saturated. Because you have no visibility into what else is running on the same storage, you are left just wondering. That's the downside of the “virtualize everything” trend.

I will have another post about practical ways to detect present and past storage slowdowns.

Here I wanted to share a remarkable IO improvement we got after moving to a Flash storage array.

Read more…

Using EM 12c repository to find ASM growth rate

January 9, 2013

ASM disk group growth from EM 12c repository

Our company collocates multiple databases on Linux RAC grids to reduce hardware, OS and HA license expenses. All the databases on a grid share a few ASM disk groups.

Sharing disk groups helps to reduce the storage overallocation commonly seen when each database runs on its own dedicated storage, be it file system-based or ASM-based. Of course, the databases continue to grow, but the hope is that with shared storage the overallocation can be reduced, with the extremes canceling each other out.

In this environment, when an ASM disk group's usage does grow, the question becomes how to track which database is causing most of the growth.

Unfortunately, the EM 12c ASM disk group page shows only the last 7 days of individual database sizes in a particular disk group. For capacity planning purposes you need more than that.

One solution is to query the EM 12c repository directly.

Read more…