Archive for the ‘Oracle RAC’ Category

Testing Client TCP Timeouts with Multi-AZ Oracle RDS Failover

While preparing for recent “RAC Benefits” presentation I did a quick test to see how a client behaves when Multi-AZ Oracle RDS instance fails in the middle of sql execution or between consecutive sql executions.

For the client I took a standard AMI provided by my company, created an EC2 instance and installed Oracle Client v

For the db server I created m4.large RDS instance with Multi-AZ option.

Then I connected sqlplus to the RDS and ran following pl/sql block to simulate long running SQL execution :    
execute DBMS_LOCK.sleep(30)

Then I triggered RDS failover to another AZ  and re-executed the pl/sql again in the same session.

RDS failover took 2 min 49 sec as can be seen in AWS Console:


The client situation was not as good though.

The next call submitted in the same sqlplus session hang for more than 15 minutes :


While the sqlplus was hanging, netstat on the client showed that TCP socket it was using went into TCP Retries :netstat_timeout

The client got out of TCP wait only after 15 min 35 sec, even though RDS finished failover in only 2:49 min.

Another test was to trigger RDS failover in the middle of running pl/sqlsql call.

In this case the client went into indefinite hang.

This is something definitely requiring attention in application HA planning.
The reason I did this test was to compare client behavior in Multi-AZ RDS to RAC failover situation.

With RAC we have Virtual IP, Fast Connection Failover (FCF), Fast Application Notification (FAN) to help the client session to avoid getting into TCP timeouts and TCP hangs.

There are 3 places where TCP retries and hangs can happen when the other side went down:

  1. new connection request
  2. new sql call in existing connection (like the 15+ mins hang demonstrated above))
  3. already executing call

The first case can be dealt with by using various connection retry and connection timeout parameters set either in tnsnames.ora or in connection URL

The second case in RAC is taken care of by the fact that connections are made to VIP and not to physical IP. When the VIP is transferred to another node, it is associated with new MAC so the client on next call gets immediate “connection reset” message due to MAC mismatch, instead of going into TCP retries.

The third case requires using event-driven mechanism like FCF with FAN. The reason is that when the client already submitted sql request to the db, it is now waiting to be contacted by the server with results. So TCP on the client is not even trying to help with retries. This is why RAC solution is for RAC to send FAN event to the client telling it that db instance is no longer there. This is done with the help of ONS daemons on the server side and a special FCF thread running on the client UCP connection pool which listens on FAN events coming from the server and in response marking client connection as invalid.

Above instrumentation is lacking in RDS database. This is the reason for observed client hang.

What is the solution then if we use RDS ?

A proper solution would be event-driven approach where RDS would send some kind of message in case of failover. In the absence of this message we have to rely on the client to detect failure.

Traditionally people were trying to either manipulate various TCP socket timeouts and retries or use TCP KeepAlive. None of which is a good solution for following reasons:

TCP KeepAlive: this tries to detect failure by periodically sending a test packet on the socket. When the test packet does not go through this must mean that remote database is no longer alive. The problem with this approach is that it can not distinguish between temporary network glitch and database service truly going down. In fact TCP KeepAlive goes against TCP retry philosophy where TCP tries to hide temporary network problems by a way of retransmits. If you set TCP KeepAlive too low you may end up destroying perfectly good database connections. Equally important, setting KeepAlive effectively places a limit on how long your sql can run. If your sql runs too long, the KeepAlive will terminate connection. EDIT 05/27/2017: more testing shows that tcp keepalive does not place a limit on long running sql. The server tcp stack aknowleges keepalive  packet without involving database process.

TCP Read and Write Timeouts and Retries: The idea is to reduce number of retries and delays between retries to some reasonable lower values so that in the end the retry process would finish sooner. First problem – you can not go too low for the same reason as with KeepAlive because then you will run a risk of mistaking temporary network slowdown for database service down and tear down perfectly good connections. Secondly, the retries/delays manipulations do nothing for sessions already in sql call – because on those sessions there is nothing for TCP to retransmit – they are waiting to be contacted by the server.

TCP Connect Timeouts and Retries – these are helpful to set. But these help only new connections.

Conclusion – there is no really good way to completely avoid Client TCP hangs and timeouts when working with RDS, despite the fact that RDS did pretty good job in failing over database in just under 3 minutes. You can only somewhat limit the hangs for new connections and new calls on existing connections. There is little that can be done to help sessions with in-flight calls without hurting your app by too low KeepAlive. EDIT 05/27/2017: You can only limit the hangs to some reasonable intervals.

EDIT 05/27/2017: After more testing with various tcp kernel parameters and various connection strings I have arrived at following settings that limit tcp hangs to some reasonable values

Case 1: New connection request to an RDS instance in the process of failover.

When using EZConnect connection string the sqlplus hangs for ~1m10s before getting ORA-12170: TNS:Connect timeout occurred.

One way to control new connection timeout is to use tnsnames-style connection string with (CONNECT_TIMEOUT=10). I will give an example below after we discuss Case 3.

Case 2: Existing connection, New sql call to RDS instance in the process of failover.

In this situation TCP attempts to resend and goes into retries cycle. Default Linux kernel configuration has tcp_retries2 =15. This setting results in sqlplus hanging for ~15m30s before getting ORA-03113: end-of-file on communication channel. If it is set to 5 then retransmit timeout drops to 13 sec. To change, as root:

sysctl -w net.ipv4.tcp_retries2=5

For persistent changes edit /etc/sysctl.conf

Case 3: Existing connection already in sql call when RDS instance failovers

Without tcp keepalive in this situation sqlplus hangs forever. Because KeepAlives were considered controversial subject in TCP design, in order to enable keepalive on a socket the application needs to explicitly request it. Next, even after application explicitly requests keepalive, the default kernel settings result in first probe packet to be sent only after 2 hours. This means that kernel parameters also have to be changed to send probes more frequently.

Kernel changes:

sysctl -w net.ipv4.tcp_keepalive_time=30  # reduce from default 7200 seconds
sysctl -w net.ipv4.tcp_keepalive_probes=5 # reduce from 9 probes
sysctl -w net.ipv4.tcp_keepalive_intvl=3  # reduce from default 30 seconds

For persistent changes edit /etc/sysctl.conf

To request keepalive we need to use tnsnames-style connection with (ENABLE=broken) keyword (the example also uses CONNECT_TIMEOUT keyword to cover Case 1):

sqlplus 'username/psw@(DESCRIPTION=(CONNECT_TIMEOUT=10)(ENABLE=broken)

With this change the Case 3 forever hang turns into 45 sec timeout.

Categories: AWS, EC2, Oracle RAC, RDS, TCP Timeouts