
Using Grid Control 10g to Build a Cumulative Counter Based Graph

Our redo log disk devices were experiencing periodic sudden slowdowns, but diagnosing the issue proved to be tricky. Yes, it is possible to use Unix tools such as iostat -x or sar -d to collect device service times, but in our case the storage is a NAS array with 190 devices mounted by a Volume Manager, which makes it difficult to map exactly which devices are used for the redo logs.

We wanted to get an idea of what redo device speed looks like from the database perspective and how it changes over time.

Unfortunately EM 10g does not have a built-in metric to monitor redo write speed.

The system view v$sysstat maintains the cumulative counters 'redo write time' and 'redo writes'. Since 'redo write time' is recorded in centiseconds, the average LGWR write speed in milliseconds per write since instance startup can be calculated as:


SQL> select /*+ all_rows */ t.value*10/w.value as ms_per_write
     from
       (select value from v$sysstat
        where statistic# = (select statistic# from v$statname where name = 'redo writes')
       ) w,
       (select value from v$sysstat
        where statistic# = (select statistic# from v$statname where name = 'redo write time')
       ) t;

MS_PER_WRITE
------------
9.4988199536

The average redo speed since startup, while helpful, is still far from what we need to verify suspected temporary slowdowns.

For this purpose we need a time graph, and a time graph is something that can be derived from cumulative counters using deltas.
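The idea in its simplest form: take consecutive samples of a cumulative counter and subtract. A minimal conceptual sketch (the table "samples" and its columns "ts" and "counter" are hypothetical names, not repository objects):

-- conceptual only: per-interval delta of a cumulative counter
select ts,
       counter - lag(counter) over (order by ts) as delta
  from samples
 order by ts;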

Unfortunately, EM 10g User-Defined Metrics do not allow calculating deltas on cumulative counter data sources. This is a step back from OEM 9i, where Capacity Planner allowed deltas.

As a workaround we will try using Grid Control User-Defined Reports (keep in mind that EM DB Console does not have the Reports functionality).

The idea is that since the GC Repository collects raw historical measurements, we can use it to calculate the deltas with SQL.

Here is an attempt to build a graph of LGWR write speed, measured in milliseconds per redo write.

First we define a User-Defined Metric named 'redo write speed' to make the EM Agent record the counters periodically. The metric will be of type NUMBER, with a two-column 'SQL Query Output', using the above counters with an additional 'key' field, and a collection frequency of every 5 minutes.
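(The exact UDM query is not shown here; one plausible form, returning the statistic name as the key column and the raw counter as the value, would be something like:)

select n.name as key_value, s.value
  from v$sysstat s, v$statname n
 where s.statistic# = n.statistic#
   and n.name in ('redo writes', 'redo write time');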

Next we give the agent some time to collect measurements. We can then check that the collection has started by going to User-Defined SQL Metrics:

[Screenshot: collection started]

As expected, the metric only goes up, because v$sysstat keeps cumulative counters.

Next we have to do some analysis and find out where in the EM Repository the above data is stored.

As it turns out, the historical metrics are stored in the mgmt_metrics_raw table. Not surprisingly, this table is huge: on my repository it contains over 35 million records, which is probably why it has only a bare minimum of columns. To find the necessary records we need the target_guid and the metric_guid.
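(As an aside, if you know the target name, the target_guid can also be looked up directly; a quick sketch, assuming select access to sysman.mgmt_targets:)

select target_name, target_type, target_guid
  from sysman.mgmt_targets
 where target_name = 'ewptsdb.prod';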

This is where the mgmt$metric_current view comes in handy. It stores only the latest available measurements, so it is much smaller, and probably for this reason it has additional columns with human-readable target and metric names. In this view our user-defined metric has metric_name='SQLUDMNUM' and column_label equal to the name we gave the metric when we defined it in the GUI.


select target_guid, metric_guid, metric_label, collection_timestamp,
value, key_value, key_value2
from mgmt$metric_current t
where target_name = 'ewptsdb.prod'
and metric_name = 'SQLUDMNUM'
and column_label = 'redo write speed';

TARGET_GUID METRIC_GUID METRIC_LABEL             COLLECTION_TIMESTAMP  VALUE    KEY_VALUE        KEY_VALUE2
----------- ----------- ------------------------ --------------------- -------- ---------------- ----------------
1E..52B     DEE...6CB   User-Defined SQL Metrics 11/1/2010 12:03:25 PM 38167788 redo write speed redo write time
1E..52B     DEE...6CB   User-Defined SQL Metrics 11/1/2010 12:03:25 PM 24181791 redo write speed redo writes

(output edited to fit page)

The view appears to have everything we need to extract the necessary records from mgmt_metrics_raw: target_guid and metric_guid. However, there is one missing piece: key_value. Without it we cannot distinguish between 'redo writes' and 'redo write time' in the raw table.

To find it we go to mgmt_metrics_composite_keys:


select * from mgmt_metrics_composite_keys t where key_part1_value = 'redo write speed';
COMPOSITE_KEY      TARGET_GUID      KEY_PART1_VALUE   KEY_PART2_VALUE
------------------ ---------------- ----------------- -----------------
665150DAFEDA1..EE7 1EE13EF90AA..52B redo write speed  redo write time
B940B603BF8C4..99D 1EE13EF90AA..52B redo write speed  redo writes
665150DAFEDA1..EE7 0F808E7DB8A..639 redo write speed  redo write time
B940B603BF8C4..99D 0F808E7DB8A..639 redo write speed  redo writes
(output edited to fit page)

And finally we can extract our records from the raw table:


with ids as
( select c.target_guid, c.metric_guid, k.composite_key, c.key_value2
    from mgmt$metric_current c, mgmt_metrics_composite_keys k
   where c.metric_name = 'SQLUDMNUM'
     and c.column_label = 'redo write speed'
     and c.target_name = 'ewptsdb.prod'
     and k.key_part1_value = c.column_label
     and c.target_guid = k.target_guid
     and c.key_value2 = k.key_part2_value
)
select r.collection_timestamp, r.value, ids.key_value2
  from mgmt_metrics_raw r, ids
 where r.metric_guid = ids.metric_guid
   and r.target_guid = ids.target_guid
   and r.key_value = ids.composite_key
 order by r.collection_timestamp, ids.key_value2;

COLLECTION_TIMESTAMP      VALUE KEY_VALUE2
-------------------- ---------- ---------------
...
10/31/2010 10:03:25    35346431 redo write time
10/31/2010 10:03:25    18325516 redo writes
10/31/2010 10:08:25    35369837 redo write time
10/31/2010 10:08:25    18337582 redo writes
10/31/2010 10:13:25    35393200 redo write time
10/31/2010 10:13:25    18350390 redo writes
10/31/2010 10:18:25    35416604 redo write time
10/31/2010 10:18:25    18363342 redo writes
...

Notice how the measurements come in pairs. This is because we defined a two-element User-Defined Metric with a single SQL statement, so whenever the Agent collects the data, both elements get the same timestamp. This is important: knowing this, we can use the LAG function to "peek" into the previous records and compute the deltas.


with ids as
( select c.target_guid, c.metric_guid, k.composite_key, c.key_value2
    from mgmt$metric_current c, mgmt_metrics_composite_keys k
   where c.metric_name = 'SQLUDMNUM'
     and c.column_label = 'redo write speed'
     and c.target_name = 'ewptsdb.prod'
     and k.key_part1_value = c.column_label
     and c.target_guid = k.target_guid
     and c.key_value2 = k.key_part2_value
)
-- rows arrive in pairs ordered by (timestamp, key_value2);
-- 'redo write time' sorts before 'redo writes', so on a 'redo writes' row:
--   lag1 = 'redo write time' at the same timestamp
--   lag2 = 'redo writes'     at the previous timestamp
--   lag3 = 'redo write time' at the previous timestamp
-- hence ms_per_write = (delta of write time, in cs) * 10 / (delta of writes)
select collection_timestamp,
       (lag1_value - lag3_value)*10/(value - lag2_value) as ms_per_write
from (
  select r.collection_timestamp, r.value, ids.key_value2,
         --
         lag(r.value,1) over (order by r.collection_timestamp, ids.key_value2) as lag1_value,
         lag(ids.key_value2,1) over (order by r.collection_timestamp, ids.key_value2) as lag1_key_value2,
         lag(r.collection_timestamp,1) over (order by r.collection_timestamp, ids.key_value2) as lag1_collection_timestamp,
         --
         lag(r.value,2) over (order by r.collection_timestamp, ids.key_value2) as lag2_value,
         lag(ids.key_value2,2) over (order by r.collection_timestamp, ids.key_value2) as lag2_key_value2,
         lag(r.collection_timestamp,2) over (order by r.collection_timestamp, ids.key_value2) as lag2_collection_timestamp,
         --
         lag(r.value,3) over (order by r.collection_timestamp, ids.key_value2) as lag3_value,
         lag(ids.key_value2,3) over (order by r.collection_timestamp, ids.key_value2) as lag3_key_value2,
         lag(r.collection_timestamp,3) over (order by r.collection_timestamp, ids.key_value2) as lag3_collection_timestamp
    from mgmt_metrics_raw r, ids
   where r.metric_guid = ids.metric_guid
     and r.target_guid = ids.target_guid
     and r.key_value = ids.composite_key
   order by r.collection_timestamp, ids.key_value2
)
where key_value2 = 'redo writes';

COLLECTION_TIMESTAMP MS_PER_WRITE
-------------------- ------------
...
10/31/2010 10:03:25  21.295074758
10/31/2010 10:08:25  19.398309298
10/31/2010 10:13:25  18.240943160
10/31/2010 10:18:25  18.069796170
10/31/2010 10:23:25  18.131868131
10/31/2010 10:28:25  15.837384626
10/31/2010 10:33:24  16.331546981
10/31/2010 10:38:25  12.770383770
...

Bingo!

This is exactly what we wanted.

As we can see, 15-20ms of latency on a redo device is not at all stellar performance (and this is why this whole exercise started in the first place). Ideally, we would like to see something in the single digits. Redo speed affects the database as a whole, because any change to the database has to be recorded in the redo logs. Most redo writes are buffered in the log buffer, but on commit the database does not return control to the application until the redo has been written to disk.
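As a side note, a quick way to spot this commit-time pressure is to count the sessions currently waiting on the 'log file sync' event; a minimal sketch, run as a suitably privileged user:

select count(*) as waiters
  from v$session
 where event = 'log file sync'
   and state = 'WAITING';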

But let’s get back to our exercise.

Now we can go to the Reports section in Grid Control and use our SQL to create a User-Defined Report with the graph.

When we do this, we may get ORA-00942: table or view does not exist.

This is because Grid Control Reporting runs queries under the MGMT_VIEW account.

To overcome the error we need to grant the MGMT_VIEW user select access to the tables referenced in the SQL:


grant select on sysman.mgmt_metrics_raw to mgmt_view;
grant select on sysman.mgmt$metric_current to mgmt_view;
grant select on sysman.mgmt_metrics_composite_keys to mgmt_view;

In the Elements tab of the Report Definition we add a 'Chart from SQL' element and, on its 'Set Parameters' page, paste the above SQL with a slight modification: we use the ??EMIP_BIND_TARGET_GUID?? dynamic parameter so the report can be used against different database targets, as shown below.
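The modification itself is small: the hard-coded target_name predicate in the ids subquery is replaced with the GUID bind, roughly like this:

...
 where c.metric_name = 'SQLUDMNUM'
   and c.column_label = 'redo write speed'
   and c.target_guid = ??EMIP_BIND_TARGET_GUID??
...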

And as a final touch, to allow a report viewer to pick a time range, we add this as the last line of the SQL:


…
and collection_timestamp between ??EMIP_BIND_START_DATE?? and ??EMIP_BIND_END_DATE??

And finally we have a nice milliseconds-per-redo-write graph in the GC Report:

[Graph: milliseconds per redo write over time]

For those of you wondering what happened to the redo speed at about 4:30pm: that is the moment when our storage was converted from EMC SRDF Synchronous transfer mode to Adaptive Copy-Disk mode.

As you can see, the change made a big difference: from 10-20ms down to 2-3ms per write.

For online redo log devices this is a significant improvement.

Prior to the SRDF change, our database routinely had up to 80 sessions waiting for 'log file sync' to complete.

After the change this wait was reduced by a factor of 3-4, and the number of active sessions dropped by 40%.

And last but not least, VCS is not triggering failovers anymore (knock on wood!).

Cheers
