StarQuest Technical Documents

Case Study: Troubleshooting Slow Replication (due to Rollback behavior)

Last Update: 27 December 2021
Product: SQDR & SQDR Plus
Version: 5.22 and later
Article ID: SQV00PL065

Abstract

This technical document describes the steps taken to troubleshoot a throughput issue at a customer site. The cause and solution for this particular incident are rather specific, but the techniques used to identify the cause are useful as an example.

Often the cause of problems are related to the tier 4 destination - e.g. the destination database is out of log or disk space, or a DBA has inadvertantly removed a primary key or unique index on the destination, resulting in updates having to do a full table space.

Another cause is an environmental problem on tier 2 or 3 - disk or memory starvation, full logs for the Db2 staging database, etc.

However, in this case there were no issues on tiers 2, 3, or 4. Through the investigative methods described below, we were able to determine that the cause was a large number of single row rollbacks performed on a particular table by an SQL job on the host, and to recommend a solution.

Solution

A replication solution from an IBM i (iSeries) source that had been working without problems for 2 years showed as congested and the end users of the destination reported that they had not received data for several days.

Using Control Center, we could see:

  • There were no errors in the agent diagnostics.
  • The summary for the agent showed that the SQDR Plus agent was reading from a journal receiver from around midnight several days earlier, and though the record count was moving, it was proceeding very slowly - it would take 10-15 minutes to advance 3 seconds.
  • Looking at the statistics from the time indicated by summary, we saw that T1 Read KBytes had increased and stayed maxed out at 80GB per hour (i.e. the agent was reading a lot of data from the host, and there is no network problem) but T1 Log Entries showed that the agent was only processing a few thousand records per hour.

In a terminal (green screen) session:

  • Using WRKOBJ <receiver-library/*ALL *JRNRCV, we can see that there a total of 39 journal receivers (each ~900mb) waiting to be processed, and quite a few from the time shown in the summary - i.e. there had been a lot of database activity at midnight several nights earlier.
  • Using WRKACTJOB & WRKSYSSTS, we can see that there are no CPU or RAM issues on the IBM i system.

The Agent summary displays the journal receiver and position that the agent is currently working on. Using DSPJRN to examine the journal receiver at that point, we could see that a particular job was performing the following activity on a particular table, over and over again: "start transaction", "insert". "the undo-delete", and then "rollback".

We made the following changes to the agent config:

pollSleepTimer from 1 sec -> 500
added property rollbackLookAhead - false

The meaning of these properties is:

pollSleepTimer: Specifies the interval (in milliseconds) at which the Journal Reader should poll the journal for changes. If the parameter is not present, the default value is 5000. For an Oracle source, the recommended value is 1; this property is automatically set in the configuration when an Oracle agent is created.

rollbackLookAhead: Setting this property to false disables the default look-ahead behavior. This may be desirable if there are there are many small (single row) rollbacks in the application usage pattern rather than the more typical pattern of large rollbacks.

This helped a bit, but not enough to overcome the backlog of journal receivers that had accumulated.

Because of the large number of tables and the size of some of them, a baseline of the entire solution was not practical. But now that we know the name of the problem table (displayed in DSPJRN), we realized that it was a small table (~100k rows), so running a baseline for that one table would be acceptable.

We used the Control Center to take the table offline by selecting the subscription and choosing Offline from the Resource menu. That made a big difference and we started seeing more progress in the Agent summary.

After several hours, the agent was caught up - i.e. the summary showed it was processing a recent journal receiver, and changes for the other tables had been received and applied.

After putting the table online (by selecting the subscription and choosing Online from the Resource menu), the agent flagged a Baseline Required condition. Depending on the I/R group settings, the baseline can happen automatically or has to be run manually.

The final task was to contact the owner of the rogue SQL job that created all the journal receiver records, and ensure that it would not happen again.

 

 

 



DISCLAIMER

The information in technical documents comes without any warranty or applicability for a specific purpose. The author(s) or distributor(s) will not accept responsibility for any damage incurred directly or indirectly through use of the information contained in these documents. The instructions may need to be modified to be appropriate for the hardware and software that has been installed and configured within a particular organization.  The information in technical documents should be considered only as an example and may include information from various sources, including IBM, Microsoft, and other organizations.