Archive for the 'DBWR Performance' Category

Over-Configuring DBWR Processes – Part III

Sometimes the comment thread on a blog entry is important enough to warrant yet another blog entry. This is one of those times. In my blog entry called Over-Configuring DBWR Processes Part II I offered test results showing that it really is best to stick with a single writer until that single writer becomes CPU-bound. But first, a disclaimer.

DISCLAIMER: There are ports of Oracle where specialized multiple-DBWR code is implemented for specific hardware architectures (e.g., NUMA). I ought to know; I worked on one of the very first ports (if not the first) to do so: Sequent DYNIX/ptx.

With that disclaimer aside, I am talking here about the Linux ports of 10g and 11g, which have no such hardware specialization.

The first question in that comment thread was:

From what’s written here, it looks like you only tested 1 and 8. Is it possible that the global maximum is actually somewhere between? Maybe 2 processes is slightly faster than 1, for example?

Great question with a simple answer: use 1 writer as long as that single writer has ample bandwidth. That is a truth, but I figured I’d do some runs to milk out a fact or two. I configured 2 DBWR processes and re-ran the workload I describe in Part II of this series. I followed that up with 4 DBWR processes. Hey, it’s all auto-pilot stuff with my setup so doing this was no big deal. Let’s compare the results. Just to summarize, the previous configurations performed as follows:

Configuration     OLTP Throughput (TPM)     Aggregate DBWR CPU Usage (CPU seconds)
1 DBWR                   62.6                             291
8 DBWR                   57.9                             256

And now with the 2 and 4 DBWR cases:

Configuration     OLTP Throughput (TPM)     Aggregate DBWR CPU Usage (CPU seconds)
1 DBWR                   62.6                             291
2 DBWR                   58.7                             273
4 DBWR                   58.4                             269
8 DBWR                   57.9                             256

The way this goes is clear: more DBWR processes hurt throughput. On the other hand, more throughput requires more DBWR work, so the optimal case of 1 DBWR process consumes more CPU, but remember that it is also delivering more throughput. I only provide these CPU numbers to show what percentage of a single CPU each configuration uses. I don't think it is necessary to run the 6 DBWR case. But what about a wildcat configuration, such as multiple DBWR processes with hard processor affinity?

Processor Affinity and Oracle Background Processes
Well, I'd never bind an Oracle background process to a single processor (core) for obvious reasons, but with the Linux taskset(1) command it is possible to bind a running process to a socket. For example, the following command binds pid 13276 to the second socket (cores 2 and 3) of a multi-socket system:

# taskset -pc 2-3 13276
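
Which cores belong to which socket is, of course, platform-specific (this DL585 has 4 dual-core sockets, so cores 2-3 are the second socket). If you want to double-check the mapping before picking a CPU list, something like the following works on Linux kernels that expose the physical id field in /proc/cpuinfo; it prints one line per core showing the socket it lives on:

$ grep -E '^(processor|physical id)' /proc/cpuinfo | paste - -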

Building on that idea, you can grab DBWR processes (or any processes, for that matter) once they are running and pin them to a CPU core or a set of CPU cores. The following stupid bash shell function can be used to do this sort of thing. (Note: since there are tens of thousands of skilled computer programmers out there these days, all shell scripts are considered stupid, of course.) The function takes an argument that it uses in a grep regular expression for globbing ps(1) output, then binds each matching pid to a pair of cores, cycling round-robin through the core pairs. An example invocation would be:

aff ora_dbw

The function itself:
aff()
{
    # Bind every process whose ps(1) listing matches $1 to a pair of cores,
    # stepping through the pairs 0-1, 2-3, 4-5, 6-7 and wrapping
    # (this box has 8 cores, two per socket).
    local GLOB="$1"
    local CORE=0
    local TWIN=0

    ps -ef | grep "$GLOB" | grep -v grep | awk '{ print $2 }' | while read PID
    do
        (( TWIN = CORE + 1 ))

        set -x                                  # echo the taskset command being issued
        sudo taskset -pc ${CORE}-${TWIN} $PID
        set +x

        (( CORE = CORE + 2 ))
        [[ $CORE -eq 8 ]] && CORE=0
    done
}
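
After running the function it is easy to confirm where each writer landed. A minimal check (the pgrep pattern here is just an example; taskset -pc given only a pid prints the current affinity list rather than setting it):

for PID in $(pgrep -f ora_dbw)
do
        taskset -pc $PID
done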

I used this script to bind 2 DBWR processes to a single socket and the numbers didn't change. Neither did spreading the 2 DBWR processes across two different sockets, one each. The fact simply remains that DBWR doesn't benefit from having peers unless it is CPU-bottlenecked.

Over-Configuring DBWR Processes – Part II

In my recent blog entry about over-configuring DBWR processes, I mentioned the processor cache effect of having too many DBWR processes doing work that a single DBWR process could handle. I’d like to provide more detail on the matter.

Backgrounder
Few people really understand what DBWR does for a living. Yes, everyone knows DBWR is responsible for flushing modified buffers from the SGA, but that is really high level. DBWR:

  1. Builds “write batches” by examining the state of buffer headers on the LRU lists, scanning from the LRU end and considering the age of each buffer. At instance boot time, each DBWR process is assigned a set of LRUs to tend.
  2. Manipulates cache buffers chains elements, for example marking a buffer busy while it is in flight for an I/O and marking it written after the flush completes.
  3. Performs I/O using the OSDs available to it. OSDs are routines in the Oracle kernel that are Operating System (port) specific; they are the routines through which Oracle gets support for asynchronous and/or direct I/O via calls to various libraries (e.g., libc, libaio, libodm, etc.).
  4. Posts waiters, such as foreground processes waiting on the free buffer waits event. (One way to watch some of this work being tallied is sketched just after this list.)
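
For the curious, a crude way to watch some of that work being tallied is to poll v$sysstat for the DBWR-related statistics. This is only a sketch; the exact statistic names vary by release:

$ sqlplus -s "/ as sysdba" <<'EOF'
set pagesize 100 linesize 120
col name format a40
-- DBWR flush activity plus the foreground-side demand for clean buffers
select name, value
from   v$sysstat
where  name like 'DBWR%'
   or  name = 'free buffer requested';
EOF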

The problem with over-configuring DBWR processes is two-fold:

  1. Processor Cache Thrashing. DBWR processes have huge working sets since they deal with so many LRU and cache buffers chains structures. Having more of these processes than necessary migrating from CPU to CPU puts a huge dent in the L2 caches with both page table and data footprint (a quick way to watch that migration is sketched just after this list).
  2. Scheduling anomalies. If you have more than 1 DBWR process, you run an increased risk of DBWR processes that can't get on CPU. Consider a checkpoint. All LRUs will have modified buffers that need to be flushed during a checkpoint, and any given LRU can only be cleaned by its boot-time assigned DBWR process. That means a checkpoint requires action from all DBWR processes. It's easier for 1 process to get scheduled on a CPU than for 8 processes on, say, an 8 CPU system.
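
Point 1 is easy to observe for yourself. The psr column of ps(1) reports the CPU a process last ran on; sample it a few times during a run and watch lightly tasked DBWR processes wander from core to core (this is just a quick sketch using the same ora_dbw pattern as elsewhere in this post):

$ ps -eo pid,psr,args | grep ora_dbw | grep -v grep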

Point 2 above is the part I suspect most folks aren't aware of. I think folks envision that the various DBWR processes will only get busy if their peers aren't keeping up. That is not the case. LRUs get evenly loaded by foreground processes: when a process needs a buffer (for file I/O, sort buffer, block cloning, etc.), it first tries the LRU latch for the last LRU it used (which improves cache warmth), and if it can't get that latch with a no-wait get, it jumps to a different latch and does another no-wait get. This continues until it either obtains an LRU latch on a no-wait get or loops around to the first attempted latch and does a wait-get. I recall that in older releases of Oracle the next LRU to try was the very next LRU. That could have changed to something like a random selection, but I don't know (I haven't seen the code in years, but that's about to change). No matter, the LRUs are evenly loaded, and since the LRUs are evenly loaded, the DBWR processes are equally tasked. I'm not very smart, but this I know:

On SMP systems it is not wise to break up a task that a single CPU can do and distribute the work evenly amongst multiple CPUs. You introduce processor scheduling overhead and you hammer the processor caches.

How true, and even more so when the processes share resources as DBWR processes do; they “share” cache buffers chains structures and other goodies. Anyway, it might be smart to do that sort of task decomposition on realtime or other specialized systems, but not on Oracle systems. Let's take a closer look.

Moderate OLTP Load, db_writer_processes=8
I just did an OLTP run on my DL585 8-core system with db_writer_processes set to 8. After the instance was booted, but before the Pro*C OLTP clients were triggered, I ran a ps(1) command grepping for all the DBWR processes. As soon as the 900 second run completed, the driver script ran the same ps(1) command. All output goes to nohup.out, so I can grep for the DBWR processes and see how much CPU time they consumed during the run:

$ grep ora_dbw nohup.out
oracle   14507     1  2 13:46 ?        00:00:01 ora_dbw0_bench1
oracle   14509     1  2 13:46 ?        00:00:01 ora_dbw1_bench1
oracle   14511     1  2 13:46 ?        00:00:02 ora_dbw2_bench1
oracle   14513     1  1 13:46 ?        00:00:01 ora_dbw3_bench1
oracle   14515     1  1 13:46 ?        00:00:01 ora_dbw4_bench1
oracle   14517     1  2 13:46 ?        00:00:01 ora_dbw5_bench1
oracle   14519     1  1 13:46 ?        00:00:01 ora_dbw6_bench1
oracle   14521     1  2 13:46 ?        00:00:01 ora_dbw7_bench1
oracle   14507     1  2 13:46 ?        00:00:23 ora_dbw0_bench1
oracle   14509     1  2 13:46 ?        00:00:23 ora_dbw1_bench1
oracle   14511     1  2 13:46 ?        00:00:23 ora_dbw2_bench1
oracle   14513     1  2 13:46 ?        00:00:23 ora_dbw3_bench1
oracle   14515     1  2 13:46 ?        00:00:23 ora_dbw4_bench1
oracle   14517     1  2 13:46 ?        00:00:23 ora_dbw5_bench1
oracle   14519     1  2 13:46 ?        00:00:23 ora_dbw6_bench1
oracle   14521     1  2 13:46 ?        00:00:23 ora_dbw7_bench1

The ps(1) output shows us that each of the 8 DBWR processes accumulated 23 seconds of CPU. This clearly reflects the fact that all LRUs are loaded evenly and that each DBWR plays an equal part in keeping them clean. So what's the point? Well, 8 DBWR processes accumulating 184 CPU seconds (23 * 8) during a run that offered 7,200 CPU seconds (8 cores * 900 seconds) means this DBWR workload could have been serviced by a single CPU. In fact, 184 CPU seconds is only 20% of a single CPU core during a 900 second run, yet by configuring 8 DBWR processes we spread that workload over multiple CPUs. I know, you're thinking this must be a pretty lightly loaded system, right? No, the processors hover around 50% utilization with peaks of 70%. The datafile I/O rate was some 6,980 IOPS, of which 4,548 per second were writes:

Load Profile                            Per Second       Per Transaction
~~~~~~~~~~~~                       ---------------       ---------------
                  Redo size:          3,222,269.99              4,570.66
              Logical reads:             66,770.61                 94.71
              Block changes:             18,097.81                 25.67
             Physical reads:              2,342.71                  3.32
            Physical writes:              4,548.36                  6.45
                 User calls:             11,795.67                 16.73
                     Parses:              3,597.63                  5.10
                Hard parses:                  0.27                  0.00
                      Sorts:                  1.00                  0.00
                     Logons:                  0.00                  0.00
                   Executes:             10,121.28                 14.36
               Transactions:                704.99
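
As a quick cross-check of the arithmetic above (184 CPU seconds being roughly 20% of one core), here is a throwaway one-liner that sums the DBWR CPU time straight out of ps(1) and expresses it against the 900 second run. It assumes you run it right as the benchmark finishes and that the TIME column is still in [HH:]MM:SS form:

$ ps -eo time,args | awk '$2 ~ /^ora_dbw/ {
        n = split($1, t, ":"); s = 0
        for (i = 1; i <= n; i++) s = s * 60 + t[i]
        total += s
  }
  END { printf "aggregate DBWR CPU: %d sec (%.1f%% of one core over 900s)\n", total, total * 100 / 900 }'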

Moderate OLTP Load, db_writer_processes=1
Let’s take a look at the same workload after a shutdown, database restore and reboot of the instance:

$ grep ora_db noh*
oracle   15652     1  0 14:23 ?        00:00:00 ora_dbw0_bench1
oracle   15652     1 17 14:23 ?        00:02:55 ora_dbw0_bench1

What's that? A single DBWR was able to service the same workload with only 175 CPU seconds? That's the same work with 5% fewer processor cycles. But what did it do to throughput? Well, this is the medium model of the benchmark, where the highest processor utilization peaks were roughly 70%, and reducing CPU utilization on a system that has ample CPU cycles to spare doesn't buy much. In fact, throughput only improved 2.4%, from 46.3 Transactions Per Minute (TPM) to 47.4 TPM. Some would say that is within the margin of statistical error, but I bet those same people don't have a setup where a database restore followed by an instance reboot produces benchmark results repeatable to within .5%, like this setup does. Let's just say it's calibrated. Have you really read this far just to hear about a paltry 2.4% difference in OLTP throughput, though? No, that's not all. The impact of too many DBWR processes is scheduling overhead and poor processor cache efficiency, so we need to look at a run with higher CPU utilization.

Heavy OLTP Load, db_writer_processes=8
The following ps(1) data shows that with the increased OLTP load the 8 DBWR processes each consumed 32 CPU seconds, for an aggregate of 256 CPU seconds. That's still much less than the cycles a single CPU had to offer; 28% of a single CPU, in fact.

$ grep ora_db noh*
oracle   20763     1  0 15:59 ?        00:00:00 ora_dbw0_bench1
oracle   20765     1  0 15:59 ?        00:00:00 ora_dbw1_bench1
oracle   20767     1  0 15:59 ?        00:00:00 ora_dbw2_bench1
oracle   20769     1  0 15:59 ?        00:00:00 ora_dbw3_bench1
oracle   20771     1  0 15:59 ?        00:00:00 ora_dbw4_bench1
oracle   20773     1  0 15:59 ?        00:00:00 ora_dbw5_bench1
oracle   20775     1  0 15:59 ?        00:00:00 ora_dbw6_bench1
oracle   20777     1  0 15:59 ?        00:00:00 ora_dbw7_bench1
oracle   20763     1  3 15:59 ?        00:00:32 ora_dbw0_bench1
oracle   20765     1  3 15:59 ?        00:00:32 ora_dbw1_bench1
oracle   20767     1  3 15:59 ?        00:00:32 ora_dbw2_bench1
oracle   20769     1  3 15:59 ?        00:00:32 ora_dbw3_bench1
oracle   20771     1  3 15:59 ?        00:00:32 ora_dbw4_bench1
oracle   20773     1  3 15:59 ?        00:00:32 ora_dbw5_bench1
oracle   20775     1  3 15:59 ?        00:00:32 ora_dbw6_bench1
oracle   20777     1  3 15:59 ?        00:00:32 ora_dbw7_bench1

So here again the DBWR workload would “fit” within a single CPU. With this configuration, the OLTP throughput was 57.9 TPM. How does that compare to the same workload running against an instance with a single DBWR process?

Heavy OLTP Load, db_writer_processes=1
The following ps(1) output shows that a single DBWR process was able to service the same OLTP load with 291 CPU seconds, still much less (32%) than the bandwidth of a single CPU (there are 900 CPU seconds per core available during the test).

$ grep ora_db noh*
oracle   21426     1  0 16:26 ?        00:00:00 ora_dbw0_bench1
oracle   21426     1 29 16:26 ?        00:04:51 ora_dbw0_bench1

But what about performance? Reducing the DBWR count to 1 increased OLTP throughput 8%, from 57.9 to 62.6 TPM. What was the I/O load like, though? Here is a snippet from statspack:

Load Profile                            Per Second       Per Transaction
~~~~~~~~~~~~                       ---------------       ---------------
                  Redo size:          5,918,344.96              4,355.71
              Logical reads:            125,684.76                 92.50
              Block changes:             33,622.21                 24.74
             Physical reads:              2,066.64                  1.52
            Physical writes:              8,483.06                  6.24
                 User calls:             21,996.51                 16.19
                     Parses:              6,737.41                  4.96
                Hard parses:                  0.27                  0.00
                      Sorts:                  1.03                  0.00
                     Logons:                  0.00                  0.00
                   Executes:             18,852.22                 13.87
               Transactions:              1,358.76

We see that with a single DBWR, the OLTP workload pushed through some 10,549 IOPS, of which 8,483 per second were writes. That means a single DBWR on fairly outdated AMD Opteron processors (DL585) can service 8,483 writes per second using only 32% of a single CPU. So why configure more than one?

Where Did The Time Go?
Were there any significant waits in the 8 DBWR case? Yes, there were. Below are the Top 5 Timed Events from both runs; let's see if anyone can guess which one is which:

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
CPU time                                                     3,148          35.8
log file sync                                1,207,863       2,905      2   33.1
log file switch (checkpoint incomplete)          4,627         828    179    9.4
db file scattered read                         172,724         670      4    7.6
log file parallel write                        532,373         464      1    5.3
          -------------------------------------------------------------

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
CPU time                                                     2,858          31.0
log file sync                                1,076,670       2,509      2   27.2
log file switch (checkpoint incomplete)          4,808       1,939    403   21.0
db file scattered read                         173,566         664      4    7.2
log file parallel write                        491,346         419      1    4.5
          -------------------------------------------------------------

Summary
Did he really write this long a blog entry to discuss some 8% OLTP throughput increase? Well, yes and no. The effect of processor-cache thrashing will be more severe on larger SMP systems; this test system has only 4 sockets (8 cores). I've seen crazy DBWR configurations on 32 and 64 CPU systems account for more than 8%. Even if 8% were a constant, who wants to lose as much as 3 or 5 CPUs' worth of bandwidth in today's IT environment?

Learn How To Obliterate Processor Caches: Configure Lots and Lots of DBWR Processes. Part I.

Oracle8 introduced true multiple DBWR processes, configured with the db_writer_processes parameter. I remember that the maximum value one could set back then was 10; Oracle8i raised that limit, as I recall. These are processes that build their own write batches (scan LRUs, modify cache buffers chains structures, etc.) and perform their own I/O (synchronously or asynchronously). They also post (e.g., via semop(2) on most platforms) processes waiting on free buffer waits. Originally, true multiple DBWR processes were implemented to support NUMA systems that needed processes local to memory and I/O cards to flush buffers, thus eliminating remote I/O operations.

On systems with flat (non-NUMA) memory, the only reason to configure multiple DBWR processes is a single DBWR that is processor bound. Such a condition is simple to detect: free buffer waits will shoot through the roof and a cursory glance at top(1) will show DBWR pegged at the top.
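
Detecting it is a two-command affair. A minimal sketch, assuming a working sqlplus environment (v$system_event keeps running totals, so sample it twice and difference the numbers yourself; time_waited is in centiseconds):

$ sqlplus -s "/ as sysdba" <<'EOF'
col event format a30
select event, total_waits, time_waited
from   v$system_event
where  event = 'free buffer waits';
EOF

$ top -b -n 3 -p $(pgrep -d, -f ora_dbw0)   # is the lone DBWR pegged near 100% of a core?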

I was just giving this NetApp paper a quick read and found the following absolute rule on page 9:

Async I/O is now recommended on all the storage protocols. The recommended value for db_writers_processes is to at least match the number of processor cores on the system.

Besides the fact that the author misspelled the initialization parameter db_writer_processes, this is not very good advice at all. I wouldn't be as critical had they recommended one DBWR process per socket bound via taskset(1), but one per core, floating around freely, is going to impact the L2 caches. The way this works is that the cache buffers LRUs are split amongst the DBWR processes. Each DBWR process maintains its set of LRUs, so even with a light amount of flushing the work will be spread across all the DBWR processes. Here is an absolute: never break up a workload that can be satisfied by a single processor and force it to be distributed across multiple CPUs. When you do that, you simply increase scheduling overhead and increase the amount of work done cache-cold (L2). That is, lightly tasked DBWR processes are more likely to get scheduled on CPUs they haven't recently run on, and thus they run cache cold.

I don't expect anyone to just take my word for it, although it would be nice if at least some did. I do get tired of proving facts that date back to the mid-1990s. Ugh, that sounded grumpy, sorry. I'll see if I can't push a run through and provide some performance data for the pathological configuration (1 per core, no affinity) versus everything from 1 writer through 1-per-core with affinity. I haven't done that in a while. It'll be fun... yippie.

Summary
If anyone gives you an absolute, ignore it. That includes me. As sure as I’m sitting here I will get at least two emails on this topic. One will tell me that they increased db_writer_processes and “got a performance boost.” The other will tell me to remember that sometimes man bites dog.

Fundamentally, however, a single DBWR process on a flat-memory SMP, given kernel async I/O, ample scheduling priority, non-preemption on ports that support it, and ample CPU cycles, will be able to keep the SGA clean. If any of these criteria are not met, then throwing another DBWR at the situation should help, but that is entirely different than starting with 1 per core.
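
If you want to sanity-check the async I/O criterion before reaching for db_writer_processes, a quick peek at the relevant initialization parameters (again, just a sketch assuming sqlplus as sysdba is handy) goes a long way:

$ sqlplus -s "/ as sysdba" <<'EOF'
-- Is the instance even configured for kernel async I/O before we add writers?
show parameter disk_asynch_io
show parameter filesystemio_options
show parameter db_writer_processes
EOF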


