I’m sure this information is not new to many folks, but there might be some interesting stuff in this post…
I’m doing OLTP performance testing using a DL585, 32GB, Scalable NAS, RHEL4 (Real, genuine RHEL4) and 10gR2. I’m doing some oprofile analysis on both the NFS client and server. I’ll blog more about oprofile soon.
This post will sound a bit like a rant. Did you know that the Linux 2.5 Kernel Summit folks spent a bunch of time mulling over features that have been considered absolutely critical for Oracle performance on Unix systems for over 10 years! Take a peek at this partial list and chuckle with me please. After the list, I’m going to talk about DBWR. A snippet from the list:
- Raw I/O has a few problems that keep it from achieving the performance it should get. Large operations are broken down into 64KB batches…
- A true asynchronous I/O interface is needed.
- Shared memory segments should use shared page tables.
- A better, lighter-weight semaphore implementation is needed.
- It would be nice to allow a process to prevent itself from being scheduled out when it is holding a critical resource.
Yes, the list included such features as eliminating the crazy smashing of Oracle multiblock I/O into little bits, real async I/O, shared page tables and non-preemption. That’s right. Every Unix variant worth its salt, in the mid to late 90s, had all this and more. Guess how much of the list is still not implemented. Guess how important those missing items are. I’ll blog some other time about the lighter-weight semaphore and non-preemption that fell off the truck.
I Want To Talk About Async I/O
Prior to the 2.6 Linux Kernel, there was no mainline async I/O support. Yes, there were sundry bits of hobby code out there, but really, it wasn’t until 2.6 that async I/O worked. In fact, a former co-worker (from the Sequent days) did the first HP RAC Linux TPC-C and reported here that the kludgy async I/O that he was offered bumped performance 5%. Yippie! I assure you, based on years of TPC-C work, that async I/O will give you much more than 5% if it works at all.
So, finally, 2.6 brought us async I/O. The implementation deviates from POSIX, which is a good thing. However, it doesn’t deviate enough. One of the most painful aspects of POSIX async I/O, from an Oracle perspective, is that each call can only initiate writes to a single file descriptor. At least the async I/O that did make it into the 2.6 Kernel is better in that regard. With the io_submit(2) routine, DBWR can send a batch of modified buffers to any number of datafiles in a single call. This is good. In fact, this is one of the main reasons Oracle developed the Oracle Disk Manager (ODM) interface specification. See, with odm_io(), any combination of reads and writes, whether sync or async, to any number of file descriptors can be issued in a single call. Moreover, while initiating new I/O, prior requests can be checked for completion. It is a really good interface, but it was only implemented by Veritas, NetApp and PolyServe. NetApp’s version died because it was tied too tightly to DAFS, which is dead, really dead (I digress). So, yes, ODM (more info in this, and other papers) is quite efficient at completion processing. Anyone out there look at completion processing on 10gR2 with the 2.6 libaio? I did (a long time ago, really).
Here is a screen shot of strace following DBWR while the DL585 is pumping a reasonable I/O rate (approx 8,000 IOPS) from the OLTP workload (click the graphic for better display):
Notice anything weird? There are:
- 38.8 I/O submission calls (io_submit() batches) per CPU second
- 4,872 I/O completion-processing calls (io_getevents()) per CPU second
- 7,785 wall-clock time checks (times() plus gettimeofday()) per CPU second
Does Anyone Really Know What Time It Is?
Database writer, with the API currently in use (when no ODM is in play), is doing what we call “snacking for completions”. This happens for one of many reasons. For instance, a completion check for any number of completions may come back with only 1 or 2. What’s with that? If DBWR just flushed, say, 256 modified buffers, why is it taking action on just a couple of completions? Waste of time. It’s because the API offers no more flexibility than that. On the other hand, the ODM specification allows for blocking on a completion check until a certain number of requests, or a certain request, is complete, with an optional timeout. And like I said, that completion check can be done while already in the kernel initiating new I/O.
And yes, DBWR alone is checking wall-clock time with a combination of times(2) and gettimeofday(2) at a rate of 7,785 calls per CPU second! Wait stats force this. The VOS layer is asking for a timed OS call. The OS can’t help it if DBWR is checking for I/O completions 4,872 times per CPU second, just to harvest I/Os from some 38.8 batched writes per CPU second… ho hum. You won’t be surprised when I tell you that the Sequent port of Oracle had a user-mode gettimeofday(). We looked at it this way: if Oracle wants to call gettimeofday() thousands of times per second, we might as well make it really inexpensive. It is a kernel-mode gettimeofday() on Linux, of course.
What can you do about this? Not much really. I think it would be really cool if Oracle actually implemented (Unbreakable 2.0) some of the stuff they were pressing the Linux 2.5 developers to do. Where do you think the Linux 2.5 Summit got that list from?
What? No Mention of ASM?
As usual, you should expect a tie-in with ASM. Folks, unless you are using ASMLib (and therefore have my pity), ASM I/Os are fired out of the normal OSDs (Operating System Dependent code), which is libaio on RHEL4. What I’m telling you is that ASM is storage management, not an I/O library. So, if a db file sequential read, or a DBWR write, is headed for an offset in an ASM raw partition, it will be done with the same system call as it would be if it were on a CFS or NAS.
PolyServe Offers ODM, Why Even Look At Libaio?
Well, I’m testing Oracle on NAS. The only PolyServe in that paradigm is the File Serving Utility for Oracle (NAS). The database servers are just plain-vanilla Red Hat systems running RHEL4. But even if that weren’t the case, I would still look at this sort of stuff. In fact, I look at OCFS and GFS. Unlike some unmentioned folks out there who gleefully cast stones without bothering to measure first, I refuse to just believe our stuff is good. I test both sides. Finally, it turns out that since the 2.6 io_submit() call operates on multiple file descriptors, it provides roughly 6% better performance than our ODM library on 100% processor-saturated workloads (OLTP only). What? When was the last time someone working for a software house admitted their product was anything other than perfect? Yes, our ODM library does much more than just I/O, given the clusterwide I/O monitoring capabilities, and it did indeed offer a clear performance improvement over that junky pre-2.6 “async I/O” stuff. But with the advent of the io_submit() call, I am finding a slight edge over our ODM. There, a little cleansing honesty. I bet you wish there was more of it out there! Do our customers run 100% processor-bound systems? No. Do they love the cluster-wide I/O monitoring? Yes. Will we fix the performance issue? In time, yes.