Network Appliance OnTap GX–Specialized for Transaction Logging.

Density is Increasing, But Certainly Not That Cheap
NetApp’s SEC 10-Q filing for the quarter ending October 2006 contains a very interesting prediction. I was reading this post on StorageMojo about Isilon and saw this quote from the SEC filing (emphasis added by me):

According to International Data Corporation’s (IDC’s) Worldwide Disk Storage Systems 2006-2010 Forecast and Analysis, May 2006, IDC predicts that the average dollar per petabyte (PB) will drop from $8.53/PB in 2006 to $1.85/PB in 2010.

Yes, NetApp is telling us that IDC thinks we’ll be getting storage at $8.53 per petabyte within the next three years. Yippee! Here is the SEC filing if you want to see for yourself.

We Need Disks, Not Capacity
Yes, drive density is on the way up, so regardless of how far off the mark NetApp’s IDC quote is, we are going to continue to get more capacity from fewer little round brown spinning things. That doesn’t bode well for OLTP performance. I blogged recently on the topic of choosing the correct real estate from disks when laying out your storage for Oracle databases. I’m afraid it won’t be long until IT shops force DBAs to make bricks without straw by assigning, say, 3 disks for a fairly large database. Array cache to the rescue! Or not.

Array Cache and NetApp NVRAM Cache Obliterated With Sequential Writes
The easiest way to completely trash most array caches is to perform sequential writes. Well, for that matter, sequential writes happen to be the bane of NVRAM cache on Filers too. No, Filers don’t handle sequential writes well. A lot of shops get a Filer and dedicate it to transaction logging. But wait, that is a single point of failure. What to do? Get a cluster of Filers just for logging? What about Solid State Disk?

Solid State Disk (SSD) price/capacity is starting to come down to the point where it is becoming attractive to deploy SSD for the sole purpose of offloading the sequential write overhead generated by Oracle redo logging (and, to a lesser degree, TEMP writes too). The problem is that they are SAN devices, so how do you provision them so that several databases can log on the SSD? For example, say you have 10 databases that, on average, are each thumping a large SAN array cache with 4MB/s, for a total sequential write load of 40MB/s. Sure, that doesn’t sound like much, but to a 4GB array cache that means a complete recycle every 100 seconds or so. Also, remember that buffers in the array cache are pinned while being flushed back to disk. That pain is certainly not helped by the fact that the writes are happening to fewer and fewer drives these days as storage is configured for capacity instead of IOPS.

Remember, most logging writes are 128KB or less, so a 40MB/s logging payload is derived from some 320, or more, writes per second. Realistically though, redo flushing on real workloads doesn’t tend to benefit from the maximum theoretical piggy-back commit Oracle supports, so you can probably count on the average redo write being 64KB or less—or a write payload of 640 IOPS. Yes, a single modern drive can satisfy well over 200 small sequential writes per second, but remember, LUNs are generally carved up such that there are other I/Os happening to the same spindles. I could go on and on, but I’ll keep it short: redo logging is tough on these big “intelligent” arrays. So offload it. Back to the provisioning aspect.
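
Here is that arithmetic spelled out as a quick back-of-the-envelope sketch in Python. Every figure in it is simply the assumption from the paragraph above (10 databases at 4MB/s each, a 4GB array cache, 128KB versus 64KB average redo writes), not a measurement:

    # Back-of-the-envelope arithmetic using the figures above -- assumptions, not measurements.
    MB = 1024 * 1024
    GB = 1024 * MB

    databases   = 10
    redo_per_db = 4 * MB          # ~4MB/s of redo per database
    array_cache = 4 * GB          # the array cache we are trying to offload

    total_redo = databases * redo_per_db            # aggregate sequential write load (40MB/s)
    print("Aggregate redo load : %d MB/s" % (total_redo // MB))
    print("4GB cache recycled every %.0f seconds" % (array_cache / total_redo))

    # IOPS implied by the average redo write size
    for avg_write in (128 * 1024, 64 * 1024):       # 128KB best case, ~64KB realistic
        print("At %3dKB writes : %d writes/second" % (avg_write // 1024, total_redo // avg_write))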

Carving LUNs. Lovely.
So if you decide to offload just the logging aspect of 10 databases to SSD, you have to carve out a minimum of 20 LUNs (2 redo logs per database) and zone the Fibre Channel switch so that you have discrete paths from the servers to their raw chunks of disk. Then you have to fiddle with raw partitions on 10 different servers. Yuck. There is a better way.

SSD Provisioning Via NFS
Don’t laugh—read on. More and more problems, ranging from software provisioning to the widely varying unstructured data requirements today’s applications are dealing with, keep pointing to NFS as a solution. Provisioning very fast redo logging—and offloading the array cache while you are at it—can easily be done by fronting the SSD with a really small File Serving Cluster. With this model you can provision those same 10 servers with highly available NFS, because if a NAS head in the File Serving Utility crashes, 100% of the NFS context is failed over to a surviving node transparently—and within 20 seconds. That means LGWR file descriptors for redo logs remain completely valid after a failover. It is 100% transparent to Oracle. Moreover, since the File Serving Utility is symmetric clustered storage—unlike clustered Filers such as OnTap GX—the entire capacity of the SSD can be provisioned to the NAS cluster as a single, simple LUN. From there, the redo logging space for all those databases is just files in a single NFS-exported filesystem—fully symmetric, scalable NFS. The whole thing can be done with one vendor too, since Texas Memory Systems is a PolyServe reseller. But what about NFS overhead and 1GbE bandwidth?

NFS With Direct I/O (filesystemio_options=directIO|setall)
When the Oracle database—running on Solaris, HP-UX or Linux—opens redo logs on an NFS mount, it does so with Direct I/O. The call overhead is insignificant for small sequential writes when using Direct I/O on an NFS client. The expected surge in kernel-mode cycles due to NFS overhead really doesn’t happen with simple positioning and read/write calls—especially when the files are opened with O_DIRECT (or directio(3C) on Solaris). What about latency? That one is easy. LGWR will see 1ms service times 100% of the time, no matter how much load is placed on the down-wind SSD. And bandwidth? Even without bonding, 1GbE is sufficient for logging, and these SSDs (I’ve got them in my lab) handle requests in 1ms all the way up to full payload, which (depending on model) goes up to 8 x 4Gb FC—outrageous!
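
To illustrate the sort of access pattern I’m describing, here is a minimal, Linux-only Python sketch, not anything Oracle ships, that issues small sequential writes through a file opened with O_DIRECT and O_DSYNC and reports the service times. The mount point and file name are hypothetical; point it at a file on whatever NFS-mounted SSD you are testing:

    # Minimal sketch of LGWR-style small sequential writes over an NFS mount.
    # Linux only; the path below is a hypothetical NFS-mounted SSD location.
    import mmap, os, time

    PATH       = "/mnt/ssd_redo/sketch_redo01.log"   # hypothetical file on the NFS mount
    WRITE_SIZE = 64 * 1024                           # ~64KB average redo write
    WRITES     = 1000

    # O_DIRECT needs an aligned buffer; anonymous mmap memory is page-aligned.
    buf = mmap.mmap(-1, WRITE_SIZE)
    buf.write(b"\x55" * WRITE_SIZE)

    # O_DIRECT bypasses the client page cache; O_DSYNC makes each write durable,
    # which is roughly the semantic a redo write needs.
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_DIRECT | os.O_DSYNC, 0o640)
    try:
        latencies = []
        for _ in range(WRITES):
            t0 = time.perf_counter()
            os.write(fd, buf)                        # sequential append
            latencies.append(time.perf_counter() - t0)
    finally:
        os.close(fd)

    latencies.sort()
    print("median write latency: %.2f ms" % (latencies[len(latencies) // 2] * 1000))
    print("p99    write latency: %.2f ms" % (latencies[int(len(latencies) * 0.99)] * 1000))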

Now that is a solution to a problem using real, genuine clustered storage. And no, I don’t think NetApp really believes a petabyte of disk will be under $9 in the next three years. That must be a typo. I know all about typos, as you blog readers can attest.

 

9 Responses to “Network Appliance OnTap GX–Specialized for Transaction Logging.”


  1. Alex Gorbachev February 11, 2007 at 12:50 am

    Thanks
    “Remember, most logging writes are 128KB or less…”

    In my experience, redo writes are much smaller in OLTP (well, there are different applications out there). With 300 commits per second I observed about 270-280 LGWR writes of 13-15K each, so piggy-back commit doesn’t deliver much. Now if I look at the same system at night during batch time, commits per second go down to 100 (quiet time) and redo writes are still only about 10% lower, but the size increases to 25-30K due to batch processing.
    Anyway, the point is that I usually observed much smaller than 128K writes from LGWR.

    Another point about redo logs on steroids (like SSD): you will also need to make sure that the ARCx processes are working very fast. I tested a couple of RamSan boxes a year ago and I could clearly see that the impact from ARCx kicking in was absolutely zero as long as we were below the box’s throughput capacity. However, the challenge is to make sure that your archive log destination writes really fast and without long slowdowns. The writes are big (depending on platform I saw up to 1MB), so it’s easier than online redo log storage, but it’s still important not to overlook it.

    Kevin, you also mentioned 2 logs per database, and this is only theoretical; in practice you would most probably want many more due to (1) having some reserve if your archivers get stuck and (2) having some room for incomplete checkpointing. Otherwise, chances are that the instance will freeze from time to time on log switch.

    Let’s do a small pricing exercise… If we assume that an online redo log is 1 GB and we have 4 groups per instance (a somewhat active system), then we need 4 GB of RamSan. Account for multiplexing (Oracle or external) and it doubles. Now a RamSan-320 with 32 GB should cost about $46K and should be enough for 4 database instances.
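
    For what it’s worth, here is that arithmetic as a tiny Python sketch (all the sizes and the ~$46K figure are just the assumptions above, not vendor quotes):

        # Sizing sketch using the assumptions above (not vendor quotes)
        log_size_gb  = 1       # one online redo log
        groups       = 4       # redo log groups per instance
        multiplex    = 2       # Oracle or external multiplexing doubles the space
        ramsan_gb    = 32      # assumed RamSan-320 capacity
        ramsan_price = 46000   # assumed approximate price in dollars

        per_instance_gb = log_size_gb * groups * multiplex     # 8 GB per instance
        instances       = ramsan_gb // per_instance_gb         # 4 instances
        print("%d GB per instance -> %d instances, ~$%d each"
              % (per_instance_gb, instances, ramsan_price // instances))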

    Also, your example with 20 LUNs for 10 instances would go up to 80 LUNs, which should make your argument even stronger. However, from the management perspective I have observed so far, they would think I’m crazy if I proposed provisioning the most high-end storage, SSD, via NFS. Which might be understandable. 🙂 Life is life.

  2. kevinclosson February 11, 2007 at 5:26 pm

    Alex,

    I said a minimum of 2 redo logs per database. You are right, folks do tend to allocate more for the reason you point out (possible incomplete checkpoints). Remember, however, that with 100% of disk accesses to/from redo logs coming in at 1ms, it is pretty hard to get wrapped around the axle. Of course you can mess things up pretty badly if ARCH is spooling to an archived log destination that is really slow. That will cause problems regardless of how fast the ARCH reads from the redo logs are…

    As always, thanks for stopping by.

  3. Alex Gorbachev February 11, 2007 at 6:11 pm

    “with 100% of disk accesses to/from redo logs coming in at 1ms, it is pretty hard to get wrapped around the axle”

    Actually, I saw that small IOs (8K-16K) were at 0.5 ms stable with RAMSAN but this probably depends on other components of infrastructure and OS.

    Anyway, the performance of redo logs has a very distant (I would say NO) impact on checkpoints and some positive impact on archivers, and I was able to wrap around easily with 2x1GB redo logs in my tests.

    Standing by for the next NUMA post… 😉

  4. kevinclosson February 11, 2007 at 6:35 pm

    “Standing by for the next NUMA post…”

    …ugh, that reminds me … 🙂

  5. kevinclosson February 11, 2007 at 6:46 pm

    “Anyway, the performance of redo logs has a very distant (I would say NO) impact on checkpoints and some positive impact on archivers …”

    What? Redo and checkpointing are joined at the hip. A log switch triggers a checkpoint as well as an ARCH spool to the archived log destination. Having CKPT, ARCH and LGWR invigorated simultaneously can wreak havoc on an array cache. This thread is about SSD to offload the array cache, after all. If you have the normal I/O demand plus the sudden CKPT/LGWR/ARCH storm of a log switch checkpoint all hitting an array head, you have an I/O storm on your hands. Offloading the ARCH sequential reads and the ongoing LGWR sequential writes of this storm is a good thing.

  6. kevinclosson February 11, 2007 at 6:52 pm

    “Actually, I saw that small IOs (8K-16K) were at 0.5 ms stable with RAMSAN but this probably depends on other components of infrastructure and OS.”

    I see 0.5ms when I write I/Os up to 1MB on the RamSan as well, but the thread is about provisioning RamSan via scalable NFS. My tests there show 1ms latency on I/Os all the way up to the bandwidth limit of a bonded GbE (approx 188MB/s)… I haven’t tested triple bonding, but I’m confident it would show 1ms at ~270MB/s, which I should think would handle the redo I/O load of most normal humans 🙂

  7. Alex Gorbachev February 12, 2007 at 2:35 am

    I’ll try to make it short this time. 🙂
    Checkpoint performance = DBWR performance. No _direct_ impact from redo logs performance.

    There might be lots of indirect impact, for example if one has 10 databases storing all kinds of files on the same NAS mount point (truly SAME’ed, if I may), and 3 databases start backups, another 2 are under heavy batch, and 3 more databases kick in with checkpoints…
    And if it’s all done over a single Gbit Ethernet, then God help this poor shop.
    This is why I’m generally negative about this kind of provisioning: it makes everything so abstract, removes visibility, and magically seems to perform until time X.

    Now, did I see that someone posted something about stupid questions…

  8. kevinclosson February 12, 2007 at 3:53 am

    “I’ll try to make it short this time. 🙂
    Checkpoint performance = DBWR performance. No _direct_ impact from redo logs performance.”

    Uh, OK Alex, let’s say you get the last word… I never argue arithmetic… Let’s pick this up over beers… uh, say, Hotsos… or, are you coming to RMOUG?

  9. Alex Gorbachev February 12, 2007 at 1:30 pm

    No RMOUG for me but we’ll meet at Hotsos.

