We are all at different skill and technical sophistication levels so this post will look like fluff to some readers. This is just a quick post to show a clear depiction of a SAN configuration that wouldn’t do anyone any favors in Oracle environment.
I showed some snapshots of the lab gear at my disposal in this blog entry. There are a lot of SANs in the lab here, but I can’t say which brand of SAN array I’m blogging about today, and honestly, this problem is not necessarily the architecture of the particular SAN array controller in question. Yes, this is a low-end array, but the concepts I’m blogging about are relevant through the high-end. The following vmstat(1) output shows one of the problems I’m blogging about:
# vmstat 10 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 3 3 0 31577944 81948 653352 0 0 96 597 13 34 0 7 89 4 5 4 0 31550968 82204 653096 0 0 73732 89913 2919 57992 0 10 60 30 0 4 0 31576760 82332 652968 0 0 74555 98056 2942 55904 0 10 60 30 1 6 0 31553824 82460 652840 0 0 80289 85870 2831 61435 0 10 63 28 0 4 0 31576968 82460 652324 0 0 79466 85397 2860 60537 0 10 62 28 8 0 0 31573368 82588 654260 0 0 41279 46733 1917 36035 0 32 53 15 8 0 0 31572896 82588 653744 0 0 11613 13344 1239 13387 0 49 46 5 5 0 0 31572528 82588 654260 0 0 8875 9039 1193 10727 0 47 49 4 7 0 0 31572280 82588 653744 0 0 6559 7057 1146 9823 0 47 49 4 6 0 0 31572176 82588 653744 0 0 6556 7048 1156 9121 0 50 47 3 4 0 0 31572072 82588 654260 0 0 5736 5956 1136 7955 0 47 50 3 6 0 0 31571576 82588 654260 0 0 4205 5529 1126 7851 0 47 51 2 8 0 0 31571584 82588 654260 0 0 5960 5264 1132 7982 0 48 49 3 7 0 0 31570896 82652 654712 0 0 4239 5167 1119 8086 0 47 51 2 11 0 0 31569968 82652 654712 0 0 5116 4995 1124 7437 0 55 43 2 7 0 0 31570184 82684 653648 0 0 4782 5082 1125 7880 0 54 44 2
This vmstat(2) output shows that Oracle was getting upwards of 165MB/s (via a single 2Gb FCP path) of combined read/write throughput until the array cache reached its saturation point. At that point, the throughput of the array was relegated to the bandwidth of the spindles being accessed.
Its All About Balance
I have a pretty simple view of storage and it is worth quoting myself:
Physical I/O is a necessary evil. Bottlenecking downwind of the SAN array cache is silly. Bottlenecking above the disks is foolishness.
Bottlenecking The Array (Silliness)
Configuring LUNs with insufficient spindle count to handle the cache miss (read-through) and write-back overhead is what I call bottlenecking the SAN array. This sort of bottleneck is strictly a configuration issue and therefore falls into the silliness category. The vmstat(1) output above is an example of bottlenecking the array. That is, the array could certainly deliver more throughput, but the spindle count was holding it back. Again, I’m not going to talk about what vendor’s SAN array controller it was because as it turns out any of them will act this way under these circumstances. I had a LUN of some 500GB that consisted of very few spindles so it was no surprise to me that the throughput was lousy. This LUN had to be configured this way due to other constraints on the array’s usage (read Kevin had to share some hardware). I got to live the pain I blogged about recently in this post about capacity versus spindle count. I could configure a more reasonable number of drives and get around this performance problem, but only up to the point where the tables turn and the array starts to bottleneck the disks.
Bottlenecking The Disks (Foolishness)
What do I mean by this? Well, most modern SAN arrays will bottleneck long before your application realizes the full bandwidth potential of the all the drives the array can support. I call this bottlenecking the disks and it falls into the foolishness category. This is basic foolishness on the SAN array vendor’s part. Why build storage arrays with horrible imbalance between capacity and performance?
Let me go over a typical case of a SAN array that bottlenecks the disks. Here is a link to some technical detail of a high end array. This particular array supports roughly 140TB of capacity when configured with the maximum spindle count (1024) of 146GB drives. On page 3 of the document, the vendor states the first figure related to bandwidth by citing the internal bandwidth of 15GB/s. What does that mean to the servers connected to this array? Reading further (pg. 15) the document states that there is some 10.6GB/s of aggregate data bandwidth and 5.3GB/s bandwidth for control. This means that if you get all the kings horses and all the kings men to work it out, you could feed 10.6GB/s of data to your servers. Indeed, 10.6GB/s is a great deal of bandwidth. So what am I talking about? Taking another look at page 15 we see the vendor’s claims that a single 146GB drive can deliver 99.9MB/s ( max theoretical). If you wanted to drive all these spindles at full throttle, they would theoretically deliver 99.9GB/s (1024 disks X 99.9MB/s) which is much less than the maximum theoretical data bandwidth of the array. In fact, if you drove “only” about 108 of these spindles at full throttle you’d saturate the array. I quoted the word only because 108 is a lot of disks, but that is only 10% of what the array supports.
If Oracle needs to do a physical I/O, let’s not bottleneck somewhere in the pipe! Think about it this way, in the technology stack I’m discussing (Oracle, high-end SAN array, etc), hard drive technology represents the simplest and least advanced component. That is, while hard drives are faster than they were 10 years ago, they have not fundamentally changed. They are still round and brown and they spin. Wouldn’t you think homely old disk drives would be the bane of performance? They aren’t. If we could drive all our disks at their maximum throughput, we’d be in a much better place performance-wise.
Chewing On Crushed Glass
This technical write-up about data warehousing experiences on a high-end SAN array is a perfect example of what must certainly be about as enjoyable as chewing on crushed glass. Ignore the many improper notation errors (e.g., using ‘b’ where ‘B’ is really intended). The write-up details the pain so I think it is important.
Summary
Hard drives are miserably low-tech necessary evils. Will we ever get a storage architecture where the more sophisticated components don’t make matters worse? Yes, we will. I’ll tell you all more as time passes. In the meantime, I bet dollars to donuts that the paltry 128 drives used in the TPC-H benchmark I blogged about were being driven at or near full bandwidth. That, my dear readers, is cool.
Hi Kevin,
Minor nitpick:
In the paragraph titled “Bottlenecking the Array”, the third sentence is “I call this bottlenecking the disks and it falls into the foolishness category.”
Shouldn’t that read “…bottlenecking the array…”?
-Mark
I think the idea is that if you are doing any amount of random I/O, the per disk throughput drops enormously, and you will not bottleneck the disks in a typical configuration.
Even if it’s sequential, you will be driving many many servers at the data rates you are talking about. Maybe there is an element of randomness introduced by ‘context switching’ on the disks with that many streams, if that makes any sense?