BLOG UPDATE (02-Feb-2010): This post has confused some readers. I mention in this post that Exadata Storage Server does not cache data. Please remember that the topic of this post is an audited TPC-H result that used Version 1 Exadata Storage Server cells. Version 2 Exadata Storage Server is the first release that caches data (the read-only Exadata Smart Flash Cache).
I’d like to tackle a couple of the questions that have come at me from blog readers about this benchmark:
Kevin, I saw the 1TB TPCH benchmark number. It is very huge. You say Exadata does not cache data so how can it get such result?
True, I do say Exadata does not cache data. It doesn't. Well, there is a 0.5 GB write cache in each cell, but that has nothing to do with this benchmark. This was an in-memory Parallel Query benchmark result. The SGA was used to cache the tables and indexes. That doesn't mean there was no physical I/O (e.g., sort spills), but the audited runs were not a proof point for scanning tables or indexes with offload processing.
Under The Covers
There were 6 HP Oracle Exadata Storage Servers (cells) in the configuration. Regular readers therefore know that there is no more than 6 GB/s of up-wind bandwidth, regardless of whether or not the data is cached in the cells. The database grid in this benchmark had 512 Xeon 5400 processor cores. I assure you all that 6 GB/s cannot properly feed 512 such processor cores, since that is only about 12 MB/s per core.
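The per-core figure is simple arithmetic. A quick sanity check, assuming roughly 1 GB/s of up-wind bandwidth per V1 cell (the per-cell figure implied by 6 cells totaling 6 GB/s):

```python
# Back-of-the-envelope check of bandwidth per core (assumed figures from the post)
cells = 6
bandwidth_per_cell_gb_s = 1.0   # assumed: ~1 GB/s up-wind per V1 cell
db_cores = 512                  # Xeon 5400 cores in the database grid

total_gb_per_s = cells * bandwidth_per_cell_gb_s
mb_per_s_per_core = total_gb_per_s * 1024 / db_cores

print(f"Aggregate: {total_gb_per_s:.0f} GB/s -> {mb_per_s_per_core:.1f} MB/s per core")
```

At roughly 12 MB/s per core, the cells alone clearly could not have driven 512 cores to a world-record result; the data had to be coming from memory.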
Let me just point out that this result is with Oracle Database 11g Release 2 on a 64-node database grid with an aggregate memory capacity of roughly 2TB. The email continued:
I guess this prove Oracle with Exadata is 10x faster?
I presume the reader was referring to the prior Oracle Database 11g 1TB TPC-H result on conventional storage. Folks, Exadata can be 10x faster than Oracle on state-of-the-art conventional storage (which is generally misconfigured, poorly provisioned, etc.). No argument here. But, honestly, I can't sit here and tell you that 6 Exadata cells with 72 disks are 10x faster than 768 15K RPM drives connected via 128 4Gb Fibre Channel ports, as used in the prior Oracle 1TB result, since that is about 50 GB/s of theoretical I/O bandwidth. If you investigate that prior Oracle Database 11g 1TB TPC-H result you'll see that it was configured with less than 20% of the RAM used by the new Oracle Database 11g Release 2 result (2080 GB aggregate vs 384 GB).
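The 50 GB/s figure for the prior configuration follows from the port count, assuming roughly 400 MB/s of usable payload per 4Gb Fibre Channel port (a common rule of thumb, not a number from the disclosure):

```python
# Theoretical I/O bandwidth of the prior conventional-storage configuration
fc_ports = 128        # 4Gb Fibre Channel ports in the prior 1TB result
mb_per_port = 400     # assumed: ~400 MB/s usable payload per 4GFC port

theoretical_gb_per_s = fc_ports * mb_per_port / 1024
print(f"{theoretical_gb_per_s:.0f} GB/s theoretical")
```

Compare that ~50 GB/s against the 6 GB/s ceiling of six V1 cells and the point stands: the new result was not won on raw storage bandwidth.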
So, what’s my point?
This new world record is a testimonial to the scalability of Real Application Clusters for concurrent, warehouse-style queries. As much as I'd love to lay claim to the victory on behalf of Exadata, I have to point out, in fairness, that despite Exadata playing a role in this benchmark, the result cannot be attributed to its I/O capability.
In short, there is no magic in Exadata that makes 6 12-disk storage cells (72 drives) more I/O capable than 768 drives attached via 128 dual-port 4GFC HBAs.
I’m just comparing one Oracle Database 11g result to another Oracle Database 11g result to answer some blog readers’ questions.
So, no, Exadata is not 10x faster on a per-disk basis. Data comes off of round-brown spinning thingies at the same rate whether downwind of Oracle via Exadata or Fibre Channel. The common problem with conventional storage is the plumbing. Balancing the producer-consumer relationship between storage and an Oracle Database grid on conventional storage, even at the rate produced by a measly 6 Exadata Storage Server cells, can be a difficult task. Consider, for example, that one would require a minimum of 15 active 4GFC host bus adapters to deal with 6 GB/s. Grid plumbing requires redundancy, so one would require an additional 15 4GFC paths through different ports and a different switch in order to architect around single points of failure. I've lived prior lives rife with FC SAN headaches and I can attest that working out 30 FC paths is no small chore.
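The plumbing arithmetic above can be sketched as follows (again assuming ~400 MB/s of usable payload per 4Gb FC path, a common planning figure rather than a disclosed number):

```python
import math

# Active 4GFC paths needed to absorb 6 GB/s, then doubled for a redundant fabric
target_mb_per_s = 6000   # 6 GB/s target, decimal
mb_per_path = 400        # assumed: ~400 MB/s usable per 4Gb FC path

active_paths = math.ceil(target_mb_per_s / mb_per_path)  # minimum active paths
total_paths = active_paths * 2                           # second fabric for redundancy
print(f"{active_paths} active paths, {total_paths} total with redundancy")
```

Fifteen active paths plus fifteen redundant ones through separate ports and a separate switch: that is the zoning, multipathing, and switch configuration burden conventional storage carries before it moves a single block.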
Hi Kevin,
I was tempted to write you too, but I was busy and I guessed you would write about it sooner rather than later…
Anyway, I read your post, it feels like a teaser, not a show off 🙂
here are my thoughts:
HP-Oracle Database Machine is supposed to be a well-balanced system for normal DW usage. It keeps a 4:7 ratio of DB nodes to storage nodes.
This system is significantly different…
What’s going on? Is the benchmark so different from “normal” DW usage? Why do we need so many compute nodes?
I would guess the DB doesn’t use Smart Scan (full scans) much for these benchmark queries (because of the ratio – most processing is in the 64 DB servers). So, some combination of bitmap indexes, MVs, and cubes? If so, we need a lot of random-access I/O – but we have only 72 disks! What is the magic sauce?
If we can’t do a lot of I/O, it means we cache. But the setup has only 1/3 TB of aggregate memory. Maybe compression comes into play (I have to look into the full disclosure)… Even so, it means that RAC Cache Fusion over 64 nodes is super effective – if true, that is pretty cool!
…
Re-reading, I see you give the credit to the 11g Release 2 RAC.
OK, got to go to browse through the full disclosure… thanks for the brain exercise 🙂
P.S. 64-node RAC!
If it cannot possibly be doing that much physical I/O, it’s probable that it isn’t.
you’re right, I was just writing down my train of thought…
Anyway, some further thoughts after reading the report:
– First, I read the memory size wrong…
each node has 32GB X 64 nodes ==> 2TB
however, DB_CACHE_SIZE is 19GB x 64 nodes ==> 1.2TB
I’m not an expert on TPC-H, but I guess 1TB means “generating 1TB of flat files to load” –> which is smaller than 1TB after compression (but we also need to cache indexes, MVs, cubes, etc.).
still, if a 64-node cluster can avoid most I/O by aggregating the cache, that is just amazing…
– at first I wasn’t sure there was a special 11g R2 thing – compatible=11.1.0.7 and optimizer_features_enable = 11.1.0.7.1
However, I then saw two new init.ora parameters about parallel/RAC, so it might be a hint to the secret sauce 🙂
– Smart Scan was really off (not needed), as I suspected – cell_offload_processing = false
– 512 hash subpartitions per day… lovely 🙂 Why don’t I get paid to play with these kinds of toys…
– loading 1TB of raw data with compression + indexing it + collecting statistics ==> 2 hours 23 minutes… With only six cells (but heavy parallel everything). This is also awesome!
OK, now I’ll try to be patient and wait for 11gR2 to eventually come to light
Even if you can’t say it, a very impressive result.
It is a hugely impressive result and it deserves discussion. Don’t ask me to explain why I’m not allowed to blog about it. Readers would sigh and roll their eyes.
You probably cannot blog about it since the benchmark apparently uses 11.2.
Word is leaking out (no big surprise) that 11.2 goes out on or just after Sept 1st; until that happens, various SEC-related rules probably make it difficult for Oracle employees to discuss it.
yes, things change…
Which version of 11.2 was the benchmarking done on?
The released version of 11.2 is apparently 11.2.0.1. It is a little unclear (to me at least, and I suspect a bunch of others) whether there is some additional maintenance pending on 11.2 that did not make it into the September 1st release.
Any hints/nudges/winks or just the old “sorry I cannot say anything” about what is going on here?
Some of us are going to be thinking about 11.2 and how soon to start planning a move to it (with appropriate testing, of course).
I wasn’t involved with the benchmark…well, unless you count sitting in on a few meetings as involvement…