BLOG UPDATE 2012.01.28: A lot has changed since this blog post so I need to point out that my mention herein about iDB ports to SPARC is clearly outdated. The production manifestation of the SPARC SuperCluster offers 6 Exadata Storage Servers in the full-rack configuration connected to the T4-4 hosts via the Exadata iDB protocol.
Preface
This blog entry is too long.
From Oracle Storage Strategy Update To TPC-C And Back (I Hope)
My recent blog entry entitled I Can See Clearly Now. Exadata Is Better Than EMC Storage! I Have Seen The Slides! Part I was pretty heavily read (over 7,000 views). I was concerned that blogging about something that happened two weeks ago might not be all that interesting. But, since my analysis (opinions) about the June 30, 2011 Oracle Storage Strategy webcast seems to resonate I thought I’d put out this installment.
What Do Transaction Processing Performance Council Benchmarks Have To Do With The Oracle Storage Strategy Update?
I’ve been eagerly anticipating which of IBM or HP would be first to audit a TPC-C with the Xeon E7 (formerly Westmere EX) processor. These vendors have value-add systems componentry that properly extends the vanilla Xeon E7 + QPI capabilities to include scalable 8-socket and very large memory support.
IBM’s x3850 with MAX5 supports 96 32GB low-voltage DIMMs for a total of 3TB RAM with just 4 sockets. IBM proved the strength of the x3850 several months ago with a 4-socket Nehalem EX (Xeon 7500) result of a little over 2.3 million TpmC. So, part of me was not all that surprised to find that they were able to stay with the recipe and publish a result of just over 3 million TpmC with the Xeon E7 processor and MAX5 chipset (July 11, 2011). But that has nothing to do with the Oracle Storage Strategy webcast and, in fact, since it was a DB2 number with Linux it has very little to do with Oracle. So why am I blogging this?
While the 3 million TpmC result represents roughly 30% improvement over the Nehalem EX-based result for IBM, I’m saddened the entry was not an 8-socket result. Why? Well, I’ll put it this way. If IBM and HP can’t seem to make 8-socket Xeon boxes able to scale contentious workloads (like TPC-C) then it’s quite likely nobody can. It looks like 8-socket Xeon scalability is still out of reach for us. That is just too bad. But that has nothing to do with the Oracle Storage Strategy webcast. So why am I blogging this? I’m getting to it, trust me.
While perusing the main TPC-C all-results page I noticed three interesting things and one of them actually has to do with the Oracle Storage Strategy webcast!
The three things that caught my eye were:
- There are non-clustered Xeon results in the top ten! Sure, the prior IBM x3850 result was in the top ten but when it was published I didn’t catch on to that fact. It wasn’t too long ago that non-clustered x86 boxes were so far down the list as to not matter.
- In the ranks of the top-ten results there are two submissions that are less than $1.00/TpmC. I think that is quite significant when you compare to historical costs. Top ten TPC-C results with Xeon at < $1.00/TpmC—wow.
- None of the products mentioned in the Oracle Storage Strategy webcast appear in the top ten TPC-C nor TPC-H for that matter. The last Oracle TPC-H result was a 3TB scale M9000 result with Sun Storage 6000 (Sun Storage 6000 is LSI Engenio hardware and the Engenio brand is now owned by Netapp for what it’s worth).
So, obviously, point 3 in the list is what brings me back to the Oracle Storage Strategy Update June 29, 2011 (slides). If one publishes an industry benchmark that performs 3x over the closest competitor, as Oracle did with the SuperCluster 30 million TpmC result, wouldn’t the system (including storage) used to do so be considered a premier system offering? One would think so, especially when the workload is an I/O intensive workload! But no, generally speaking the configurations used in TPC benchmarks are not to be confused with systems intended for production.
Concept Car or Production Car
The difference between TPC configurations and production configurations is a lot like the difference between a concept car and a car offered by the same manufacturer that is actually sitting on a lot with a price sticker on it. The concept car and the production car have a lot in common—but the differences are usually pretty obvious as well. We shouldn’t have a problem with this. I still think TPC benchmarks are good for certain purposes. An example of one such purpose is to see just how small the line is getting between the “concept car” and the “production car.”
SuperCluster Storage or Oracle Storage Strategy Line-up?
No, the “SuperCluster Storage” that was used for the 30 million TpmC result is not in the Storage Strategy line-up. So then what was the 30 million TpmC “concept car” storage? Take a peek at this link or let me summarize. The SuperCluster storage consisted of the following main ingredients:
- 97 Sun X4270M2 servers with one Intel Xeon removed. The 4270 servers ran Solaris and COMSTAR. As such, the servers play the role of “array heads” in order to perform protocol exchange between SAS and Fibre Channel. Why? Because the storage networking was Fibre Channel: 108 8GFC Fibre Channel HBAs connected the 27 Real Application Clusters nodes (4 HBAs each) to the COMSTAR heads, and SAS ran from the COMSTAR heads to the storage.
- 138 Sun Storage F5100 Flash Array devices. That bit was $22,000,000. Remember the analogy about the concept car.
So a high-level schematic of the flow of data was F5100 SAS -> COMSTAR head (SAS to FC) -> FC switches -> Sun T3-4 Servers. Don’t be alarmed by that many “hops” because they don’t really matter. Indeed, the 30 million TpmC SuperCluster delivered an average New Order response time of 0.35s, which is 69% faster than the IBM p780 result of 1.14 seconds. That’s a point Oracle marketing pushes vigorously. Oracle marketing doesn’t, however, seem to push the fact that while HP was still Oracle’s premier hardware partner they teamed with HP to deliver what was, at the time, a world record TPC-C using the recently-shunned Itanium processor. Moreover, they most certainly don’t push the fact that the circa-2007 Itanium TPC-C with Oracle10g delivered New Order average service times of 0.24s, which was 32% faster than the SuperCluster! Fine details matter.
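For anyone who wants to check the "percent faster" figures, here is a minimal sketch of the arithmetic. It assumes "X% faster" means the reduction in average response time relative to the slower result, and it uses the unrounded .353s SuperCluster figure quoted later in this post. The function name is made up for illustration.

```python
def percent_faster(slower_s: float, faster_s: float) -> float:
    """Reduction in response time, as a percentage of the slower result."""
    return (slower_s - faster_s) / slower_s * 100.0

# Average New Order response times in seconds, as quoted in this post.
IBM_P780 = 1.14
SUPERCLUSTER = 0.353
ITANIUM_2007 = 0.24

supercluster_vs_p780 = percent_faster(IBM_P780, SUPERCLUSTER)
itanium_vs_supercluster = percent_faster(SUPERCLUSTER, ITANIUM_2007)

print(f"SuperCluster vs. p780: {supercluster_vs_p780:.0f}% faster")
print(f"Itanium vs. SuperCluster: {itanium_vs_supercluster:.0f}% faster")
```

Under that definition the two comparisons come out at roughly 69% and 32%, matching the figures above.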
Concept Car to Oracle Storage Strategy Update
No, there is no evolution from concept to reality where the COMSTAR+F5100 approach is concerned. In fact, Oracle spelled out quite clearly how the storage recipe for these SuperClusters will be “Sun ZFS Storage 7420” which means either FC, iSCSI or NFS—but no Exadata since there is no port of Exadata iDB to SPARC (as of the publish date of this article). I think the ZFS Storage Appliance is a reasonable product but I wouldn’t want to stick my arm in the unified storage meat-grinder with the likes of EMC VNX and Netapp.
So, no, the storage used for the SuperCluster TPC-C shows no promise at this time of evolving from concept to production. However, Oracle customers should be glad because yet another addition to the storage strategy would be all too confusing in my opinion.
Final Words About That IBM x3850 Xeon E7 TPC-C Result
The Oracle SuperCluster result of 30 million TpmC (.353s average New Order service time) didn’t beat out the service times of the ancient Itanium 2 based SuperDome New Order transactions, but at least it also failed to beat the IBM x3850 average service times!
The IBM x3850 pumped out over 3 million TpmC with average New Order service times of .272s and all that for $.59/TpmC. How? Well, the storage wasn’t a concept. The lion’s share of the I/O was serviced by 136 SFF SAS SSDs! That’s about 1/50th the cost for storage for 1/10th the transaction throughput when compared to the SuperCluster. And faster transaction service times too.
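Those two ratios combine into a single storage-cost-per-throughput comparison. The sketch below uses only the approximate fractions stated above (about 1/50th the storage cost, about 1/10th the throughput), so treat the result as ballpark rather than an audited number.

```python
from fractions import Fraction

# Approximate IBM-to-SuperCluster ratios as stated in the paragraph above.
storage_cost_ratio = Fraction(1, 50)
throughput_ratio = Fraction(1, 10)

# Storage dollars spent per unit of throughput, IBM relative to SuperCluster.
storage_cost_per_tpmc_ratio = storage_cost_ratio / throughput_ratio

print(storage_cost_per_tpmc_ratio)
```

In other words, on these rough figures the x3850 spent about one-fifth as much on storage per TpmC delivered.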
Intel Xeon is my concept car of choice—and you can run about any software you so choose on it so that makes it even better. And regardless of what software I chose to run I would rather it not be stored in “concept storage.”
Summary
This blog entry was too long.


>> I’m saddened the entry was not an 8-socket result
I’m pretty sure you are aware of the scuttlebutt that there _was_ an 8-socket Xeon TPC-C result run by HP (on Nehalem though, not E7), but Oracle wouldn’t let HP publish it… I guess this might not be the whole story, but an interesting data point:
http://www.theregister.co.uk/2011/03/11/oracle_allegedly_stifles_hp_oracle_tpc_benchmark/
Yes I am intimately familiar with that DL980 Xeon 7500 result….I saw all the results files…as did my friends at Violin Memory Systems…..politics….
Hi Kevin,
There is 8-socket Xeon OLTP data over at TPC-E with SQL Server on Nehalem and Westmere-EX. For example, the Fujitsu Primergy RX900 S2 with the E7-8870s at 4555.54 tpsE. There is also 4-socket data here, such as the IBM system x3860 X5 with the E7-4870s at 2862.61 tpsE. So this gives published TPC data of 4 to 8 socket scalability of 1.59X to start with.
Cheers,
Steve
Ah Steve my old friend…thanks for pointing that out…1.6 scalability is a bit of a bummer though….
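A quick sketch of where the 1.59X comes from, and what it means as scaling efficiency against an ideal 2X for doubling the sockets. It uses the two tpsE figures Steve quotes. Note that these come from different vendors' systems, so this is only a first approximation of 4-to-8-socket scaling. The function name is hypothetical.

```python
def scaling(tps_small: float, tps_large: float,
            sockets_small: int, sockets_large: int) -> tuple[float, float]:
    """Return (speedup, efficiency), where efficiency is speedup over ideal linear scaling."""
    speedup = tps_large / tps_small
    ideal = sockets_large / sockets_small
    return speedup, speedup / ideal

# tpsE figures quoted in the comment above: 4-socket E7-4870 vs. 8-socket E7-8870.
speedup, efficiency = scaling(2862.61, 4555.54, 4, 8)

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

So doubling the socket count delivered a bit under 80% of ideal linear scaling on these two results.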
Kevin,
I wouldn’t necessarily agree. For the first data available this is already a figure that would beat most 4 x 2 socket or 2 x 4 socket clustered alternatives to 8 socket. As you know you also need the OS and database to scale as well – there is some database software out there that struggles to scale going from 2 to 4 socket irrespective of platform and OS. So my view is this is not too bad at all, it shows that the OS and database do scale and I’m sure we’ll see more 8 socket data going forward.
Cheers,
Steve
I’ll eagerly await more 8-socket action! Thanks for stopping by, Steve.
Kevin,
Hmm, am I missing something, ain’t the Sun Fire x4800 an 8 socket platform?
G
Hi George,
Yes, the Sun x4800 is an 8-socket glueless reference implementation, but what point are you drawing out?
Nothing, just questioned the availability of 8 socket platforms, and I think I know of 2, the DL870 from HP and the x4800 from Oracle. Or is there something in their architecture which precludes them from doing a vertical scalability test of 8 sockets?
G
The point I was making in the post is that even these flag-ship value-add 8S platforms like IBM with eX5 and HP with PREMA are not auditing 8S TPC-C so the scalability of 8S Xeon still eludes us.
You mention Violin – I’ve been looking at the blogs of respected database performance experts but nobody seems to mention the consequences of the latest generation of flash storage arrays. Some (perhaps ill-advised) claims on the Violin website indicate that data can be read from “disk” faster than over the cluster interconnect on a RAC system. Whether that is true I don’t know, but if physical I/O can be an order of magnitude faster at sustained rates surely the world of database performance tuning must be turned upside down?
I also note that Violin are making a big play on how they can compete with Exadata, EMC and IBM. For a small company their marketing department seems just as “committed” as the legendary Oracle marketing people (Unbreakable Linux anyone?)
Hi Dmitry,
I have friends at Violin. They are good, honest people. I’m not aware of the comparison between a RAC cr-send and a “go read it yourself physical I/O” but the topic of such comparison is interesting. We know that the RAC inter-node communications library (skgxp) is implemented over several network protocols and physical networking technology ranging from 1GbE with UDP to RDS RDMA over Infiniband. Both a read from Violin and a cr-receive from a RAC instance measure in the microseconds but I sort of don’t care about that because the comparison is a bit moot. Violin accelerates both writes *and* reads so the mention of RAC cr-sends sort of falls through the cracks. In the same vein, EMC FAST accelerates *both* writes and reads.
I don’t know when the trend started, but the rampant ignorance about the importance of writes in an OLTP/ERP environment makes me dizzy. If a platform can’t scale writes along with reads what good is it? This is one of my long-time issues with Exadata marketing and Oracle executive chest-thumping in particular. While they rightfully tout the fact that a full-rack Exadata X2 configuration can sustain about 1.5 million random single block reads (e.g., db file sequential read) they shamefully overlook the importance of the fact that it can only sustain 50,000 write IOPS. The 50,000 WIOPS is a gross capacity. ASM shaves off 50% of that with normal redundancy, leaving 25,000 net WIOPS.
So, if your application exhibits a read:write ratio of 60:1 then the nose-bleed Exadata read IOPS capacity is something you’ll benefit from. If your application exhibits a more realistic ratio of reads to writes like, say, 5:1 then the Exadata WIOPS capacity acts like a shackle on the ankle and holds the application back to 125,000 RIOPS (25,000 net WIOPS times 5). I appreciate the value of headroom, but the massive disparity between read and write capacity makes the hype about 1.5 million IOPS intellectually dishonest.
DISCLAIMER: Oracle Legal: Please don’t confuse yourselves over my mention of skgxp() and the Oracle wait event “db file sequential read” in this blog comment. Neither knowledge of the purpose libskgxp serves nor the definition of “db file sequential read” constitutes disclosure of confidential information.
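The arithmetic behind those numbers can be sketched as follows. This is a deliberately simple model, and it assumes the workload holds a fixed read:write ratio, that write capacity is a hard ceiling, and that ASM normal redundancy costs two physical writes per logical write. The function and parameter names are made up for illustration.

```python
def sustainable_read_iops(read_capacity: int, gross_write_capacity: int,
                          reads_per_write: float, mirror_copies: int = 2) -> float:
    """Read IOPS a fixed-ratio workload can reach before writes become the bottleneck."""
    # Each logical write costs `mirror_copies` physical writes (2 for normal redundancy).
    net_write_capacity = gross_write_capacity / mirror_copies
    # Reads can only grow as fast as the writes that accompany them.
    write_bound_reads = net_write_capacity * reads_per_write
    return min(read_capacity, write_bound_reads)

READ_CAPACITY = 1_500_000      # random single block reads, as quoted above
GROSS_WRITE_CAPACITY = 50_000  # gross write IOPS, as quoted above

at_60_to_1 = sustainable_read_iops(READ_CAPACITY, GROSS_WRITE_CAPACITY, 60)
at_5_to_1 = sustainable_read_iops(READ_CAPACITY, GROSS_WRITE_CAPACITY, 5)

print(f"60:1 workload: {at_60_to_1:,.0f} read IOPS")
print(f" 5:1 workload: {at_5_to_1:,.0f} read IOPS")
```

Only at roughly 60:1 does the workload get to use the full read capacity. At 5:1 the write ceiling caps reads at 125,000, less than a tenth of the headline figure.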