Oracle Exadata Storage Server Version 1. A “FAQ” is Born. Part I.

BLOG UPDATE (22-MAR-10): Readers, please be aware that this blog entry is about the HP Oracle Database Machine (V1).

BLOG UPDATE (01-JUN-09). According to my blog statistics, a good number of new readers find my blog by being referred to this page by Google. I’d like to draw new readers’ attention to the sidebar at the right where there are pages dedicated to indexing my Exadata-related posts. The original blog post follows:

I expected Oracle Exadata Storage Server to make an instant splash, but the blogosphere has really taken off like a rocket with the topic. Unfortunately there is already quite a bit of misinformation out there. I’d like to approach this with routine quasi-frequently asked question posts. When I find misinformation, I’ll make a blog update. So consider this installment number one.

Q. What does the word programmable mean in the product name Exadata Programmable Storage Server?

A. I don’t know, but it certainly has nothing to do with Oracle Exadata Storage Server. I have seen this moniker misapplied to the product. An Exadata Storage Server “Cell” (as we call them) is no more programmable than a Fibre Channel SAN or NAS Filer. Well, it is programmable, of course, by the Exadata product development organization, but there is nothing programmable for the field. I think, perhaps, someone may have thought that Exadata is a field programmable gate array (FPGA) approach to solving the problem of offloading query intelligence to storage. Exadata is not field-“programmable” and it doesn’t use or need FPGA technology.

Q. How can Exadata be so powerful if there is only a single 1gb path from the storage cells to the switch?

A. I saw this on a blog post today and it is an incorrect assertion. In fact, I saw a blogger state, “1gb/s???? that’s not that good.” I couldn’t agree more. This is just a common notation blunder. There is, in fact, 20 Gb/s of bandwidth between each Cell and each host in the database grid, which is close to 2 gigabytes per second (though the theoretical maximum is 1850MB/s due to the IB cards). I should point out that none of the physical plumbing is “secret-sauce.” Exadata leverages commodity components and open standards (e.g., OFED).
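The bits-versus-bytes confusion behind that question can be checked with a few lines of arithmetic. This is only a sketch: the helper name `gbit_to_mbyte_per_s` is made up for illustration, and the 8b/10b encoding factor is an assumption about the InfiniBand link that the post itself does not state.

```python
# Link-rate arithmetic: network rates are quoted in bits, storage rates in bytes.
BITS_PER_BYTE = 8


def gbit_to_mbyte_per_s(gbit_per_s, encoding_efficiency=1.0):
    """Convert a link rate in Gb/s to MB/s (decimal units).

    encoding_efficiency is the fraction of signalled bits that carry data.
    """
    return gbit_per_s * 1000 * encoding_efficiency / BITS_PER_BYTE


one_gbit = gbit_to_mbyte_per_s(1)           # the misquoted "1gb" path
ib_raw = gbit_to_mbyte_per_s(20)            # 20 Gb/s signalling rate
ib_data = gbit_to_mbyte_per_s(20, 8 / 10)   # assumption: 8b/10b line encoding

print(f"1 Gb/s            = {one_gbit:.0f} MB/s")
print(f"20 Gb/s raw       = {ib_raw:.0f} MB/s")
print(f"20 Gb/s at 8b/10b = {ib_data:.0f} MB/s")
```

Under those assumptions the 20 Gb/s link carries roughly 2 gigabytes per second of data, which is in line with the “close to 2 gigabytes” figure above; the 1850MB/s ceiling quoted for the IB cards sits a little below that.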

Q. How does Exadata change the SGA caching dynamic?

A. It doesn’t. Everything that is cached today in the SGA will still be cached. Most Exadata reads are buffered in the PGA since the plan is generally a full scan. That is not to say that there is no Exadata value for indexes, because there can be. Exadata scans indexes and tables with the same I/O dynamic.

Q. This Exadata stuff must be based on NAND FLASH Solid State Disk.

A. No, it isn’t and I won’t talk about futures. Exadata doesn’t really need Solid State Disk. Let’s think this one through. Large sequential read and write speed is about the same on FLASH SSD as on rotating media; it is random I/O where SSD is very fast. Twelve hard disk drives can already saturate the I/O controller, so plugging SSDs in where the 3.5″ HDDs are would be a waste.

Q. Why mention sequential disk I/O performance, since sequential accesses will only occur in rare circumstances (e.g., non-concurrent scans)?

A. Yes, and the question is what? No, honestly. I’ll touch on this. Of course concurrent queries attacking the same physical disks will introduce seek times and rotational delays. And the “competition” can somehow magically scan different table extents on the same disks without causing the same drive dynamic? Of course not. If Exadata is servicing concurrent queries that attack different regions of the same drives then, yes, by all means there will be seeks. Those seeks, by the way, are followed by 4 sequential 1MB I/O operations, so the seek time is essentially amortized out.
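A back-of-the-envelope model shows what “amortized out” means here. The numbers are illustrative: the ~85MB/s streaming rate is the per-disk figure mentioned elsewhere in this post, while the 10 ms positioning cost and the 8KB comparison block are hypothetical values chosen for the sketch, not measurements.

```python
def effective_mb_per_s(transfer_mb, positioning_ms, streaming_mb_per_s):
    """Throughput of one seek followed by transfer_mb of sequential reading."""
    transfer_ms = transfer_mb / streaming_mb_per_s * 1000
    return transfer_mb / ((positioning_ms + transfer_ms) / 1000)


STREAMING_MB_PER_S = 85.0   # approximate per-disk scan rate from this post
POSITIONING_MS = 10.0       # hypothetical seek + rotational delay

# Four sequential 1MB reads after each seek, as described above.
burst = effective_mb_per_s(4 * 1.0, POSITIONING_MS, STREAMING_MB_PER_S)

# For contrast: a single 8KB read after each seek (hypothetical block size).
single_block = effective_mb_per_s(8 / 1024, POSITIONING_MS, STREAMING_MB_PER_S)

print(f"4 x 1MB per seek: {burst:.1f} MB/s "
      f"({burst / STREAMING_MB_PER_S:.0%} of streaming rate)")
print(f"8KB per seek:     {single_block:.2f} MB/s")
```

With these assumed values the disk spends most of its time transferring rather than positioning, so it keeps the bulk of its streaming rate; the same seek cost paid per small block leaves it at under 1 MB/s.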

Q. Is Exadata I/O really sequential, ever?

A. I get this one a lot and it generally comes from folks that know Automatic Storage Management (ASM). Exadata leverages ASM normal redundancy mirroring which mirrors and stripes the data. Oh my, doesn’t that entail textbook random I/O? No, not really. ASM will “fill” a disk from the “outside-in.” This does not create a totally random I/O pattern since this placement doesn’t randomize from the outer edge of the platters to the spindle and back. In general, the “next” read on any given disk involved in a scan will be at a greater offset in the physical device and not that “far” from the previous sectors read. This does not create the pathological seek times that would be associated with a true random I/O profile.
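The difference between “always moving to a greater offset” and true random I/O can be illustrated with a toy model. This is not ASM’s allocation algorithm: the disk size, the number of extents, and their scattered placement are all hypothetical, and head travel is measured in abstract allocation units rather than time.

```python
import random


def mean_head_travel(offsets):
    """Average distance the head moves between consecutive reads."""
    moves = [abs(b - a) for a, b in zip(offsets, offsets[1:])]
    return sum(moves) / len(moves)


rng = random.Random(42)    # fixed seed so the run is repeatable
DISK_SIZE = 1_000_000      # hypothetical disk size in allocation units

# Hypothetical placement: the scanned table owns 2,000 scattered units.
extents = rng.sample(range(DISK_SIZE), 2000)

ascending = sorted(extents)   # scan visits extents at ever-greater offsets
shuffled = extents[:]         # true random I/O visits them in arbitrary order
rng.shuffle(shuffled)

short_seek = mean_head_travel(ascending)
random_seek = mean_head_travel(shuffled)

print(f"ascending scan: {short_seek:.0f} units per move")
print(f"random order:   {random_seek:.0f} units per move")
```

Even though the extents are scattered across the whole device, visiting them in ascending order keeps each move short, while the shuffled order sweeps across a large part of the disk on nearly every read.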

When Exadata is scanning a disk that is part of an ASM normal redundancy disk group and needs to “advance forward” to get the next portion of the table, Exadata directs the drive mechanics to position at the specific offset where it will read an ASM allocation unit of data, and on and on it goes. Head movements of this variety are considered “short seeks.” I know what the competition says about this topic in their positioning papers. Misinformation will be propagated.

Let me see if I can handle this topic in a different manner. If HP Oracle Exadata Storage Server were a totally random I/O train wreck then it wouldn’t likely be able to drive all the disks in the system at ~85MB/s. In the end, I personally think the demonstrated throughput is more interesting than an academic argument one might stumble upon in an anti-Exadata positioning paper.
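To put the ~85MB/s per-disk figure in context, here is the aggregate arithmetic for one cell. It is a sketch using only numbers that appear in this post (12 drives, ~85MB/s each, a 1850MB/s link ceiling); the variable names are illustrative.

```python
import math

DISKS_PER_CELL = 12             # drive count mentioned in this post
PER_DISK_MB_PER_S = 85          # approximate per-disk scan rate from this post
LINK_CEILING_MB_PER_S = 1850    # theoretical maximum quoted for the IB cards
ONE_GBIT_MB_PER_S = 1000 / 8    # a 1 Gb/s link expressed in decimal MB/s

cell_scan_rate = DISKS_PER_CELL * PER_DISK_MB_PER_S
link_utilization = cell_scan_rate / LINK_CEILING_MB_PER_S
gbit_links_needed = math.ceil(cell_scan_rate / ONE_GBIT_MB_PER_S)

print(f"aggregate scan rate per cell: {cell_scan_rate} MB/s")
print(f"share of the link ceiling:    {link_utilization:.0%}")
print(f"1 Gb/s links needed instead:  {gbit_links_needed}")
```

Twelve drives streaming at that rate produce about 1 GB/s per cell, which fits within the quoted link ceiling but would need several 1 Gb/s links to carry.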

Well, I think I’ll wrap this up as installment one of an on-going thread of Q&A on HP Oracle Exadata Storage Server and the HP Oracle Database Machine.

Don’t forget to read Ron Weiss’ Oracle Exadata Storage Server Technical Product Whitepaper. Ron is a good guy and it is a very informative piece. Consider it required reading, especially if you are trolling my site in the role of competitive technical marketing. <smiley>

16 Responses to “Oracle Exadata Storage Server Version 1. A “FAQ” is Born. Part I.”


1 Noons September 26, 2008 at 3:35 am

“trying to plumb 12 3.5″ SAS drives worth of streaming I/O through a 1 gigabit pipe would be sheer madness”

🙂
Oh boy, do you mean the deranged M$ claims I get here about 1gbps being “plenty enough” for our I/O throughput, “because it’s in a switch”, are ALL WRONG?
LOL!

Excellent stuff, Kevin. Looooooong overdue.
Gotta catch up with it all: been sick for a coupla days and haven’t had a chance to read/digest all this.

2 Duncan E September 26, 2008 at 9:36 am

Kevin,

2 questions:

1. I’m trying to grasp whether this is really just being pitched at the BI and data warehouse space, or whether it has real value in the OLTP space as well. Ron’s whitepaper mentions OLTP in passing:

“Eliminating data transfers and database server workload can greatly benefit data warehousing queries that traditionally become bandwidth and CPU constrained. Eliminating data transfers can also have a significant benefit on online transaction processing (OLTP) systems that often include large batch and report processing operations.”

What about in “pure” OLTP environments – is there much benefit there?

2. I know we shouldn’t set too much store in these things, but are there plans to submit TPC benchmarks?

Thanks

Duncan

3 Alan Powell September 26, 2008 at 12:02 pm

As always, thanks for the valuable information.

From the Oracle Exadata white paper:
“No cell-to-cell communication is ever done or required in an Exadata configuration.”
and a few paragraphs later:
“Data is mirrored across cells to ensure that the failure of a cell will not cause loss of data, or inhibit data accessibility”

Can both these statements be true and would we need to purchase a minimum of two cells for a small-ish ASM environment?

Thanks again, Alan

4 Mathew Butler September 26, 2008 at 12:03 pm

Hi Kevin,

Can you please also say a few words about what this new storage server means for OLTP type systems?

The design and implementation appears to me to be very focussed on making batch type processing much faster, i.e., queries that can most efficiently be satisfied through full table scans, index fast full scans, sorts, group bys, hash joins.

Can this same implementation have some benefit for an OLTP system? Some queries in an OLTP world are more about seeing the first rows of a result quickly (the first page of rows in an online GUI report of many pages), for example. On systems prior to this storage server, these queries might best be satisfied by the optimizer initiating an index access within a nested loop.

Some of the documentation I’ve read does talk about an advantage to OLTP systems, but can you quantify this please? Are we talking about very specific classes of OLTP system, or does this storage server have a performance advantage over existing hardware configurations for all system types?

I’m also interested as to what this hardware implementation means for database instrumentation and tuning, but I think I need to do some more reading before taking my thoughts further.

Best Regards,

Mathew Butler

5 RC September 26, 2008 at 7:12 pm

Is it possible to run more than one database on this ‘beast’?

6 David Aldridge September 26, 2008 at 8:07 pm

Kevin,

“Most Exadata reads are buffered in the PGA since the plan is generally a full scan.”

Do you mean something like, “since the plan is generally a parallel full scan.”? Serial full scans are still buffered in the SGA, although they don’t go to the MRU end of the LRU chain. Parallel full scans generally go to the PGA directly anyway so they’re not for sharing around.

If I had to guess then I’d say that serial operations still go to the SGA and do not benefit from optimising the projected columns, because that data is shared. I could see how multiple copies of a buffer could be kept in the SGA with each tagged to state the columns that it contains, but that seems like a very complex process that the designers would have left alone.

“If Exadata is servicing concurrent queries that attack different regions of the same drives then, yes, by all means there will be seeks. There are a lot of ways around this”

Yes indeed, and it’s not necessary to completely dedicate the spindle to serving a single query until it is finished before moving on to the next one. By simply reducing the head movements to one or two per second you can effectively time-slice the disk among multiple queries without appreciable loss of system bandwidth. It is the insane flickering of the heads 100 times a second that kills performance. The Linux kernel already has the anticipatory disk scheduler built in to do exactly that. Maybe the Exadata disk performance is simply leveraging a specialised disk scheduling algorithm in this respect.

7 kevinclosson September 26, 2008 at 8:38 pm

David,

Yes, you are right that serial scans and index range scans (using scattered reads) still go into the SGA and are shared. You’ll still see scattered reads and they are not intelligent scans. Think, “PGA, PGA.”

In the end, Exadata is beyond compare the best block server for Oracle so you aren’t losing anything when you aren’t getting serviced by offloaded scans.

8 Doug Burns September 27, 2008 at 12:09 am

David/Kevin,

“Serial full scans are still buffered in the SGA, although they don’t go to the MRU end of the LRU chain.”

Well, as this thing runs 11g, that’s not strictly true any more

http://oracledoug.com/serendipity/index.php?/archives/1320-Parallel-Query-and-11g-Part-2.html

http://oracledoug.com/serendipity/index.php?/archives/1321-11g-and-direct-path-reads.html

… and although I still have to get back to blogging about this again one day, because I never finished looking at it, I wonder if this change was even slightly related to this week’s products?

9 kevinclosson September 27, 2008 at 12:39 am

Guys, please, I know about the different types of scans and their respective types of reads. I’ll make a blog entry on that soon enough. There is a new I/O wait event specifically for smart scans so I’ll close the loop when I post on that.

10 Doug Burns September 27, 2008 at 12:44 am

Sorry, maybe I should have said ‘David’ without the Kevin. I just find it interesting that the change appeared not long before this week’s stuff. I look forward to your post.

11 David Aldridge September 29, 2008 at 9:14 pm

Tsh. Everything changes, it seems.

I’m intrigued by the mention of bloom filters in the documentation. I see events related to bloom filters (I think it was latches or enqueues, I don’t remember which) in 10.2.0.4, so I don’t think it’s new to 11.1. I had to go look at a Wikipedia article to see what they were all about, but it made complete sense that they’d be associated with hash joins.

12 kevinclosson September 29, 2008 at 10:09 pm

David,

Bloom filters are also an intrinsic function of Oracle Database, mostly to help out RAC. When they get pushed down to storage there is a specific entry in the query plan to show such. I have an example where I’m querying 33 Billion rows of credit card trans but only want the rows that match postal zip codes in “region west”. The bloom filter is constructed with something like 4,000 zipcodes (out of ~43,000) and pushed to each storage cell. The cells then use Bloom Filtering on every row of the 33 Billion card trans, sending up only those in the set…and, as is the nature of Bloom Filters, an occasional false positive. Bloom Filters cannot return false negatives but can return false positives since there is an infinite set it is dealing with. This should make more sense when I post the blog entry.
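For readers who have not met Bloom filters before, the behavior described above (no false negatives, occasional false positives) can be reproduced with a minimal, generic implementation. This is a sketch and not Oracle’s implementation: the `BloomFilter` class, the SHA-256 based hashing, the filter sizing, and the synthetic zip codes are all assumptions made for illustration. A false positive occurs when every bit checked for a non-member happens to have been set by other keys.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: a bit array probed at k hashed positions per key."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


# Synthetic data: 4,000 "region west" zip codes out of 43,000 in total.
all_zips = [f"{z:05d}" for z in range(43000)]
west_zips = set(all_zips[:4000])

bloom = BloomFilter(num_bits=40_000, num_hashes=7)   # about 10 bits per key
for z in west_zips:
    bloom.add(z)

# "Storage side": pass along only values that the filter might contain.
passed = [z for z in all_zips if bloom.might_contain(z)]
false_positives = [z for z in passed if z not in west_zips]
missed = [z for z in west_zips if not bloom.might_contain(z)]

print(f"passed: {len(passed)}, false positives: {len(false_positives)}, "
      f"missed: {len(missed)}")
```

Every genuine member passes the filter, so nothing is missed; a small number of non-members also pass and would have to be discarded by the consumer of the rows.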

The lesson about what goes into the SGA and PGA from one revision to the next is not so much Exadata related. Any parallel query plan that does PGA reads pre-Exadata will result in Exadata smart scans. There are exceptions such as we halt the smart scan as soon as we see a chained row since Storage has no idea what state the chained block is in (e.g., IUD underway). I’ll blog about that too at some point.

13 Doug Burns September 29, 2008 at 11:43 pm

David,

Getting away from Exadata specifics again for a moment … Christian Antognini has an article on Bloom Filters in Oracle 10gR2 here that you might be interested in

http://antognini.ch/2008/09/bloom-filters/

and Greg Rahn has blogged about them before, too

http://structureddata.org/2007/10/23/bloom-filters/

14 kevinclosson September 29, 2008 at 11:51 pm

Doug,

You’re on my blog dragging us away from Exadata? Watch out, a belt sander will drop from the ceiling and grind you to a fine powder!

🙂



Kevin Closson's Blog: Platforms, Databases and Storage