Archive Page 2

Can Oracle Database 11g Release 2 (11.2.0.3) Properly Count Cores? No. Does It Matter All That Much? Not Really.

…and with a blog post title like that who would bother to read on? Only those who find modern platforms interesting…

This is just a short, technically-light blog post to point out an oddity I noticed the other day.

For all I know this is already common knowledge, but it made me scratch my head so I’ll blog it. Maybe it will help some wayward googler someday.

AWR Reports – Sockets, Cores, CPUs
I’m blogging about the Sockets/Cores/CPUs reported in the top of an Oracle AWR report.

Consider the following from a Sandy Bridge Xeon (E5-2680 to be exact) based server.

Note: These are AWR reports so I obfuscated some of the data such as hostname and instance name.

WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Startup Time    Release     RAC
------------ ----------- ------------ -------- --------------- ----------- ---
SLOB          3521916847 SLOB                1 29-Sep-12 05:27 11.2.0.3.0  NO

Host Name        Platform                         CPUs Cores Sockets Memory(GB)
---------------- -------------------------------- ---- ----- ------- ----------
NNNN             Linux x86 64-bit                   32    16       2      62.87

OK, that’s simple enough. We all know that the E5-2680 is an 8-core part with SMT (Simultaneous Multi-threading) enabled. Further, this was a 2U 2-socket box. So, sure, 2 sockets and a sum of 16 cores. However, with SMT I get 32 “CPUs”. I’ve put CPU in quotes because these are logical processors.

The next example is a cut from an old Harpertown Xeon (Xeon 5400) AWR report. Again, we all know the attributes of that CPU. It was pre-QPI, pre-SMT and it had 4 cores. This was a 2-socket box—so no mystery here. AWR is reporting 2 sockets, a sum of 8 cores and, since they are simple cores, we see 8 “CPUs”.

WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Startup Time    Release     RAC
------------ ----------- ------------ -------- --------------- ----------- ---
XXXX          1247149781 xxxx1               1 27-Feb-13 11:32 11.2.0.3.0  YES

Host Name        Platform                         CPUs Cores Sockets Memory(GB)
---------------- -------------------------------- ---- ----- ------- ----------
xxxxxxxx.mmmmmm. Linux x86 64-bit                    8     8       2      62.88

Now The Oddity
Next I’ll show a modern AMD processor. First, I’ll grep some interesting information from /proc/cpuinfo and then I’ll show the top of an AWR report.

$ cat  /proc/cpuinfo | egrep 'processor|vendor_id|model name'
processor       : 31
vendor_id       : AuthenticAMD
model name      : AMD Opteron(TM) Processor 6272

$ head -10 mix_awr_16_8k.16.16

WORKLOAD REPOSITORY report for

DB Name         DB Id    Instance     Inst Num Startup Time    Release     RAC
------------ ----------- ------------ -------- --------------- ----------- ---
XXXXXX         501636137 XXXXXX              1 24-Feb-13 12:21 11.2.0.3.0  NO

Host Name        Platform                         CPUs Cores Sockets Memory(GB)
---------------- -------------------------------- ---- ----- ------- ----------
oel63            Linux x86 64-bit                   32    16       2     252.39

The system is, indeed, a 2-socket box, and cpuinfo is properly showing the processor model (Opteron 6200 family). Take note as well that the tail of the cpuinfo output is CPU 31, so the Operating System believes there are 32 “CPUs”. However, AWR is showing 2 sockets, a sum of 16 cores and 32 CPUs. That’s where the mystery arises. See, the Opteron 6200 16-core parts (such as the 6272) are a multi-chip module (MCM) consisting of two soldered dies, each with 4 “Bulldozer” modules. And never forget that AMD does not do multithreading. So that’s 2 dies x 4 modules x 2 cores (16 cores) in each socket. However, AWR is reporting a sum of 16 cores in the box. Since there are two sockets, AWR should be reporting 2 sockets, a sum of 32 cores and 32 CPUs. Doing so would more accurately follow the convention we grew accustomed to in the pre-Intel QPI days—as was the case above with the Xeon 5400.
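
If you want to cross-check what AWR reports against what the Operating System sees, /proc/cpuinfo has everything you need (lscpu(1) on recent distributions summarizes it in one shot). The following is just a sketch of the sort of one-liners I use; I’ve omitted the output since it varies by box:

$ grep -c '^processor' /proc/cpuinfo                    # logical CPUs the OS schedules on
$ grep 'physical id' /proc/cpuinfo | sort -u | wc -l    # populated sockets
$ grep 'cpu cores'   /proc/cpuinfo | sort -u            # cores per socket as the kernel counts them
$ lscpu | egrep 'Socket|Core|Thread|^CPU\(s\)'          # the same topology summary in one command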

In summary, none of this matters much. The Operating System knows the cores are there and Oracle thinks there are 32 “CPUs”. If you should run across a 2-socket AMD Opteron 6200-based system and see this oddity, well, it won’t be so odd any longer.

Multiple Multi-Core Modules on Multiple Dies Glued Together (MCM)?
…and two of them in one system? That’s the “N” in NUMA!

Can anyone guess how many NUMA nodes there are when a 2-Socket box with AMD 6272 parts is booted at the BIOS with NUMA on? Does anyone know what the model is called when one boots NUMA x64 hardware with NUMA disabled in the BIOS (or grub.conf numa=off)? Well, SUMA, of course!
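
By the way, there is no need to guess; numactl(8) answers both questions. The following is a sketch of what a 2-socket Opteron 6200 box reports with NUMA enabled in the BIOS (one node per die, so four in total); booted SUMA it collapses to a single node:

$ numactl --hardware | grep available
available: 4 nodes (0-3)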

My Oaktable World 2012 Video Session Is Now Online

Oaktable World 2012 was an event held during last year’s Oracle OpenWorld 2012 at a venue within walking distance of the Moscone Center. More information about Oaktable World can be found here.

The venue lent itself to good deep-technical discussions and free-thinking. However, as people who attended OpenWorld 2012 know, San Francisco was enduring near all-time record high temperatures. It must have been 98F inside the venue. The heat was only so much fun, and on top of that I was fighting a pretty nasty head cold. All of that aside, I took the podium one afternoon and was pleased to have a full house to present to.

The slides I brought touched on such topics as performance per core across generations of x64 hardware and methodologies for studying such things. I also spoke of Intel’s Turbo Boost 2.0 and how folks should add clock frequency monitoring tools to their standard bag of tricks.

The final master of the video is the fruit of Marcin Przepiorowski’s labor. For some reason there was a lot of audio/video trouble in the master. Marcin really outdid himself to stitch all this back together. Thanks, Marcin.

So, a lot was lost from the session—including the Q&A. However, I’d like to offer a link to the video and open this post up for questions on the material.

The video can be found here.

 

Using Linux Perf(1) To Analyze Database Processes – Part I

Troubleshooting Runaway Processes
Everyone reads Tanel Poder’s material—for good reason.

I took particular interest in his recent post about investigating where an apparently runaway, cpu-bound, Oracle foreground process is spending its time.  That article can be found here.

I’ve been meaning to do some blogging about analyzing Oracle execution with perf(1). I think Tanel’s post is a good segue for me to do so. You might ask, however, why I would bother attempting to add value in a space Tanel has already blogged. Well, to that I would simply say that modern systems professionals need as many tools as they can get their hands on.

Tanel, please ping me if you object, for any reason, to my direct links to your scripts.

Monitoring Window
Tanel’s approach to monitoring his cpu-bound foreground process is based on the pstack(1) utility. Once he identified the spinning process he fired off 100 pstack commands in a loop. For each iteration, the output of pstack was piped through a text processing script (os_explain). From my reading of that script I estimate it probably takes about 10 milliseconds to process the data flowing into it. I’m estimating here so please bear with me. If pstack execution requires zero wall clock time, I think these 100 snoops at the process stack likely occur in about 1 second of wall clock time. According to Tanel, the SQL takes about 23 minutes to complete. If the foreground process is looping through just a small amount of code then I see no problem monitoring a small portion of its overall execution time. Since you can see what Tanel is doing, you know it would be simple to grab the process every so often throughout its life and monitor the approximately 1 second with the 100 iterations of pstack. The technique and tools Tanel shares here are extensible. That is good.
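
For reference, the sampling loop Tanel describes boils down to something like the following sketch. I’m assuming os_explain reads pstack output on standard input (that is how I read his script), so adjust the path and the aggregation to taste:

$ export SPID=12345     # PID of the spinning foreground process (hypothetical)
$ for i in {1..100} ; do pstack $SPID ; done | ./os_explain | sort | uniq -c | sort -rn | head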

Perturbation
The Linux pstack utility stops the execution of a process in order to read its stack and produce output to standard out. Performance engineering techniques should always be weighed against the perturbation levied by the monitoring method.  I recall the olden days of CA Unicenter brutally punishing a system just to report performance metrics. Ouch.

Monitoring a cpu-bound process, even periodically, with pstack will perturb performance. There is such a thing as an acceptable amount of perturbation though. One must decide for oneself what that acceptable level is.

I would like to offer an example of pstack perturbation. For this exercise I will use the silly core-stalling program I call “fat.” This program suffers a lot of processor stalls because it has very poor cache locality. The program can be found here along with its cousin, “skinny.” Aptly named, skinny fits in cache and does not stall.  The following shows that the program executes in 51.251 seconds on my Nehalem Xeon server when executed in isolation.  However, when I add 100 pokes with pstack the run time suffers 5% degradation.

$
$ cat p.sh
#!/bin/bash

time ./fat &
p=`ps -f | awk ' $NF ~ /.fat/ { print $2 }'`

for i in {1..100}
do
	pstack $p
done  > /dev/null 2>&1
wait
$
$ sh ./p.sh

real	0m53.836s
user	0m50.604s
sys	0m0.024s
$
$ time ./fat

real	0m51.251s
user	0m51.221s
sys	0m0.013s
$

I know 5% is not a lot but this is a cpu-bound process so that is a lot of cycles. Moreover, this is a process that is not sharing any data nor exercising any application concurrency mechanisms (e.g., spinlocks, semaphores). I understand there is little danger in perturbing the performance of some runaway processes, but if such a process is also sharing data there can be ancillary effects when troubleshooting a large-scale problem. Before I move on I’d like to show how the execution time of “fat” changes when I gather 100 pstacks as soon as the process is invoked and then wait 10 seconds before gathering another 100 stack readings. As the following shows, the perturbation meter climbs up to 13%:

$ cat p2.sh
#!/bin/bash

time ./fat &
p=`ps -f | awk ' $NF ~ /.fat/ { print $2 }'`

for t in 1 2
do
	for i in {1..100}
	do
		pstack $p
	done  > /dev/null 2>&1
sleep 10
done

wait
$ sh ./p2.sh

real	0m57.930s
user	0m51.480s
sys	0m0.050s
$

Performance Profiling With perf(1)
As I said, I’ve been trying to find the time to blog about how important perf(1) should be to you in your Oracle monitoring—or any performance analysis for that matter. It requires a modern Linux distribution (e.g., RHEL 6, OEL 6) though. I can’t write a tutorial on perf(1) in this post. However, since I’m on the topic of perturbation via monitoring tools, I’ll offer an example of perf-record(1) using my little “fat” program:

$ time perf record ./fat
[ perf record: Woken up 8 times to write data ]
[ perf record: Captured and wrote 1.812 MB perf.data (~79187 samples) ]

real	0m52.299s
user	0m52.234s
sys	0m0.046s
$

$ perf report --stdio | grep fat
# cmdline : /usr/bin/perf record ./fat
    99.81%      fat  fat                [.] main
     0.14%      fat  [kernel.kallsyms]  [k] __do_softirq
     0.02%      fat  libc-2.12.so       [.] __memset_sse2
     0.01%      fat  [kernel.kallsyms]  [k] clear_page_c
     0.01%      fat  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.00%      fat  [kernel.kallsyms]  [k] get_page_from_freelist
     0.00%      fat  [kernel.kallsyms]  [k] free_pages_prepare

So, we see that monitoring with perf(1) does levy a small tax—2%. However, I just monitored the entire execution of the program.

Monitor Oracle Foreground Process With perf-record(1)
Now I’ll run the program Tanel used in his example. I’ll record 5 minutes of execution by identifying the process ID of the spinning shadow process and then using the -p option to perf-record(1). The way perf(1) works is to monitor the entire execution of the command it is given; when attaching to an existing process with -p it keeps sampling until you stop it, unless you hand it a command whose completion bounds the duration (or use the -c option). Now don’t be confused. In the following example I’m telling perf-record(1) to monitor the shadow process. The usage of sleep 300 is just my way of telling it to finish in 5 minutes.

$ cat t.sh

sqlplus / as sysdba < /dev/null 2>&1 &
$ sh ./t.sh &
[1] 462
$ ps -f
UID        PID  PPID  C STIME TTY          TIME CMD
oracle     462  2259  0 10:11 pts/0    00:00:00 sh ./t.sh
oracle     463   462  0 10:11 pts/0    00:00:00 sqlplus   as sysdba
oracle     465  2259  0 10:11 pts/0    00:00:00 ps -f
oracle    2259  2258  0 Feb14 pts/0    00:00:00 -bash
$ ps -ef | grep 463 | grep -v grep
oracle     463   462  0 10:11 pts/0    00:00:00 sqlplus   as sysdba
oracle     464   463 89 10:11 ?        00:00:14 oracleSLOB (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
$ sudo perf record -p 464 sleep 300
[ perf record: Woken up 35 times to write data ]
[ perf record: Captured and wrote 8.755 MB perf.data (~382519 samples) ]
$

$ perf report
$ perf report --stdio
# ========
# captured on: Mon Feb 18 10:17:02 2013
# ========
#
# Events: 286K cpu-clock
#
# Overhead  Command      Shared Object                       Symbol
# ........  .......  .................  ...........................
#
    29.84%   oracle  oracle             [.] kglic0
    11.08%   oracle  oracle             [.] kgxExclusive
    11.02%   oracle  oracle             [.] kglGetHandleReference
     6.66%   oracle  oracle             [.] kglGetMutex
     6.47%   oracle  oracle             [.] kgxRelease
     4.91%   oracle  oracle             [.] kglGetSessionUOL
     4.48%   oracle  oracle             [.] kglic_cbk
     3.79%   oracle  oracle             [.] kgligl
     3.44%   oracle  oracle             [.] kglMutexHeld
     3.12%   oracle  oracle             [.] kgligp
     1.96%   oracle  oracle             [.] kqlpgCallback
     1.92%   oracle  oracle             [.] kglReleaseMutex
     1.88%   oracle  oracle             [.] kglGetBucketMutex
     1.80%   oracle  oracle             [.] kglReleaseBucketMutex
     1.47%   oracle  oracle             [.] kglIsMutexHeld
     1.18%   oracle  oracle             [.] kglMutexNotHeld
     1.04%   oracle  oracle             [.] kglReleaseHandleReference
     0.51%   oracle  oracle             [.] kghalf
     0.33%   oracle  oracle             [.] qerfxFetch
     0.32%   oracle  oracle             [.] lnxmin
     0.31%   oracle  oracle             [.] qerfxGCol
     0.26%   oracle  oracle             [.] qeruaRowProcedure
     0.24%   oracle  oracle             [.] kqlfnn

$ perf report --stdio | grep oracle | sed 's/\%//g' | awk '{ t=t+$1 } END { print t }'
99.98
$

If you study Tanel’s post and compare it to mine you’ll see differences in the cost accounting Tanel has associated with certain Oracle kernel routines. That’s no huge problem. I’d like to explain what’s happening.

When I used perf-record(1) to monitor the shadow process for 300 seconds it collected 382,519 samples. When using a pstack approach, on the other hand, just be aware that you are getting a glimpse of whatever happens to be on the stack when the program is stopped. You might be missing a lot of important events. Allow me to offer an illustration of this effect.
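
As an aside, if sample density matters to your analysis, note that perf-record(1) accepts an explicit sampling frequency via its -F option. The following is just a sketch reusing the process ID from my example above; 997 Hz is an arbitrary choice:

$ sudo perf record -F 997 -p 464 sleep 300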

Do You See What I See?
Envision a sentry guarding a wall and taking a photo every 1 second. Intruders jumping over the wall with great agility are less likely to be in view when he snaps his photo. On the other hand, an intruder taking a lot longer to cross the wall (carrying a large bag of “loot” for instance) suffers greater odds of showing up in one of his photos. The sentry might get 10 photos of slow intruders while missing hundreds of intruders who happen to be more nimble and thus get over the wall more quickly. For all we know, the slow intruder is carrying bags of coal while the more nimble intruders are packing pockets full of diamonds. Which intruder matters more? It depends on how cold the sentry is, I guess :-) It’s actually a wee bit like that with software performance analysis. Catch me some time and I’ll explain why a CPI (cycles per instruction) of 1 is usually a bad thing. Ugh, I digress.
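
Incidentally, perf(1) will report the raw material for CPI without any special setup. The following is just a sketch using my “fat” program; perf-stat(1) prints the cycle and instruction counts along with the instructions-per-cycle ratio (the reciprocal of CPI):

$ perf stat -e cycles,instructions ./fat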

Tanel’s approach/kit works on a variety of operating systems and versions of Linux. It also does not require any elevated privilege. I only aim to discuss the differences.

Summary
I’ve lightly covered the topic of performance monitoring perturbation and have given a glimpse into what I envision will end up as a multiple-part series on perf(1).



Do IT Vendors Ever Do Real “Knowledge Sharing” ?

I’m working out my 2013 high-level plans and something dawned on me. I’m not sure I know what the term knowledge-sharing means any more.

Large IT vendors treat their top-100 customers with kid gloves. I know this, and I think it is just fine. Kid-gloves meetings occur frequently. These visits often include “knowledge-sharing” sessions where the IT vendor pitches their latest solutions and, perhaps, roadmaps for future products. IT vendors come back from these meetings with “top 10” product deficiency lists and sometimes items from those lists do make it into internal roadmaps or even PRDs (product requirement documents). That’s good. However, I wonder how much of that is thinking inside the box—or better yet, list items to make one’s life inside the box a little better. Or rather, how much time is spent discussing Y2K-era hangover problems? Trust me, I could convince you that a lot of what is heralded as state-of-the-art IT kit being pushed today is actually just stuff that addresses Y2K-era problems that largely do not exist with today’s hardware. When you read a current datasheet from any IT vendor, ask yourself whether someone just plopped a quarter in the jukebox to spin Party Like It’s 1999.

Who Is Talking to You?

I feel we in the IT community are facing some serious changes. Indeed, I’ll quote myself from the interview I did last year for the Northern California User’s Group Journal (here):

Everything we know in IT has a shelf life

Ok, yes, I just quoted myself. I’m sorry, but it helps me get my point across.

I’m thinking about visiting some customers this year for real knowledge-sharing. I emphasize sharing because I want to learn from you. I want to know about where you need to go so I can help figure out how to get you there. But I need to get the right audience. Audiences that are, for instance, not convinced we need to solve problems that no longer exist (thus my quote from that article).

So, for example, let’s say I showed up with the type of material I presented in this 2010 Hotsos Symposium presentation. Does your IT shop even have the type of IT personnel that would find this sort of material interesting? Would they attend? If they did attend, would they then let me pick their brains about where they need to take their IT shop so I can help build the right products?

How about a poll?

Luca Canali’s Wonderful Public Oracle-Related Materials

This is just a quick blog post to direct readers to the treasure trove of public information offered by Luca Canali (of CERN).

Let there be no mistake. These are not just ordinary presentations. Many of them have detailed how-to information in the speaker notes to amplify the information in the slides. They fall into the class of must-read.

Enjoy, and thanks, to Luca!

The list:

Compressing very large data sets in Oracle, UKOUG Conference 2009, Birmingham, Dec 2009.

Upcoming: Active Data Guard at CERN, UKOUG 2012 conference, Birmingham, December 4th 2012

Testing Storage for Oracle RAC 11g with NAS, ASM, and SSD Flash Cache, UKOUG 2011 conference, Birmingham, December 6th, 2011

CERN IT-DB Deployment, Status, Outlook – ESA-GAIA DB Workshop, ISDC, March 2011

ACFS under scrutiny, UKOUG Conference 2010, Birmingham, Nov 2010

Data Lifecycle Management Challenges and Techniques, a user’s experience, UKOUG Conference 2010, Birmingham, Nov 2010

CERN DB Services for Physics – 2010 Report, Distributed Database Workshop, CERN, Nov 2010

Overview of the CERN DB Services for Physics, Orcan Swedish Oracle Users group Conference, Stockholm, May 2010

Evaluating and testing storage performance for Oracle DBs, UKOUG Conference 2009, Birmingham, Dec 2009.

ASM configuration review, Distributed Database Operations Workshop, CERN, November 2009.

Storage for data management, ‘after C5’ CERN-IT presentation, CERN, May 2009.

Data Lifecycle Review and Outlook, Distributed Database Operations Workshop, Barcelona, April 2009

Database Operations Security Overview, Distributed Database Operations Workshop, Barcelona, April 2009.

Overview of Storage and IT-DM lessons learned, IT-DM technical group meeting, CERN, March 2009.

Implementing ASM Without HW RAID, A User’s Experience, presentation at UKOUG Conference 2008, Birmingham Dec 2008.

Datalifecycle_WLCG_DB_workshop_LC: Data lifecycle management ideas, WLCG Database Workshop, CERN Nov 2008.

Workhops_8_Jul_case_PVSS_LC: Oracle performance tuning case study, PVSS archiver, Database Developers Workshop, CERN 8-7-2008.

Workhops_8_Jul_perf_for_developers_LC: Oracle performance tuning ideas for developers, Database Developers Workshop, CERN 8-7-2008.

Oracle Storage Performance Studies: Scalability tests of VLDBs with RAC and ASM for Physics DB Services, WLCG Workshop, CERN, Apr 2008.

A closer look inside Oracle ASM, talk on Oracle ASM internals and performance measurements at UKOUG 2007, Birmingham, December 2007.

WLCG_Oracle_perf_for_admin: Oracle Performance for Administrators, WLCG Reliability Workshop, CERN November 2007.

Oracle_CERN_Service_Architecture, CERN DB workshop, January 2007.

UKOUG_RACSig_Oct06, UKOUG RAC SIG meeting, UKOUG RACSIG meeting, London, October 2006.

DB services at CERN HEPIX 2006: “Database Services for Physics at CERN with Oracle 10g RAC”, HEPIX conference, Rome, April 2006.

DB_Serv_Meeting_ASM_Perf: “ASM-based storage to scale out Database Services for Physics”, April 2006

3D_RAL_Mar_06: “Oracle 10gR2 configuration”, 3D DB Workshop at RAL (UK), March 2006.

T2_tutorials_Jun06_OracleRAC: Oracle and RAC for Physics DB June 2006

For more information visit Luca Canali’s page here: http://canali.web.cern.ch/canali/main.htm

Oracle Exadata X3 Database In-Memory Machine: Timely Thoughtful Thoughts For The Thinking Technologist – Part I

Oracle Exadata X3 Database In-Memory Machine – An Introduction
On October 1, 2012 Oracle issued a press release announcing the Oracle Exadata X3 Database In-Memory Machine. Well-chosen words, Oracle marketing, surgical indeed.

Words matter.
Games Are Games–Including Word Games
Oracle didn’t issue a press release about Exadata “In-Memory Database.” No, not “In-Memory Database” but “Database In-Memory” and the distinction is quite important. I gave some thought to that press release and then searched Google for what is known about Oracle and “in-memory” database technology. Here is what Google offered me:


With the exception of the paid search result about voltdb, all of the links Google offered take one to information about Oracle’s TimesTen In-Memory Database, which is a true “in-memory” database. But this isn’t a blog post about semantics. No, not at all. Please read on.

Seemingly Silly Swashbuckling Centered on Semantics?
Since words matter I decided to change the search term to “database in-memory” and was offered the following:

So, yes, a little way down the page I see a link to a webpage about Oracle’s freshly minted term: “Database In-Memory”.

I’ve read several articles covering the X3 hardware refresh of Oracle Exadata Database Machine and each has argued that Oracle Real Application Clusters (RAC), on a large-ish capacity server, does not qualify as “in-memory database.” I honestly don’t see the point in that argument. Oracle is keenly aware that the software that executes on Exadata is not “in-memory database” technology. That’s why they wordsmith the nickname of the product as they do. But that doesn’t mean I accept the marketing insinuation or the hype. Indeed, I shouldn’t accept the marketing message–and neither should you–because having the sacred words “in-memory” in the nickname of this Exadata product update is ridiculous.

I suspect, however, most folks have no idea just how ridiculous. I aim to change that with this post.

Pictures Save A Thousand Words
I’m horrible at drawing, but I need to help readers visualize Oracle Exadata X3 Database In-Memory Machine at a component level. But first, a few words about how silly it is to confuse Exadata Smart Flash Cache with “memory” in the context of database technology.

But A Bunch Of PCI Flash Cards Qualify As Memory, Right?
…only out of necessity when playing word games. I surmise Oracle needed some words to help compete against HANA–so why not come up with a fallacious nickname to obscure the fact? Well, no matter how often a fallacy is repeated it remains a fallacy. But, I suppose I’m just being pedantic. I trust the rest of this article will convince readers otherwise.

If PCI Flash Is Memory Why Access It Like It’s a Disk?
…because the flash devices in Exadata Smart Flash Cache are, in fact, block devices. Allow me to explain.

The PCI flash cards in Exadata storage servers (aka cells) are presented as persistent Linux block devices to, and managed by, the user-mode Exadata Storage Server software (aka cellsrv).

In Exadata, both the spinning media and the PCI flash storage are presented by the Linux Kernel via the same SCSI block internals. Exadata Smart Flash Cache is just a collection of SCSI block devices accessed as SCSI block devices–albeit with lower latency than spinning media. How much lower latency? About 25-fold faster than a high-performance spinning disk (~6ms vs ~250us).  Oh, but wait, this is a blog post about “Database In-Memory” so how does an 8KB read access from Exadata Smart Flash Cache compare to real memory (DRAM)? To answer that question one must think in terms relevant to CPU technology.

In CPU terms, “reading” an 8KB block of data from DRAM consists of loading 128 (64-byte) cachelines into processor cache. Now, if a compiler were to serialize the load instructions it would take about 6 microseconds (on a modern CPU) to load 8KB into processor cache–about 1/40th the published (datasheet) access time for an Exadata Smart Flash Cache physical read (of 8KB). But, serializing the memory load operations would be absurd because modern processors support numerous concurrent load/store operations and the compilers are optimized for this fact. So, no, not 6us–but splitting hairs when dissecting an absurdity is overkill. Note: the Sun F40 datasheet does not specify how the published 251us 8KB read service time is measured (e.g., end-to-end SCSI block read by the application or other lower-level measurement) but that matters little. In short, Oracle’s “Database In-Memory” data accesses are well beyond 40-fold slower than DRAM. But it’s not so much about device service times. It’s about overhead. Please read on…
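
For those who like to see the arithmetic, here it is, assuming 64-byte cachelines and roughly 50ns per fully serialized DRAM load (the 50ns figure is just a round number for a modern two-socket Xeon):

$ echo '8192 / 64' | bc      # cachelines in an 8KB block
128
$ echo '128 * 50' | bc       # nanoseconds if every load were serialized (about 6 microseconds)
6400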

High Level Component Diagram
The following drawing shows a high-level component breakout of Exadata X3-2. There are two “grids” of servers each having DRAM and Infiniband network connectivity for communications between database hosts and to/from the “storage grid.” Each storage server (cell) in the storage grid has spinning media and (4 each) PCI flash cards. The cells run Oracle Linux.

Exactly When Did Oracle Get So Confused Over The Difference Between PCI Flash Block Devices And DRAM?
In late 2009 Oracle added F20 PCI flash cards as a read-only cache to the second generation Exadata known as V2–or perhaps more widely known as The First Database Machine for OLTP[sic]. Fast forward to the present.

Oracle has upgraded the PCI flash to 400GB capacity with the F40 flash card.  Thus, with the X3 model there is 22.4TB of aggregate PCI flash block device capacity in a full-rack. How does that compare to the amount of real memory (DRAM) in a full-rack? The amount of real memory matters because this is a post about “Database In-Memory.”

The X3-2 model tops out at 2TB of aggregate DRAM (8 hosts each with 256GB) that can be used (minus overhead) for database caching (up from 1152 GB in the X2 generation). The X3-8, on the other hand, offers an aggregate of 4TB DRAM for database caching (up from the X2 limit of 2TB). To that end the ratio of flash block device capacity to host DRAM is 11:1 (X3-2) or 5.5:1 (X3-8).

This ratio matters because host DRAM is where database processing occurs. More on that later.

Tout “In-Memory” But Offer Worst-of-Breed DRAM Capacity
Long before Oracle became confused over the difference between PCI flash block devices and DRAM, there was Oracle’s first Database In-Memory Machine. Allow me to explain.

In the summer of 2009, Oracle built a 64-node cluster of HP blades with 512 Harpertown Xeon cores and paired it with 2TB of DRAM (aggregate). Yes, over three years before Oracle announced the “Database In-Memory Machine” the technology existed and delivered world-record TPC-H query performance results–because that particular 1TB scale benchmark was executed entirely in DRAM! And whether “database in-memory” or “in-memory database”, DRAM capacity matters. After all, if your “Database In-Memory” doesn’t actually fit in memory it’s not quite “database in-memory” technology.

Oracle Database has offered the In-Memory Parallel Query feature since the release of Oracle Database 11g R2.

So if Database In-Memory technology was already proven (over 3 years ago) why not build really large memory configurations and have a true “Database In-Memory”? Well, you can’t with Exadata because it is DRAM-limited–most particularly the X3-2 model, which is limited to 256GB per host. On the contrary, every tier-one x64 vendor I have investigated (IBM, HP, Dell, Cisco, Fujitsu) offers 768GB 2-socket E5-2600 based servers. Indeed, even Oracle’s own Sun Server X3-2 supports up to 512GB RAM–but only, of course, when deployed independently of Exadata. Note: this contrast in maximum supported memory spans 1U and 2U servers.

So what’s the point in bringing up tier 1 vendors? Well, Oracle is making an “in-memory” product that doesn’t ship with the maximum memory available for the components used to build the solution. Moreover, the Database In-Memory technology Oracle speaks of is generic database technology as proven by Oracle’s only benchmark result using the technology (i.e., that 1TB scale 512-core TPC-H).

OK, OK, It’s Obviously Not The “Oracle Exadata X3 Database In-DRAM Machine” So What Is It?

So if not DRAM, then what is the “memory” being referred to in Exadata X3 In-Memory Database Machine?  I’ll answer that question with the following drawing and if it doesn’t satisfy your quest for knowledge on this matter please read what follows:

Presto-magic just isn’t going to suffice. The following drawing clearly points out where Exadata DRAM cache exists:

That’s right. There has never been DRAM cache in the storage grid of Exadata Database Machine.

The storage grid servers’ (cells) DRAM is used for buffering and for the metadata managing such things as Storage Indexes and the mapping of which HDD disk blocks have copies present in the PCI flash (Exadata Smart Flash Cache).

Since the solution is not “Database In-Memory” the way reasonable technologists would naturally expect–DRAM–then where is the “memory?” Simple, see the next drawing:

Yes, the aggregate (22TB) Exadata Smart Flash Cache is now known as “Database In-Memory” technology. Please don’t be confused. That’s not PCI flash accessed with memory semantics. No, that’s PCI flash accessed as a SCSI device. There’s more to say on the matter.

Please examine the following drawing. Label #1 shows where Oracle Database data manipulation occurs. Sure, Exadata does filter data in storage when tables and indexes are being accessed with full scans in the direct path, but cells do not return a result set to a SQL cursor nor do cells modify rows in Oracle database blocks. All such computation can only occur where label #1 is pointing–the Oracle Instance.

The Oracle Instance is a collection of processes and the DRAM they use (some DRAM for shared cache e.g., SGA and some private e.g., PGA).

No SQL DML statement execution can complete without data flowing from storage into the Oracle Instance. That “data flow” can be either a) complete Oracle database disk blocks (into the SGA) or b) filtered data (into the PGA). Please note, I’m quite aware Exadata Smart Scan can rummage through vast amounts of data only to discover there are no rows matching the SQL selection criteria (predicates). I’m also aware of how Storage Indexes behave. However, this is a post about “Database In-Memory” and Smart Scan searching for non-existent needles in mountainous haystacks via physical I/O from PCI block devices and spinning rust (HDD) is not going to make it into the conversation. Put quite succinctly, only the Oracle Database instance can return a result set to a SQL cursor–storage cells merely hasten the flow of tuples (Smart Scan is essentially a row source) into the Oracle Instance–but not to be cached/shared because Smart Scan data flows through the process private heap (PGA).

Exadata X3: Mass Memory Hierarchy With Automatic What? Sounds Like Tiering.
I’m a fan of auto-tiering storage (e.g., EMC FAST). From Oracle’s marketing of the “Database In-Memory Machine” one might think there is some sort of elegance in the solution along those lines. Don’t be fooled. I’ll quote Oracle’s press release:

the Oracle Exadata X3 Database In-Memory Machine implements a mass memory hierarchy that automatically moves all active data into Flash and RAM memory, while keeping less active data on low-cost disks.

Sigh. Words Matter! Oracle Database does not draw disk blocks into DRAM (the SGA or PGA) unless the data is requested by SQL statement processing. The words that matter in that quote are: 1) “automatically moves” and 2) “keeping.” Folks, please consider the truth on this matter. Oracle doesn’t “move” or “keep.” It copies. The following are the facts:

  1. All Oracle Database blocks are stored persistently on hard disk (spinning) drives. “Active data” is not re-homed onto flash as Oracle’s press release insinuates. The word “move” is utterly misleading in that context. The accurate word would be “copy.”
  2. A block of data read from disk can be automatically copied into flash when read for the first time.
  3. If a block of data is not being accessed by an Exadata Smart Scan the entire block is cached in the host DRAM (SGA). Remember the ratios I cited above. It’s a needle’s-eye situation.
  4. If a block of data is being accessed by an Exadata Smart Scan only the relevant data flows through host DRAM (PGA) but is not cached.

In other words, “Copies” != “moves” and “keeping” insinuates tiering. Exadata does neither, it copies. Copies, copies, copies and only sometimes caches in DRAM (SGA). Remember the ratios I cited above.

Where Oh Where Have All My Copies Of Disk Blocks Gone?
The following drawing is a block diagram that depicts the three different memory/storage device hierarchies that hold a copy of a block (colored red in the drawing) that has flowed from disk into the Oracle Instance. The block is also copied into PCI flash (down in the storage grid) for future accelerated access.

Upon first access to this block of data there are 3 identical copies of the data. Notice how there is no cached copy of this block in the storage grid DRAM. As I already mentioned, there is no DRAM block cache in storage.

If, for example, the entirety of this database happened to fit into this one red color-coded block, the drawing would indeed represent a “Database In-Memory.” For Exadata X3 deployments serving OLTP/ERP use cases, this drawing accurately depicts what Oracle means when they say “Database In-Memory.”  That is, if it happens to fit in the aggregate Oracle Instance DRAM (the SGA) then you have a “Database In-Memory” solution. But, what if the database either a) doesn’t fit entirely into host DRAM cache and/or b) the OLTP/ERP workload occasionally triggers an Exadata Smart Scan (Exadata offload processing)? The answer to “a” is simple: LRU. The answer to “b” deserves another drawing. I’ll get to that, but first consider the drawing depicting Oracle’s implied “automatic” “tiering”:

I’ve Heard Smart Scan Complements “Database In-Memory”
Yeah, I’ve heard that one too.

The following drawing shows the flow of data from a Smart Scan. The blocks are being scanned from both HDD and flash block devices (a.k.a. Smart Flash Cache) because Exadata scans flash and HDD concurrently as long as there are relevant copies of blocks in flash. Important to note, however, is that the data flowing from this operation funnels through the non-shared Oracle memory known as PGA on the database host. The PGA is not shared. The PGA is not a cache. The effort expended by the “Database In-Memory Machine” in this example churns through copies of disk blocks present in flash block devices and HDD to produce a stream of data that is, essentially, “chewed and spat out” by the Oracle Instance process on the host. That is, the filtered data is temporal and session-specific and does not benefit another session’s query. To be fair, however, at least the blocks copied from HDD into flash will stand the chance of benefiting the physical disk access time of a future query.

The fundamental attributes of what the industry knows as “in-memory” technology are having the data in a re-usable (shared) form and keeping it in the fastest storage medium possible (DRAM).

We see neither of these attributes with Oracle Exadata X3 Database In-Memory Machine:

It Takes a Synopsis To Visualize The New “In-Memory” World Order
I’m not nit-picking the semantic differences between “in-memory database” and “database in-memory” because as I pointed out already Oracle has been doing “Database In-Memory” for years. I am, however, going to offer a synopsis of what is involved when Oracle Exadata X3 Database In-Memory Machine processes a block in an OLTP/ERP-style transaction. After all, Exadata is the World’s First Database Machine for OLTP and now that it is also Database In-Memory technology it might interest you to see how that all works out.

Consider the following SQL DML statement:

UPDATE CN_COMMISSION_HEADERS_ALL set DISCOUNT_PERCENTAGE = 99 WHERE TRX_BATCH_ID = 42;

This SQL is a typical OLTP/ERP rowid-based operation. In the following synopsis you’ll notice how an Exadata X3 Database In-Memory Machine transaction requires a wee bit more than a memory access. Nah, I’ll put the sarcasm aside. This operation involves an utterly mind-boggling amount of overhead when viewed through the rosy lenses of “in-memory” technology. The overhead includes such ilk as cross-server IPC, wire time, SCSI block I/O, OS scheduling time, CPU-intensive mutex code and so forth. And, remember, this is a schematic of accessing a single block of data in the Exadata X3 Database In-Memory Machine:

  • On the Host
    • Oracle foreground process (session) suffers SGA cache lookup/miss
      • <spinlocks, SGA metadata block lookup (cache buffers chains, etc)>
    • Oracle foreground LRUs a SGA buffer
      • <spinlocks cache buffers LRU, etc>
    • Oracle foreground performs an Exadata cell single block read
      • Oracle foreground messages across iDB (Infiniband) with request for block
        • <reliable datagram sockets IPC>
      • Oracle foreground process goes to sleep on a single block read wait event
      • <interrupt>
      • <request message DMA to Infiniband HCA “onto the wire”>
      • <wire time>
  • On Storage Server
    • <interrupt>
    • <request message DMA from Infiniband HCA into Cellsrv>
    • <Cellsrv thread scheduling delay>
    • Cellsrv gets request and evaluates e.g., specific block to read, IORM metadata, etc
    • Cellsrv allocates a buffer (to read the block into, also serves as send buffer)
    • Cellsrv block lookup in directory (to see if it is in Flash Cache, for this example the block is in the Flash Cache)
    • Cellsrv reads the block from the flash Linux block device (Exadata Smart Flash Cache devices are accessed with the same library API calls as spinning disk)
      • Libaio or LibC read (non-blocking either way since Cellsrv is threaded)
        • <kernel mode>
        • <block layer code>
        • <driver code>
      • <interrupt>
      • < DMA block from PCI Flash block device into send buffer/read buffer> THIS STEP ALONE IS 251us AS PER ORACLE DATASHEET
      • <interrupt>
        • <kernel mode>
        • I/O marked complete
        • <Delay until Cellsrv “reaps” I/O completion>
    • Cellsrv processes I/O completion (library code)
    • Cellsrv validates buffer contents (DRAM access)
    • Cellsrv sends block (send buffer) via iDB (Infiniband)
      • <interrupt>
      • <IB HCA DMA buffer onto the wire>
      • <wire time>
  • Back on Host
    • <interrupt>
    • <IB HCA DMA block from wire into SGA buffer>
    • Oracle foreground process marked in runnable state by Kernel
    • <process scheduling time>
    • Oracle foreground process back on CPU
    • Oracle foreground validates SGA buffer
    • Oracle foreground chains the buffer in (for others to share)
      • <spinlocks, cache buffers chains, etc>
    • Oracle foreground locates data in the buffer (a.k.a “row walk”)

Wow, that’s a relief! We can now properly visualize what Oracle means when they speak of how “Database In-Memory Machine” technology is able to better the “World’s First Database Machine for OLTP.”

Summary
We are not confused over the difference between “database in-memory” and “in-memory database.”

Oracle has given yet another nickname to the Exadata Database Machine. The nickname represents technology that was proven outside the scope of Exadata more than 3 years ago. Oracle’s In-Memory Parallel Query feature is not Exadata-specific and that is a good thing. After all, if you want to deploy a legitimate Oracle database with the “database in-memory” approach you are going to need products from the best-of-breed server vendors since they’ve all stepped up to produce large-memory servers based on the hottest, current Intel technology–Sandy Bridge.

This is a lengthy blog entry and Oracle Exadata X3 Database In-Memory Machine is still just a nickname. You can license Oracle database and do a lot better. You just have to know the difference between DRAM and SCSI block devices and choose best-of-breed platform partners. After all, those of us who are involved with bringing best-of-breed technology to market are not confused at all by nicknames.

Sources:

  1. 25 years of Oracle and platforms experience. I didn’t just fall off the turnip truck.
  2. Exadata uses RDS (Reliable Datagram Sockets), which is OFED open source. The synopsis of an Oracle-flavored “In-Memory” operation is centered on my understanding of RDS. You too can be an expert at RDS because it is open source.
  3. docs.oracle.com
  4. Exadata data sheets (oracle.com)
  5. Google
  6. Expert Oracle Exadata – Apress
  7. http://www.oracle.com/technetwork/database/exadata-smart-flash-cache-twp-v5-1-128560.pdf
  8. http://www.oracle.com/technetwork/server-storage/engineered-systems/exadata/exadata-smart-flash-cache-366203.pdf
  9. http://www.aioug.org/sangam12/Presentations/20205.pdf

In other words, this is all public domain knowledge assembled for your convenience.

My First Words About Oracle Exadata X3 In-Memory Database Machine

I’ve had countless emails from readers asking for a technical analysis of what Oracle announced at OpenWorld 2012 pertaining to the X3 refresh of Exadata Database Machine. I attended the show, fell ill and subsequently had a lot of work backlog to clear. I will get to this next week and, not surprising to readers of this blog, I’ll take aim at the following words: “Database In-Memory Machine” as they appear in the new marketing nickname for Exadata Database Machine.

Yes, I will blog the matter but would first like to recommend the following excellent blog posts by @flashdba as they relate to “Database In-Memory Machine”:

In Memory Databases Part I

In Memory Databases Part II

Note: Part II has one tiny bit of errata as discussed in the comment section of the post. The post speaks of the cache hierarchy of X3 and includes Exadata Storage Server DRAM in the aggregate. I need to point out that we know from reading the myriad public information on the matter (Oracle’s whitepapers, employee blogs and Expert Oracle Exadata (Apress)) that the DRAM in Exadata Storage Server cells is not used for cache. DRAM in the storage servers is used for management (metadata) of Exadata Smart Flash Cache contents, Storage Indexes metadata and buffering (IB send/receive buffers, HCC decompression output, etc). The cache hierarchy of X3 is quite succinctly host DRAM (SGA buffers and Results Cache) and Exadata Smart Flash Cache (the PCI flash devices accessed via SCSI disk driver through the Linux block I/O layer in the cells).

 

