You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part II.

BLOG UPDATE (19-JUN-2009): I need to point out that ML 759565.1 has been significantly revised. The message regarding testing before enabling NUMA persists. Not that it matters much; I concur with that advice. The original post follows:

Cart Before Horse?
Yes, in this mini-series of posts Part II will precede Part I. I’ll explain…eventually.

In the comment thread of my recent blog post entitled Oracle Database 11g Automatic Memory Management – Part IV. Don’t Use PRE_PAGE_SGA, OK?, a reader asked if I’d comment on Metalink note 759565.1. The reader’s specific interest was in the late-breaking stance Oracle Support has taken regarding software NUMA optimizations. Just what is that support position? Run with all NUMA software optimizations disabled. Full stop.

I have been in contact with the Support Engineer who “owns” ML 759565.1 and have given some advice on how to change that note. I’ve been informed that a re-write is underway and that I will be on the review list for that revision. That’s good, but I still think this topic is worthy of your time (feeling especially presumptuous today, I guess).

The current rendition of the support note reads:

Oracle Support recommends turning off NUMA at the OS level or at the database level. If NUMA is disabled at the OS level then it will also be disabled at Oracle level. It does not matter how NUMA is disabled as long as it is disabled either at the OS layer or the Database layer.

What’s that? Run a NUMA system with all software optimizations disabled? Gasp! But wait, given how much I’ve rambled on about NUMA and NUMA software optimization principles over the years you will surely be flabbergasted to find out that, in principle, I agree with this stance.
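
For concreteness, here is a minimal sketch of what disabling NUMA at each layer can look like on Linux. The boot-loader entry and the hidden parameter shown are illustrative; verify both against your kernel and Oracle release before touching anything.

    # OS layer: boot with kernel NUMA support disabled by appending
    # numa=off to the kernel line in the boot loader, e.g.:
    #   kernel /vmlinuz-2.6.18-128.el5 ro root=/dev/sda1 numa=off
    #
    # Database layer: Oracle Database 11.1 gates its NUMA optimizations
    # behind a hidden parameter (FALSE by default once patch 8199533 is
    # applied); check the exact name on your release first:
    #   SQL> alter system set "_enable_NUMA_optimization"=false scope=spfile;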

What Is It Exactly That I Agree With?
Herein lies the rub. I have to offer a lengthy diatribe about NUMA in order to explain what I agree with. Not all NUMA systems are created equal. NUMA systems fall into one of three camps:

  • Pioneer, Proprietary NUMA Implementations (PPNI).
  • Modern, Proprietary NUMA Implementations (MPNI).
  • Commodity NUMA Implementations (CNI).

Different NUMA Implementations == Differences!
Let’s see if I can make some sense of these differences. And, trust me, this does relate to ML 759565.1.

1)      Pioneer, Proprietary NUMA Implementations (PPNI). The first commercial cache-coherent NUMA system was the Sequent NUMA-Q 2000. Within a couple of years of that hardware release there were several other pioneer implementations brought to market by DG, DEC, SGI, Unisys and others. The implementation details of these pioneer NUMA systems varied hugely (e.g., interconnect technology, levels of OS NUMA awareness, etc.). One thing these pioneer implementations all shared was the fact that they suffered huge ratios between local and remote memory latency. When I say huge, I’m talking as much as 50 to 1 for highly contended multiple-hop remote memory. The only reason these pioneer systems were brought to market was that they offered tremendous advancements in system bandwidth. The cost, however, was lumpy memory, and thus software NUMA-awareness was of utmost importance. I would consider systems like the Sun E25K to be “second-generation” pioneer systems. Sure, the E25K suffered memory locality costs, but not as badly as the true pioneer systems. Few would argue that even the “second-generation” pioneer systems relied heavily on software NUMA-awareness.

2)      Modern, Proprietary NUMA Implementations (MPNI). I’m not going to cite many systems here as cases in point. I don’t aim to wound the tender sensibilities of any hardware supplier. I can define what I mean by MPNI by simply stating that MPNI systems differ from PPNI in terms of remote-to-local memory latency ratios. In short, MPNI systems have very favorable L:R latency ratios. By very favorable, I mean significantly less than 2 to 1. An example of an MPNI system would be the Sun SPARC Enterprise M9000 Server which, according to my good friend Glenn Fawcett, sports an approximate local-to-remote latency ratio of 1.3:1. In my opinion, it is not worth the complexity necessary to do proper, effective software NUMA awareness when there is only a 30% disparity between local and remote memory (at least not Oracle NUMA-awareness). Now, having said that, I know the M9000 supports scaling to multiple cabinets. I don’t know enough about the crossbar (Jupiter Interconnect) to say whether it requires any “hop” overhead in a multiple-cabinet configuration. Sun literature states point-to-point without caveats, so the L:R ratio might remain constant as one adds cabinets. Nonetheless, the point being made here is that there exist today modern, proprietary NUMA implementations, and concerns over Oracle NUMA awareness should be weighed according to MPNI capabilities, not arcane PPNI capabilities.

3)      Commodity NUMA Implementations (CNI). I don’t feel compelled to hide my exuberance for modern NUMA implementations such as the Intel QuickPath Interconnect (QPI) and the HyperTransport (HT) interconnect used by AMD. The points I want to make about CNI are as follows:

  1. Memory Latency Ratios. While I’ve not stayed as up to speed on local-remote ratios with HT 3.0, I know that the Intel QPI-based systems offer very pleasant L:R ratios (e.g., 1.4:1 or better). More importantly, I should point out that even remote memory references in Nehalem-based servers (Xeon 5500) are faster than all memory references in the previous-generation Xeon-based systems (e.g., “Harpertown” Xeon 5400)!
  2. BIOS-Enabled NUMA. Commodity NUMA systems support the concept of boot-time NUMA enablement. When booted with NUMA disabled at the BIOS, the resultant memory architecture is commonly referred to as Sufficiently Uniform Memory Access (SUMA) or Sufficiently Uniform Memory Organization (SUMO).
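
A quick way to tell which mode a Linux box booted in is sketched below; it assumes the numactl package is installed, and the output shown is representative rather than from any specific system.

    $ numactl --hardware             # two or more nodes => NUMA exposed;
                                     # a single node => SUMA/SUMO (BIOS NUMA
                                     # off, or node interleaving enabled)
    available: 2 nodes (0-1)
    ...
    $ ls /sys/devices/system/node/   # one nodeN directory per NUMA node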

Does All This Really Relate to ML 759565.1?
Yes. While I haven’t seen the re-write of that note, I’ll say what I think it needs to say:

  1. Regarding Commodity NUMA Implementations.
    1. If you are running Oracle on a CNI system you should test before you even bother enabling NUMA in the BIOS, not the other way around. If you deploy a CNI system, such as a two-socket Intel Xeon 5500 (Nehalem) server, to run Oracle, I assert that you would have to do significant testing to find much of a performance difference between disabling NUMA in the BIOS and a fully NUMA-aware configuration (i.e., NUMA on in the BIOS + NUMA on in the OS + Oracle enable_NUMA_optimization=TRUE). That is, you will likely have to go to significant effort to find a performance delta greater than about 10%, and it will be extremely workload-dependent (in reality, I’m aware of test results that show a 10% improvement with NUMA disabled in the BIOS). Having said that, I’m not sure a 10% improvement would be worth the Linux-specific issues associated with setting enable_NUMA_optimization to TRUE. I’ll gladly take my medicine from anyone who can show me more than, say, 10% with all the bells and whistles (and associated thorns and barbs) of a fully NUMA-aware Oracle Database 11g deployment. Remember, I’m speaking specifically about CNI systems, and let’s keep track of the publish date of this blog entry because the number of sockets can bear significance on this topic, as I’ll point out later in this series. When I talk about a SUMA/SUMO approach to Oracle on CNI, I mean single-hop memory, which should be the case up to at least 4 sockets, though I don’t know for sure, since QPI systems are limited to two sockets today.
      1. What are the “thorns and barbs” I allude to? Well, it’s all about OS NUMA sophistication with specific regard to when, or if, to remotely execute a process, when an imbalance warrants a process migration, and how often such things should occur. The best NUMA-aware Operating System ever developed (DYNIX/ptx) had these sorts of issues ironed out flat. We (Sequent) had these sorts of issues directly in our fully focused rifle scopes and we weren’t messing around. Who is scrutinizing such issues in today’s world with CNI systems? After all, the “thorns and barbs” I allude to are dealt with in the Operating System. Do you know many Linux guys who are scrutinizing low-level systems performance issues specific to Oracle on CNI with bus traces and all? That’s not exactly a saturated research field 🙂 There is a reason the term SUMA/SUMO exists!
  2. Regarding Modern Proprietary NUMA Implementations.
    1. If you are running Oracle on an MPNI system, take heed of Oracle Support Services’ recommendation to run with Oracle NUMA optimizations disabled (enable_NUMA_optimization = FALSE), as is the default with patch 8199533 applied. Based on testing and analysis, it may make sense to enable NUMA awareness in Oracle; that will depend on the workload and the degree of non-uniformity of your MPNI system. Talk of disabling NUMA in the OS will likely not relate to your platform (more on this later in this post), and talk of disabling NUMA at the hardware level (akin to the BIOS CNI approach) is most likely irrelevant. A sketch of how to inspect the relevant parameter follows this list.
  3. Regarding Pioneer Proprietary NUMA Implementations.
    1. If you are running Oracle on a PPNI system, I’d like to buy you a beer! That would be a pretty old system with very old software. But, vintage notwithstanding, it would likely be a pretty stable system.
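
Since enable_NUMA_optimization figures in all three cases above, here is a minimal sketch of how one could list the NUMA-related initialization parameters. These are hidden (underscore) parameters whose names vary by release, so treat this as read-only reconnaissance and involve Oracle Support before changing any of them.

    $ sqlplus -s / as sysdba <<'EOF'
    -- List NUMA-related hidden parameters and their current values.
    -- On 11.1, the parameter discussed in this post appears as
    -- _enable_NUMA_optimization.
    select x.ksppinm name, y.ksppstvl value
      from x$ksppi x, x$ksppcv y
     where x.indx = y.indx
       and x.ksppinm like '%NUMA%';
    EOF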

That Middle Ground: MPNI with enable_NUMA_optimization = FALSE
It’s true. Running Oracle Database with its NUMA-awareness disabled on a MPNI system is a sort of middle ground. It so happens that this particular middle ground makes a lot of sense—and you likely have no choice either way. After all, I know of no MPNI system that allows you to run in a hardware-SUMA mode. It would be a bit of a chore to interleave memory across motherboards.

The Operating Systems that support MPNI systems generally interleave the memory allocations that back IPC shared memory and mmap()s. That is a good thing, since the result is fairness of access to the shared resource. More importantly, on really large NUMA systems (as most MPNI systems are), private memory allocations (e.g., stack and heap) are allocated from the memory local to the NUMA node the process is executing on. Likewise, kernel structures affiliated with the process are allocated from local memory. So, the blend is a good thing on these types of systems. Having said that, however, I cannot vouch for what happens when there is CPU saturation on a NUMA node and the process scheduler is faced with a decision to remotely execute a process or just let idle cycles blow by. Moreover, I cannot vouch for these Operating Systems’ intelligence regarding when to just migrate a process over to a less-saturated NUMA node, re-homing all the physical memory backing the process. Getting that stuff right is pretty hard work. I’ve got the T-shirt to prove it.
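
To make that blend concrete, here is a hypothetical Linux illustration using numactl (db_prog is a placeholder). The MPNI operating systems above implement these policies on their own; this is purely to show what “interleaved shared resources, local private memory” means.

    $ numactl --show                    # default policy is local allocation:
                                        # stack and heap come from the node
                                        # the process is running on
    $ numactl --interleave=all db_prog  # round-robin placement across nodes,
                                        # as these OSes do for shared memory
    $ numastat                          # per-node allocation statistics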

I’m going to call this Part II and follow up with Part I soon. When you read Part I, that move will make sense.

18 Responses to “You Buy a NUMA System, Oracle Says Disable NUMA! What Gives? Part II.”


  1. Noons May 14, 2009 at 11:47 pm

    Thanks for the excellent post, Kevin. Very timely for me as well.
    One point I’d appreciate your comment, if possible.

    Given that in the CNI environment the R memory speed is usually highly variable – dependent on the selected memory chips and their maker – the L:R ratio would vary greatly.

    As in: the L speed is more or less fixed by the CPU/board maker, while the R speed is basically dependent on Mr. Kingston – or equivalent.

    What would be the best way of testing the L:R ratio? Is there such a thing as an Oracle NUMA L:R test harness one could pick up and run with, similar to the Orion tool for I/O? Or just a plain old memory speed test, of which there are quite a few around the net?

    Thanks in advance for any feedback.

    • kevinclosson May 15, 2009 at 5:33 pm

      As always, Noons, thanks for stopping by….

      I’m not sure I fully understand your question, but let me give it a try. Tell me if I’m off base.

      Let me say in advance that I do not have a full suite of test results to back up what I’m currently saying. I do know, however, that in my HT 2.0 tests the L:R ratio remained constant as I switched DIMM speeds. I was not surprised by that testing because, after all, with CNI it is entirely QPI or HT that varies the L:R ratio. Remember that CNI offerings are single-motherboard offerings. This, of course, presumes all memory in all nodes is of equal speed.

      Once memory “hops” come into play the L:R ratio opens up. How far it opens up depends on how good the NUMA bus is (e.g., QPI or HT 3.0).

      Remember, by the way, that thus far we are only talking about the “feel” of memory for a process executing on one node and accessing memory from another node. We haven’t (yet) started discussing what happens to the “effective” L:R when there is runtime, cross-node demand placed on a line. That’s where it gets really fun.

  2. Glenn Fawcett May 15, 2009 at 8:19 pm

    There is a benchmark tool called lmbench that we have used in the past to determine memory latency. I don’t currently have a version to test remote latencies, but the source is out on the web.
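
    For example, one way to approximate an L:R measurement is to pin lat_mem_rd with numactl. A sketch, assuming a two-node Linux box and lmbench built from source (sizes and strides below are illustrative):

      # Local latency: run on node 0 CPUs with memory forced to node 0
      $ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 512 128
      # Remote latency: same CPUs, memory forced to node 1
      $ numactl --cpunodebind=0 --membind=1 ./lat_mem_rd 512 128
      # Compare the flat tails of the two curves (array sizes well past
      # the caches); their ratio approximates the uncontended L:R.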

  3. George May 16, 2009 at 5:48 pm

    Hi Kevin

    Well it took me to the end of the blog to make sense of everything, but got to say one of the most interesting blog entries yet.

    Any idea of the L:R values between, say, an E10K, E15K, E20K, E25K, and then the M9000?

    how have the Intel & AMD architectures changed / improved or not over the years.

    Where do the Itanium products of HP sit? Are there big differences in the numbers of, say, an rx6600 compared to the Superdome?
    G

    • kevinclosson May 16, 2009 at 10:16 pm

      Thanks George…so you’ve read all my posts and grade this one that high? 🙂

      Joking aside, I don’t have many L:R ratios to cite. I had ProLiant DL585 4-socket Opteron numbers on HT 2.0 committed to memory, but in fairness I won’t discuss them since HT 3.0 is current technology. I’m told by my good friend Glenn Fawcett that the M9000 is 1.3:1. Thus I use it as a great case in point for defining MPNI.

      Itanium? No comment, er, uh, well ok… it would be fun to see what it does with QPI (Tukwila) and quad-core.

    • DuncanE May 17, 2009 at 2:55 pm

      Kevin,

      A thousand thanks for writing this particular blog post – it clears up a lot of the confusion that seemed to be in the metalink article – I’m looking forward to seeing the re-written version.

      George, since you asked about the HP/Itanium servers…

      rx6600 isn’t a NUMA server – so no comment on that one.

      For Superdome, which I guess is the most similar to the M9000, you can see the latency figures on p. 11 of this doc: 5982-9836EN.pdf

      Superdomes can have > 2 latency domains, so that ratio is going to vary from approx. 1.3:1 for 16-socket/32-core systems right up to 2.1:1 for 64-socket/128-core systems, with some other points in between. Of course, Superdome technology is pretty old now and, as Kevin said, it will be interesting to see where it goes with Tukwila and QPI.

      Duncan

      • kevinclosson May 17, 2009 at 10:18 pm

        Duncan,

        You’re welcome… I’ll add that Superdome is based on a server that I held in highest regard in the late 90s…the Convex Exemplar -based V Series. Those were some really cool systems.

  4. mbobak July 9, 2010 at 10:32 am

    Hi Kevin,

    Guess I’m a little late to this subject. I’m currently running RHEL 5.3 with kernel 2.6.18-128.el5 and 11.2.0.1 in a 4-node RAC on HP DL360 servers with 2-socket quad-core “Intel(R) Xeon(R) CPU X5570 @ 2.93GHz”.

    I assume this is an example of a system that falls into the CNI category?

    Looking at:
    [oracle@msrac201 ~]$ ls /sys/devices/system/node/*
    /sys/devices/system/node/node0:
    cpu0 cpu12 cpu2 cpu6 cpumap meminfo
    cpu10 cpu14 cpu4 cpu8 distance numastat

    /sys/devices/system/node/node1:
    cpu1 cpu13 cpu3 cpu7 cpumap meminfo
    cpu11 cpu15 cpu5 cpu9 distance numastat

    This means that NUMA is currently enabled, correct?

    Looking at that MOS note you mention above, it appears to have not been updated for 11.2.

    So, I guess my question is, what about 11.2? Should I have NUMA enabled? The way I read it, it sounds like I shouldn’t bother. But in my case, it’s already enabled, and, so far at least, I haven’t run into any problems.

    So, should I bother disabling it, if there doesn’t appear to be any compelling reason to do so? Any other thoughts or ideas with respect to my particular configuration?

    Thanks!

    -Mark

  5. mbobak July 9, 2010 at 10:44 am

    Oh, and if it matters, I have HugePages enabled, and I’m not using automatic memory management.

    -Mark

  6. mbobak July 9, 2010 at 10:52 am

    Sorry to keep spamming you, but, I just discovered MOS 864633.1 “Enable Oracle NUMA support with Oracle Server Version 11.2.0.1”.

    Thoughts?

    -Mark


