Intel Xeon 5500 (Nehalem EP) NUMA Versus Interleaved Memory (aka SUMA): There Is No Difference! A Forced Confession.

Published August 14, 2009 oracle 12 Comments

I received an interesting email recently from a reader who takes offense at how I dare to discuss the differences between Intel Xeon 5500 (Nehalem) systems operating in NUMA versus SUMA/SUMO mode. One excerpt of the email read:

…and I think you are just creating confusion and chaos to gain popularity with your NUMA versus non-NUMA stuff. We tested everything we can think of and see no difference when booted with NUMA or non-NUMA…

I don’t doubt for one moment that the testing performed by this reader showed no performance differences between NUMA and SUMA because I have no idea whatsoever what his testing consisted of. And, besides, Xeon 5500 Nehalem EP is one extremely nice NUMA package. That is, when running non-NUMA aware software on this particular NUMA offering you can rest assured that you won’t likely fall over dead from NUMA pathologies. That’s good, but does that mean there really is no difference when booted in NUMA versus SUMA mode? Hardly!

Please allow me to explain something. Intel Xeon 5500 (Nehalem) is a very tightly coupled NUMA system. Remote memory references are only about 20% more costly than local. If you measure a workload that does not saturate the processors you are very unlikely to detect any difference in throughput. If you have a program that only drives a processor core to, say, 80% utilization you will most likely not see any throughput difference whether the process performs all its I/O into remote memory or local memory. When using only remote memory the process would consume moderately more processor cycles. However, unless the code is so synthetic that it forces a high rate of L2 misses, the result would likely be equivalent throughput in both the local and remote cases.
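A rough way to put numbers on that headroom argument is sketched below. The function name `remote_util` and the "stall fraction" parameter are illustrative assumptions, not measurements: the model simply inflates the memory-stalled share of a core's busy time by the roughly 20% remote penalty quoted above and reports the utilization needed to hold the same throughput.

```shell
#!/bin/sh
# remote_util BASE_PCT STALL_FRACTION PENALTY
# Hypothetical back-of-the-envelope model (an assumption, not a measurement):
#   BASE_PCT       core utilization with all-local memory, in percent
#   STALL_FRACTION share of busy cycles spent waiting on memory (0..1)
#   PENALTY        extra cost of a remote reference (0.20 = 20%)
# Prints the utilization needed to sustain the same throughput when
# every memory reference is remote.
remote_util() {
    awk -v base="$1" -v stall="$2" -v pen="$3" \
        'BEGIN { printf "%.1f\n", base * (1 + stall * pen) }'
}

remote_util 80 0.5 0.20   # half of the busy cycles are memory stalls
remote_util 80 1.0 0.20   # worst case: every busy cycle is a memory stall
```

Under these assumptions the two calls print 88.0 and 96.0. Even in the worst case the core stays below 100% busy, which is why a workload that leaves this much headroom shows the same throughput either way.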

NUMA/SUMA: The Ever-Hypothetical Topic
Let’s stop talking in the hypothetical. How about something that, gasp, real Oracle Database Administrators have to do more than just occasionally. Consider for a moment transferring a sizable zipped ASCII file in preparation for loading into an Oracle Data Warehouse. When booting in the default NUMA mode and running Linux, memory is presented to processes in multiple hierarchies. For example, the following box shows a freshly booted Intel Xeon 5500 (Nehalem EP) box with 16 GB total RAM segmented into two memories. Notice how just 7 minutes after booting up memory has been consumed in a non-symmetrical fashion. The numactl command shows that node 1 has roughly 40% more free memory than node 0 (7955 MB versus 5773 MB). Put another way, about 2.3 GB has already been allocated from node 0 but only about 125 MB from node 1. That’s because not every memory usage in the Linux kernel (including drivers) is NUMA aware. But that is not what I’m blogging about.

# uptime;numactl --hardware
 13:28:30 up 7 min,  1 user,  load average: 0.00, 0.09, 0.07
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 5773 MB
node 1 size: 8080 MB
node 1 free: 7955 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
# cat /proc/meminfo
MemTotal:     16427752 kB
MemFree:      14059424 kB
Buffers:         19588 kB
Cached:         239480 kB
SwapCached:          0 kB
Active:          66308 kB
Inactive:       217152 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     16427752 kB
LowFree:      14059424 kB
SwapTotal:     2097016 kB
SwapFree:      2097016 kB
Dirty:            1848 kB
Writeback:           0 kB
AnonPages:       24408 kB
Mapped:          15024 kB
Slab:           170920 kB
PageTables:       3512 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  10310892 kB
Committed_AS:   382752 kB
VmallocTotal: 34359738367 kB
VmallocUsed:    381716 kB
VmallocChunk: 34359356623 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
# free
             total       used       free     shared    buffers     cached
Mem:      16427752    2370064   14057688          0      19872     239852
-/+ buffers/cache:    2110340   14317412
Swap:      2097016          0    2097016
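The imbalance is easier to see as memory used per node rather than memory free. The sketch below is one way to derive that from `numactl --hardware` output. The helper name `node_used` is made up for illustration, and the sample input is copied from the box above so the snippet runs without numactl installed.

```shell
#!/bin/sh
# node_used: read `numactl --hardware` output on stdin and print the
# memory in use (size minus free) for each node, in MB.
node_used() {
    awk '$1 == "node" && $3 == "size:" { size[$2] = $4 }
         $1 == "node" && $3 == "free:" { printf "node %s used: %d MB\n", $2, size[$2] - $4 }'
}

# Sample input copied from the box above.
# On a live system: numactl --hardware | node_used
node_used <<'EOF'
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 5773 MB
node 1 size: 8080 MB
node 1 free: 7955 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
EOF
```

For the sample input this prints `node 0 used: 2279 MB` and `node 1 used: 125 MB`.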

In this section of this blog entry I’d like to show a practical example of honest-to-goodness, real world work that doesn’t exhibit totally benign NUMA characteristics. Within a VNC session I opened two xterm sessions. I’ll call them “left” and “right.” In the left xterm I’ll list a zipped ASCII file to capture the inode so as to prove my testing is happening against the same file. The file is inode 1701506. You’ll also see a stupid little script called henny_penny.sh, named appropriately as I apparently come off as Henny Penny to folks like the reader who emailed me. The henny_penny.sh script executed in the left xterm showed that a shell with a parent process id of 23283 was able to sling the contents of all_card_trans.ul.gz into /dev/null at the rate of 4.9 GB/s. That is very fast indeed. It is that fast, in fact, because the file has been moved into the current directory with FTP so the contents of the approximately 1.5 GB file are cached in memory. Ah, but the question is, what memory?

# ls -li all* henny_penny*
1701506 -rw-r--r-- 1 root root 1472114768 Aug 14 11:31 all_card_trans.ul.gz
1701513 -rwxr-xr-x 1 root root         90 Aug 14 12:17 henny_penny.sh
# cat henny_penny.sh
ps -f
ls -li all_card_trans.ul.gz
date
dd if=all_card_trans.ul.gz of=/dev/null bs=1M
date
# sh ./henny_penny.sh
UID        PID  PPID  C STIME TTY          TIME CMD
root     23283 23280  0 12:13 pts/0    00:00:00 -bash
root     23849 23283  0 12:18 pts/0    00:00:00 sh ./henny_penny.sh
root     23850 23849  0 12:18 pts/0    00:00:00 ps -f
1701506 -rw-r--r-- 1 root root 1472114768 Aug 14 11:31 all_card_trans.ul.gz
Fri Aug 14 12:18:12 PDT 2009
1403+1 records in
1403+1 records out
1472114768 bytes (1.5 GB) copied, 0.30021 seconds, 4.9 GB/s
Fri Aug 14 12:18:12 PDT 2009

In the following box you’ll see how things behaved in the right xterm. I invoked henny_penny.sh (parent PID 23422) and voilà, dd(1) was able to shovel the contents of all_card_trans.ul.gz into /dev/null at a rate of 6.0 GB/s. Now, that’s only about 22% faster for a totally memory-bound, CPU-saturated task, so why would anyone other than Henny Penny care? Notice how the henny_penny.sh script included the output of the date(1) command. Just three seconds after “left” was muddling through at 4.9 GB/s, “right” proceeded to rip through at 6.0 GB/s. Yes, memory hierarchy matters.

# sh ./henny_penny.sh
UID        PID  PPID  C STIME TTY          TIME CMD
root     23422 23420  0 12:14 pts/3    00:00:00 -bash
root     23856 23422  0 12:18 pts/3    00:00:00 sh ./henny_penny.sh
root     23857 23856  0 12:18 pts/3    00:00:00 ps -f
1701506 -rw-r--r-- 1 root root 1472114768 Aug 14 11:31 all_card_trans.ul.gz
Fri Aug 14 12:18:15 PDT 2009
1403+1 records in
1403+1 records out
1472114768 bytes (1.5 GB) copied, 0.244703 seconds, 6.0 GB/s
Fri Aug 14 12:18:15 PDT 2009
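The gap between the two runs can be checked from the elapsed times dd reported instead of its rounded GB/s figures. This is a small sketch. The helper name `dd_seconds` is an assumption for illustration, and the two summary lines are copied from the boxes above.

```shell
#!/bin/sh
# dd_seconds: read a dd summary line on stdin and print the elapsed seconds.
dd_seconds() {
    awk -F', ' '/copied/ { split($2, t, " "); print t[1] }'
}

# Summary lines copied from the "left" and "right" runs above.
left=$(echo '1472114768 bytes (1.5 GB) copied, 0.30021 seconds, 4.9 GB/s' | dd_seconds)
right=$(echo '1472114768 bytes (1.5 GB) copied, 0.244703 seconds, 6.0 GB/s' | dd_seconds)

# Same byte count in both runs, so the ratio of elapsed times is the speedup.
awk -v l="$left" -v r="$right" \
    'BEGIN { printf "right is %.1f%% faster\n", (l / r - 1) * 100 }'
```

With the timings above this prints `right is 22.7% faster`, in line with the roughly 22% quoted from the rounded throughput numbers.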

How, What, Why?
The left xterm and its children happen to be executing on cores 0-3 (SMT disabled at the moment but no matter) and the right xterm on cores 4-7. The FTP process executed on one (or more) of cores 4-7 and since Linux prefers to allocate buffers to a process such as this from local memory, you can see why henny_penny.sh in the right xterm achieved the throughput it did.
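Rather than relying on where the scheduler happens to place each shell, the left/right comparison could be made repeatable by pinning dd to one node's cores with numactl(8). The following is only a sketch. The wrapper name `read_on_node` and its `DRY_RUN` switch are made up for illustration, and it assumes node 0 holds cores 0-3 and node 1 holds cores 4-7, as on the box described above.

```shell
#!/bin/sh
# read_on_node NODE FILE
# Read FILE into /dev/null with dd while restricted to the CPUs of NUMA
# node NODE. With DRY_RUN=1 (the default here) the command is only
# printed, so the sketch can be tried on a machine without numactl.
read_on_node() {
    cmd="numactl --cpunodebind=$1 dd if=$2 of=/dev/null bs=1M"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

read_on_node 0 all_card_trans.ul.gz   # assumed equivalent of the "left" xterm
read_on_node 1 all_card_trans.ul.gz   # assumed equivalent of the "right" xterm
```

Setting `DRY_RUN=0` would actually run the reads. Only the CPUs are bound here, so the file stays wherever the page cache already holds it, which is the point of the comparison.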

Who Cares?
Likely nobody until the Xeon 5500 Linux production uptake actually starts! In the meantime there is me (Henny Penny) and a few curiously morbid (er, uh, morbidly curious) Googlers who might stumble upon this trivia.

What’s This Have To Do With Nehalem EX?
Well, even the 4-socket Nehalem EX packaging implements single-hop remote memory. That’s a significant difference from the way 4-sockets were done with HyperTransport. So, I actually don’t expect NUMAisms such as this to be any more painful than with EP (2 socket).

I Still Think He’s Henny Penny
So, let’s take another look at this topic. I’ve already mentioned that Linux likes to allocate memory close to processes when running on Nehalem systems. That’s good, isn’t it? Well, the answer is yes, of course, it depends.

In the following text box you’ll see how I depleted free memory (down to 40 MB free) from node 0 by writing zeros to a file. Consider yet another hypothetical with me for one moment. What happens when I execute, say, 100 processes that each allocate a moderate 16 MB of memory with malloc(3)? Do you think Linux will yank these processes from me, their parent, and place them on node 1 or will they be homed on node 0 with their heaps allocated from node 1? Will it matter? What if they are producers and I am their consumer? Where should they execute? What if they each work on 1/100th of the dumb_test.out file, reading into their respective heaps? Well, at this point there is no way for 100 processes on node 0 (socket 0) to attack 1/100th segments (buffering in their heap) of that file without 100% remote memory overhead. Could such a “bizarre” hypothetical happen in production? Sure. Is there any way to properly deal with such an issue? Well, yes and no.

If the hypothetical “1/100th program” were coded to libnuma then it could ensure process placement and therefore a local heap. However, what about the fact that my work file is buffered entirely in node 0 memory? Wouldn’t that guarantee 100% local access to node 0 users of that file but 100% remote for node 1 users? Yes. That’s great for the node 0 users you might say. However, those node 0 users had better not malloc(3) any memory because you know where that memory is going to come from. ‘Round and ’round we go…

# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 5946 MB
node 1 size: 8080 MB
node 1 free: 7987 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
# time dd if=/dev/zero of=dumb_test.out bs=1M count=5946;numactl --hardware
5946+0 records in
5946+0 records out
6234832896 bytes (6.2 GB) copied, 6.07315 seconds, 1.0 GB/s

real    0m6.091s
user    0m0.003s
sys     0m6.069s
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 40 MB
node 1 size: 8080 MB
node 1 free: 7652 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
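For a program that is not coded to libnuma, similar process placement can be imposed from the outside with numactl(8). The sketch below only prints the commands it would run. The `./worker` program and the `place_workers` helper are hypothetical names, and the round-robin split across nodes is just one possible policy.

```shell
#!/bin/sh
# place_workers COUNT NODES
# Print one numactl command per worker, spreading COUNT workers
# round-robin across NODES NUMA nodes. ./worker is a hypothetical
# program that takes its slice number as its only argument.
place_workers() {
    count=$1
    nodes=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        node=$((i % nodes))
        # --cpunodebind pins the worker to the CPUs of one node;
        # --localalloc asks for memory from the node it is running on.
        echo "numactl --cpunodebind=$node --localalloc ./worker $i"
        i=$((i + 1))
    done
}

place_workers 4 2   # use "place_workers 100 2" for the 100-process case
```

As far as I understand the policy, local allocation is a preference and not a guarantee: with node 0 drained to 40 MB as in the box above, workers pinned there would still have their heaps spill over to node 1.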

So, what if I cloak my test with libnuma attributes (inherited by dd from numactl(8))? In the following text box you’ll see that instead of a Cyclops, memory was allocated nice and evenly from the page cache when I filled out the dumb_test.out file. So in this model, processes homed on either node 0 or node 1 are guaranteed a 50% local access rate when accessing dumb_test.out and I am protected from memory imbalances. In fact, if it were my system and I had to stay with NUMA, I’d consider invoking shells under numactl --interleave. As such, any non-NUMA aware programs (like FTP) will be granted memory in a round-robin fashion but any NUMA aware program (coded to libnuma calls) will execute as it would without being wrapped with numactl. It’s just a thought. It isn’t any official recommendation and, as my email in-box suggests, it doesn’t matter anyway…nonetheless, I think the following looks better than a Cyclops:

# numactl --interleave=0,1 /bin/bash
# numactl -s
policy: interleave
preferred node: 0 (interleave next)
interleavemask: 0 1
interleavenode: 0
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0 1
nodebind: 0 1
membind: 0 1
# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 5957 MB
node 1 size: 8080 MB
node 1 free: 7988 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
# dd if=/dev/zero of=dumb_test.out bs=1M count=5957
5957+0 records in
5957+0 records out
6246367232 bytes (6.2 GB) copied, 6.24962 seconds, 999 MB/s
# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 2825 MB
node 1 size: 8080 MB
node 1 free: 4854 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
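To compare the two dd runs, the before and after free-memory figures can be reduced to how much each node gave up. The helper `node0_share` below is a hypothetical name. It treats the drop in free memory as memory consumed by the test, which ignores any other activity on the box.

```shell
#!/bin/sh
# node0_share B0 A0 B1 A1
# B0/A0: free MB on node 0 before/after the test; B1/A1: same for node 1.
# Prints the MB consumed on each node and node 0's share of the total.
node0_share() {
    awk -v b0="$1" -v a0="$2" -v b1="$3" -v a1="$4" 'BEGIN {
        d0 = b0 - a0
        d1 = b1 - a1
        printf "node 0: %d MB, node 1: %d MB, node 0 share: %.1f%%\n", d0, d1, 100 * d0 / (d0 + d1)
    }'
}

node0_share 5946 40 7987 7652     # default policy run
node0_share 5957 2825 7988 4854   # numactl --interleave=0,1 run
```

With the figures from the two boxes this prints a node 0 share of 94.6% for the default run (5906 MB versus 335 MB) and 50.0% for the interleaved run (3132 MB versus 3134 MB).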

12 Responses to “Intel Xeon 5500 (Nehalem EP) NUMA Versus Interleaved Memory (aka SUMA): There Is No Difference! A Forced Confession.”


1 George August 19, 2009 at 4:17 pm

hmm

so I assume I’m one of those morbidly curious readers. 🙂
Ok so just as I was starting to fit things together the posting ended.

So we know the last output looks better. For us mere mortals, I assume this all depends on the application vendor actually using the improved memory allocation logic, so as to split memory usage from 100% local to a 50/50 local/remote allocation by using the correct calls?
G

- 2 kevinclosson August 19, 2009 at 6:03 pm
  
  well, the “application” in this case is a simple dd…as the following shows it doesn’t link (declaratively) with libnuma…so it can’t possibly chew up memory in a more distributed fashion even if it wanted to…and, no, it doesn’t link with libnuma using a dlopen() either…so it is totally non-NUMA aware.
  
  $ type dd
  dd is /bin/dd
  $ ldd /bin/dd
  librt.so.1 => /lib64/librt.so.1 (0x00000035c3200000)
  libc.so.6 => /lib64/libc.so.6 (0x00000035c1e00000)
  libpthread.so.0 => /lib64/libpthread.so.0 (0x00000035c2a00000)
  /lib64/ld-linux-x86-64.so.2 (0x00000035c1a00000)
  
  …same for scp:
  
  $ type scp
  scp is /usr/bin/scp
  $ ldd /usr/bin/scp
  libcrypto.so.6 => /lib64/libcrypto.so.6 (0x00002aafe1bfa000)
  libutil.so.1 => /lib64/libutil.so.1 (0x00002aafe1f4b000)
  libz.so.1 => /usr/lib64/libz.so.1 (0x00002aafe214e000)
  libnsl.so.1 => /lib64/libnsl.so.1 (0x00002aafe2363000)
  libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00002aafe257b000)
  libresolv.so.2 => /lib64/libresolv.so.2 (0x00002aafe27b3000)
  libgssapi_krb5.so.2 => /usr/lib64/libgssapi_krb5.so.2 (0x00002aafe29c9000)
  libkrb5.so.3 => /usr/lib64/libkrb5.so.3 (0x00002aafe2bf7000)
  libk5crypto.so.3 => /usr/lib64/libk5crypto.so.3 (0x00002aafe2e8c000)
  libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00002aafe30b2000)
  libnss3.so => /usr/lib64/libnss3.so (0x00002aafe32b4000)
  libc.so.6 => /lib64/libc.so.6 (0x00002aafe3605000)
  libdl.so.2 => /lib64/libdl.so.2 (0x00002aafe395c000)
  libkrb5support.so.0 => /usr/lib64/libkrb5support.so.0 (0x00002aafe3b60000)
  libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00002aafe3d68000)
  libnssutil3.so => /usr/lib64/libnssutil3.so (0x00002aafe3f6b000)
  libplc4.so => /usr/lib64/libplc4.so (0x00002aafe4187000)
  libplds4.so => /usr/lib64/libplds4.so (0x00002aafe438b000)
  libnspr4.so => /usr/lib64/libnspr4.so (0x00002aafe458f000)
  libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aafe47c9000)
  /lib64/ld-linux-x86-64.so.2 (0x00002aafe19dd000)
  libselinux.so.1 => /lib64/libselinux.so.1 (0x00002aafe49e4000)
  libsepol.so.1 => /lib64/libsepol.so.1 (0x00002aafe4bfd000)
  
  So…if you want such tools to consume memory in a manner that cannot create an imbalance you have to wrap your shell in numactl --interleave so that the memory placement attributes get inherited by such children as dd, tar, scp, ftp, etc … anything that uses significant file system buffer cache…
  
3 George August 19, 2009 at 6:30 pm

Hi Kevin,

I was not really looking at tools like dd, ftp, etc. I was thinking more that, to get maximum benefit out of all this nice potential performance, the vendor writing an application has to code it using the proper memory calls so that the required memory is distributed, instead of just asking for memory the old way.

As you said, an app written with NUMA concepts in mind.
G

- 4 kevinclosson August 19, 2009 at 6:47 pm
  
  Software optimization for NUMA awareness has been on the minds of a very small, select group of industry “outsiders” since the mid 90s…. Then AMD came out with Opteron bringing NUMA to the commodity market…then Intel came out with QPI…
  
  The Sequent diaspora sent some, uh, really NUMA-knowledgeable folks to the 4 corners of the earth… there, they (we) struggled to find ears willing to hear of the importance of correct NUMA software awareness…
  
  Software maturation is a never ending loop of pay now or pay later decisions.
  
  It’s my not-so-noble goal to disseminate some information that will make Oracle users who are adopting Nehalem a little more aware of how the clock works…nothing more than that…
  
5 Bernd Eckenfels August 21, 2009 at 7:53 pm

First of all, the article would, I think, be a bit cleaner if you skipped the reference to xterm and ftp. How about “two shells” and then you drop the caches and then you dd the file 2 times in each shell (where only the first dd will suck it into (local) cache).

But another thing: I don’t get the numa-interleave recommendation. As an application vendor who does not want to invest too much fine tuning on systems, I would recommend running 2 instances (with Java in my case you don’t want too big a heap anyway), each on its own NUMA set. As long as they don’t share files (buffers) and fit into local memory I don’t need much numactl interleave magic, right?

Gruss
Bernd

- 6 kevinclosson August 24, 2009 at 8:18 pm
  
  Feedback welcome, but let me point out that an application that can be expressed as “2 Instances” is a cake-walk to deploy nice and neatly in a 2-node NUMA environment. This blog is much more Oracle Database minded than that…
  
7 Bernd Eckenfels August 25, 2009 at 12:52 am

Well, sure, if your app or db does not fit in a zone (or if your guest virtual machine spans multiple zones), then you have to deal with it, and it might be a good idea to just stay away from the architecture if you want to save the additional work.

However, if you consolidate workloads, the more flexible (and perhaps cheaper) NUMA servers are less risky to choose.

Having said that, this specific article was less about Oracle and more about the basic characteristics of Nehalem.

Oh yes some Ora benchmarking will be interesting in the future. Intel asks a lot from their traditional “fast single thread, low concurrency, uniform access” minded software development partners (and Oracle might not be one of those).

Greetings
Bernd

- 8 kevinclosson August 25, 2009 at 8:45 pm
  
  “Having said that, this specific article was less about Oracle and more about the basic characteristics of Nehalem.”
  
  Not necessarily so. I speak of what real life on NUMA means when you are an Oracle shop. All Oracle servers sustain an amount of “non-Oracle” processing. By “non-Oracle” I mean such administrative tasks as file transfers and system file backups and so forth. Oracle DBAs like to know about what sort of interesting things can happen on their Oracle servers when doing such things. If the article is not interesting to you because you focus more on a Java-based application then I understand your feedback.
  
9 George November 20, 2009 at 7:47 am

As an Oracle DBA, oh hold it, that used to be what I did, now I’m supposed to be part of pre-sales, or is that support, or consulting, very confusing at times.

We DBA-minded guys are generally very curious about anything and everything that runs on our servers and the impact it might have, so yes, Oracle load and non-Oracle load.

G

10 Zerrial Bass October 14, 2010 at 5:59 pm

Is anyone aware of any benchmarks using the flock command in a NUMA environment?

11 Bernd October 14, 2010 at 11:04 pm

flock? I don’t think a serialized/non-parallel or I/O- or sleep-heavy benchmark has any NUMA relevance. Cache bounces might be the only possible stress effect of flock, but that’s a general multi-CPU problem, not a particularly NUMA one. Maybe you want to elaborate on what you expect or why you see a need for that?

Bernd

- 12 kevinclosson October 14, 2010 at 11:26 pm
  
  Bernd,
  
  I agree.
  


Kevin Closson's Blog: Platforms, Databases and Storage