I received an interesting email recently from a reader who takes offense at how I dare to discuss the differences between Intel Xeon 5500 (Nehalem) systems operating in NUMA versus SUMA/SUMO mode. One excerpt of the email read:
…and I think you are just creating confusion and chaos to gain popularity with your NUMA versus non-NUMA stuff. We tested everything we can think of and see no difference when booted with NUMA or non-NUMA…
I don’t doubt for one moment that the testing performed by this reader showed no performance differences between NUMA and SUMA because I have no idea whatsoever what his testing consisted of. And, besides, Xeon 5500 Nehalem EP is one extremely nice NUMA package. That is, when running non-NUMA aware software on this particular NUMA offering you can rest assured that you won’t likely fall over dead from NUMA pathologies. That’s good, but does that mean there really is no difference when booted in NUMA versus SUMA mode? Hardly!
Please allow me to explain something. Intel Xeon 5500 (Nehalem) is a very tightly coupled NUMA system. Remote memory references are only about 20% more costly than local. If you measure a workload that does not saturate the processors you are very unlikely to detect any difference in throughput. If you have a program that only drives a processor core to, say, 80% utilization you will most likely not see any throughput difference if the process performs all its I/O into remote memory or local memory. When using only remote memory the process would consume moderately more processor cycles, however unless the code is overly-synthetic so as to force a high rate of L2 misses the result would likely be equivalent throughput in both the local and remote cases.
NUMA/SUMA: The Ever-Hypothetical Topic
Let’s stop talking in the hypothetical. How about something that, gasp, real Oracle Database Administrators have to do more than just occasionally. Consider for a moment transferring a sizable zipped ASCII file in preparation for loading into an Oracle Data Warehouse. When booting in the default NUMA mode and running Linux, memory is presented to processes in multiple hierarchies. For example, the following box shows a freshly booted Intel Xeon 5500 (Nehalem EP) box with 16 GB total RAM segmented into two memories. Notice how just 7 minutes after booting up memory has been consumed in a non-symmetrical fashion. The numactl command shows that node 1 has roughly 40% more free memory than node 0 (7955 MB versus 5773 MB), because far more has been allocated from node 0. That imbalance exists because not every memory consumer in the Linux kernel (including drivers) is NUMA aware. But that is not what I’m blogging about.
# uptime;numactl --hardware
 13:28:30 up 7 min, 1 user, load average: 0.00, 0.09, 0.07
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 5773 MB
node 1 size: 8080 MB
node 1 free: 7955 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
# cat /proc/meminfo
MemTotal: 16427752 kB
MemFree: 14059424 kB
Buffers: 19588 kB
Cached: 239480 kB
SwapCached: 0 kB
Active: 66308 kB
Inactive: 217152 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 16427752 kB
LowFree: 14059424 kB
SwapTotal: 2097016 kB
SwapFree: 2097016 kB
Dirty: 1848 kB
Writeback: 0 kB
AnonPages: 24408 kB
Mapped: 15024 kB
Slab: 170920 kB
PageTables: 3512 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 10310892 kB
Committed_AS: 382752 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 381716 kB
VmallocChunk: 34359356623 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
# free
             total       used       free     shared    buffers     cached
Mem:      16427752    2370064   14057688          0      19872     239852
-/+ buffers/cache:    2110340   14317412
Swap:      2097016          0    2097016
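The imbalance is easier to see as memory used per node rather than memory free. The following is a minimal sketch, not anything I ran for the test above: `node_usage` is a hypothetical helper name, and it assumes only the "node N size:" and "node N free:" line format shown in the numactl output.

```shell
#!/bin/sh
# Summarize per-node memory use from `numactl --hardware` output on stdin.
# node_usage is a hypothetical helper; it relies only on the
# "node N size: X MB" and "node N free: Y MB" lines.
node_usage() {
    awk '
        $1 == "node" && $3 == "size:" { size[$2] = $4 }
        $1 == "node" && $3 == "free:" { free[$2] = $4 }
        END {
            for (n in size)
                printf "node %s used: %d MB of %d MB\n", n, size[n] - free[n], size[n]
        }
    ' | sort
}

# Live use, only where numactl is installed.
if command -v numactl >/dev/null 2>&1; then
    numactl --hardware | node_usage
fi
```

Fed the boot-time figures above, this reports 2279 MB used on node 0 against 125 MB on node 1.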
In this section of this blog entry I’d like to show a practical example of honest-to-goodness, real world work that doesn’t exhibit totally benign NUMA characteristics. Within a VNC session I opened two xterm sessions. I’ll call them “left” and “right.” In the left xterm I’ll list a zipped ASCII file to capture the inode so as to prove my testing is happening against the same file. The file is inode 1701506. You’ll also see a stupid little script called henny_penny.sh, named appropriately as I apparently come off as Henny Penny to folks like the reader who emailed me. The henny_penny.sh script executed in the left xterm showed that a shell with a parent process id of 23283 was able to sling the contents of all_card_trans.ul.gz into /dev/null at the rate of 4.9 GB/s. That is very fast indeed. It is that fast, in fact, because the file was moved into the current directory with FTP, so the contents of the approximately 1.5 GB file are cached in memory. Ah, but the question is, what memory?
# ls -li all* henny_penny*
1701506 -rw-r--r-- 1 root root 1472114768 Aug 14 11:31 all_card_trans.ul.gz
1701513 -rwxr-xr-x 1 root root         90 Aug 14 12:17 henny_penny.sh
# cat henny_penny.sh
ps -f
ls -li all_card_trans.ul.gz
date
dd if=all_card_trans.ul.gz of=/dev/null bs=1M
date
# sh ./henny_penny.sh
UID        PID  PPID  C STIME TTY          TIME CMD
root     23283 23280  0 12:13 pts/0    00:00:00 -bash
root     23849 23283  0 12:18 pts/0    00:00:00 sh ./henny_penny.sh
root     23850 23849  0 12:18 pts/0    00:00:00 ps -f
1701506 -rw-r--r-- 1 root root 1472114768 Aug 14 11:31 all_card_trans.ul.gz
Fri Aug 14 12:18:12 PDT 2009
1403+1 records in
1403+1 records out
1472114768 bytes (1.5 GB) copied, 0.30021 seconds, 4.9 GB/s
Fri Aug 14 12:18:12 PDT 2009
In the following box you’ll see how things behaved in the right xterm. I invoked henny_penny.sh (parent PID 23422) and voila dd(1) was able to shovel the contents of all_card_trans.ul.gz into /dev/null at a rate of 6 GB/s. Now, that’s only 22% faster for a totally memory-bound, CPU-saturated task so why would anyone other than Henny Penny care? Notice how the henny_penny.sh script included the output of the date(1) command. Just three seconds after “left” was muddling through at 4.9 GB/s, “right” proceeded to rip through at 6.0 GB/s. Yes, memory hierarchy matters.
# sh ./henny_penny.sh
UID        PID  PPID  C STIME TTY          TIME CMD
root     23422 23420  0 12:14 pts/3    00:00:00 -bash
root     23856 23422  0 12:18 pts/3    00:00:00 sh ./henny_penny.sh
root     23857 23856  0 12:18 pts/3    00:00:00 ps -f
1701506 -rw-r--r-- 1 root root 1472114768 Aug 14 11:31 all_card_trans.ul.gz
Fri Aug 14 12:18:15 PDT 2009
1403+1 records in
1403+1 records out
1472114768 bytes (1.5 GB) copied, 0.244703 seconds, 6.0 GB/s
Fri Aug 14 12:18:15 PDT 2009
How, What, Why?
The left xterm and its children happen to be executing on cores 0-3 (SMT disabled at the moment but no matter) and the right xterm on cores 4-7. The FTP process executed on one (or more) of cores 4-7 and since Linux prefers to allocate buffers to a process such as this from local memory, you can see why henny_penny.sh in the right xterm achieved the throughput it did.
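For anyone wanting to check which cores and node their own shell is sitting on, here is one possible sketch. It assumes a Linux box that exposes /sys/devices/system/node/nodeN/cpulist and a procps ps that supports the psr column; `cpu_in_list` and `node_of_cpu` are hypothetical helper names. Keep in mind the answer is only a snapshot, since the scheduler is free to move an unpinned shell.

```shell
#!/bin/sh
# Succeed if CPU number $1 appears in a kernel cpulist string $2,
# e.g. "0-3" or "0-3,8-11" (the format of nodeN/cpulist in sysfs).
cpu_in_list() {
    echo "$2" | tr ',' '\n' | awk -v cpu="$1" -F- '
        NF == 1 && $1 == cpu              { found = 1 }
        NF == 2 && cpu >= $1 && cpu <= $2 { found = 1 }
        END { exit !found }
    '
}

# Print the NUMA node that owns CPU $1, according to sysfs (Linux only).
node_of_cpu() {
    for d in /sys/devices/system/node/node[0-9]*; do
        [ -r "$d/cpulist" ] || continue
        if cpu_in_list "$1" "$(cat "$d/cpulist")"; then
            echo "${d##*/node}"
            return 0
        fi
    done
    return 1
}

# Which CPU did this shell last run on? "psr" is the procps column for that.
if [ -d /sys/devices/system/node ]; then
    cpu=$(ps -o psr= -p $$ 2>/dev/null | tr -d ' ')
    if [ -n "$cpu" ]; then
        echo "shell $$ last ran on CPU $cpu, node $(node_of_cpu "$cpu")"
    fi
fi
```

On the box above, a shell reporting a CPU in the 0-3 range would be on the same socket as the left xterm, and 4-7 would match the right one.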
Who Cares?
Likely nobody until the Xeon 5500 Linux production uptake actually starts! In the meantime there is me (Henny Penny) and a few curiously morbid (er, uh, morbidly curious) Googlers who might stumble upon this trivia.
What’s This Have To Do With Nehalem EX?
Well, even the 4-socket Nehalem EX packaging implements single-hop remote memory. That’s a significant difference from the way 4-sockets were done with HyperTransport. So, I actually don’t expect NUMAisms such as this to be any more painful than with EP (2 socket).
I Still Think He’s Henny Penny
So, let’s take another look at this topic. I’ve already mentioned that Linux likes to allocate memory close to processes when running on Nehalem systems. That’s good, isn’t it? Well, the answer is yes, of course, it depends.
In the following text box you’ll see how I depleted free memory (down to 40MB free) from node 0 by writing zeros to a file. Consider yet another hypothetical with me for one moment. What happens when I execute, say, 100 processes that each allocates a moderate 16 MB of memory with malloc(3)? Do you think Linux will yank these processes from me, their parent, and place them on node 1 or will they be homed on node 0 with their heaps allocated from node 1? Will it matter? What if they are producers and I am their consumer? Where should they execute? What if they each work on 1/100th of the dumb_test.out file reading into their respective heap? Well, at this point there is no way for 100 processes on node 0 (socket 0) to attack 1/100th segments (buffering in their heap) of that file without 100% remote memory overhead. Could such a “bizarre” hypothetical happen in production? Sure. Is there any way to properly deal with such an issue? Well, yes and no.
If the hypothetical “1/100th program” was coded to libnuma then it can assure process placement and therefore local heap. However, what about the fact that my work file is buffered entirely on node 0 memory? Wouldn’t that guarantee 100% local access to node 0 users of that file but 100% remote for node 1 users? Yes. That’s great for the node 0 users you might say. However, those node 0 users had better not malloc(3) any memory because you know where that memory is going to come from. ‘Round and ’round we go…
# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 5946 MB
node 1 size: 8080 MB
node 1 free: 7987 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
# time dd if=/dev/zero of=dumb_test.out bs=1M count=5946;numactl --hardware
5946+0 records in
5946+0 records out
6234832896 bytes (6.2 GB) copied, 6.07315 seconds, 1.0 GB/s

real 0m6.091s
user 0m0.003s
sys 0m6.069s
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 40 MB
node 1 size: 8080 MB
node 1 free: 7652 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
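The libnuma remark above is about programs that place themselves. For a program that is not coded to libnuma, roughly similar placement can be imposed from the outside with numactl. What follows is a sketch under assumptions, not something from my test runs: `run_on_node` is a hypothetical wrapper name, and it defaults to a dry run that only prints the command line. The `--cpunodebind` and `--membind` options are real numactl options. Note that `--membind` is strict, so `--preferred` may be the gentler choice when the chosen node is nearly full, as node 0 is above.

```shell
#!/bin/sh
# Build (and optionally run) a command pinned to one NUMA node.
# run_on_node is a hypothetical helper; the numactl options are real.
#   $1    = node number
#   $2... = command to run
# With DRY_RUN=1 (the default here) the command line is only printed.
run_on_node() {
    node=$1
    shift
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "numactl --cpunodebind=$node --membind=$node $*"
    else
        numactl --cpunodebind="$node" --membind="$node" "$@"
    fi
}

# Example: the command that would read the work file from node 1 CPUs
# with the process's own memory allocated on node 1.
run_on_node 1 dd if=dumb_test.out of=/dev/null bs=1M
```

Setting DRY_RUN=0 executes the command instead of printing it. Placement of the reading process does nothing about where the file is already buffered, which is the whole point of the round-and-round above.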
So, what if I cloak my test with libnuma attributes (inherited by dd from numactl(8))? In the following text box you’ll see that instead of a cyclops, memory was allocated nice and evenly for the page cache when I filled out the dumb_test.out file. So in this model, processes homed on either node 0 or node 1 are guaranteed a 50% local access rate when accessing dumb_test.out and I am protected from memory imbalances. In fact, if it were my system and I had to stay with NUMA, I’d consider invoking shells under numactl --interleave. As such, any non-NUMA aware program (like FTP) will be granted memory in a round-robin fashion but any NUMA aware program (coded to libnuma calls) will execute as it would without being wrapped with numactl. It’s just a thought. It isn’t any official recommendation and, as my email in-box suggests, it doesn’t matter anyway…nonetheless, I think the following looks better than a cyclops:
# numactl --interleave=0,1 /bin/bash
# numactl -s
policy: interleave
preferred node: 0 (interleave next)
interleavemask: 0 1
interleavenode: 0
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0 1
nodebind: 0 1
membind: 0 1
# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 5957 MB
node 1 size: 8080 MB
node 1 free: 7988 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
# dd if=/dev/zero of=dumb_test.out bs=1M count=5957
5957+0 records in
5957+0 records out
6246367232 bytes (6.2 GB) copied, 6.24962 seconds, 999 MB/s
# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 8052 MB
node 0 free: 2825 MB
node 1 size: 8080 MB
node 1 free: 4854 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
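To put numbers on cyclops versus not-cyclops, the before and after "free" figures from the two dd runs can be subtracted. This is just worked arithmetic on the output already shown; `consumed` is a hypothetical helper name.

```shell
#!/bin/sh
# Print how much free memory each node lost between two snapshots.
# Arguments are the "free" figures in MB, in the order:
#   node0_before node0_after node1_before node1_after
consumed() {
    awk -v b0="$1" -v a0="$2" -v b1="$3" -v a1="$4" 'BEGIN {
        d0 = b0 - a0; d1 = b1 - a1
        printf "node 0: %d MB, node 1: %d MB, node 0 share: %.0f%%\n",
               d0, d1, 100 * d0 / (d0 + d1)
    }'
}

consumed 5946 40 7987 7652     # default policy run
consumed 5957 2825 7988 4854   # numactl --interleave=0,1 run
```

The default run took 5906 MB from node 0 and 335 MB from node 1, roughly a 95% share for node 0. The interleaved run took 3132 MB and 3134 MB, which is as close to 50/50 as one could ask for.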
hmm
so I assume I’m one of those morbidly curious readers. 🙂
Ok so just as I was starting to fit things together the posting ended.
So we know the last output looks better. For us mere mortals, I assume this all depends on the application vendor actually using the improved memory allocation logic, so as to split memory usage from 100% local to a 50/50 local/remote allocation by using the correct calls?
G
well, the “application” in this case is a simple dd…as the following shows, it doesn’t link (declaratively) with libnuma…so it can’t possibly chew up memory in a more distributed fashion even if it wanted to…and, no, it doesn’t link with libnuma using a dlopen() either…so it is totally non-NUMA aware.
$ type dd
dd is /bin/dd
$ ldd /bin/dd
librt.so.1 => /lib64/librt.so.1 (0x00000035c3200000)
libc.so.6 => /lib64/libc.so.6 (0x00000035c1e00000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00000035c2a00000)
/lib64/ld-linux-x86-64.so.2 (0x00000035c1a00000)
…same for scp:
$ type scp
scp is /usr/bin/scp
$ ldd /usr/bin/scp
libcrypto.so.6 => /lib64/libcrypto.so.6 (0x00002aafe1bfa000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002aafe1f4b000)
libz.so.1 => /usr/lib64/libz.so.1 (0x00002aafe214e000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x00002aafe2363000)
libcrypt.so.1 => /lib64/libcrypt.so.1 (0x00002aafe257b000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00002aafe27b3000)
libgssapi_krb5.so.2 => /usr/lib64/libgssapi_krb5.so.2 (0x00002aafe29c9000)
libkrb5.so.3 => /usr/lib64/libkrb5.so.3 (0x00002aafe2bf7000)
libk5crypto.so.3 => /usr/lib64/libk5crypto.so.3 (0x00002aafe2e8c000)
libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00002aafe30b2000)
libnss3.so => /usr/lib64/libnss3.so (0x00002aafe32b4000)
libc.so.6 => /lib64/libc.so.6 (0x00002aafe3605000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002aafe395c000)
libkrb5support.so.0 => /usr/lib64/libkrb5support.so.0 (0x00002aafe3b60000)
libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00002aafe3d68000)
libnssutil3.so => /usr/lib64/libnssutil3.so (0x00002aafe3f6b000)
libplc4.so => /usr/lib64/libplc4.so (0x00002aafe4187000)
libplds4.so => /usr/lib64/libplds4.so (0x00002aafe438b000)
libnspr4.so => /usr/lib64/libnspr4.so (0x00002aafe458f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aafe47c9000)
/lib64/ld-linux-x86-64.so.2 (0x00002aafe19dd000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x00002aafe49e4000)
libsepol.so.1 => /lib64/libsepol.so.1 (0x00002aafe4bfd000)
So…if you want such tools to consume memory in a manner that cannot create an imbalance you have to wrap your shell in numactl --interleave so that the memory placement attributes get inherited by such children as dd, tar, scp, ftp, etc … anything that uses significant file system buffer cache…
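The ldd check above is easy to repeat for the other tools. A minimal sketch, assuming ldd is available: `links_libnuma` is a hypothetical helper name that reads ldd output on stdin, and like ldd itself it cannot see a run-time dlopen().

```shell
#!/bin/sh
# Succeed if the ldd output on stdin mentions libnuma.
# (A program could still dlopen() it at run time; ldd cannot show that.)
links_libnuma() {
    grep -q 'libnuma\.so'
}

# Report on a few of the tools named above, where they exist.
if command -v ldd >/dev/null 2>&1; then
    for tool in dd tar scp ftp; do
        path=$(command -v "$tool" 2>/dev/null) || continue
        if ldd "$path" 2>/dev/null | links_libnuma; then
            echo "$tool: links libnuma"
        else
            echo "$tool: no libnuma dependency"
        fi
    done
fi

# For tools with no libnuma dependency, the inherited policy is what counts:
#   numactl --interleave=0,1 /bin/bash
```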
Hi Kevin,
I was not really looking at tools like dd, ftp, etc. I was thinking more that, to get the maximum benefit out of all this nice potential performance, the vendor writing an application has got to code it using the proper memory calls so that the required memory is distributed, instead of just asking for memory the old way.
As you said, an app written with NUMA concepts in mind.
G
Software optimization for NUMA awareness has been on the minds of a very small, select group of industry “outsiders” since the mid 90s…. Then AMD came out with Opteron bringing NUMA to the commodity market…then Intel came out with QPI…
The Sequent diaspora sent some, uh, really NUMA-knowledgeable folks to the 4 corners of the earth… there, they (we) struggled to find ears willing to hear of the importance of correct NUMA software awareness…
Software maturation is a never ending loop of pay now or pay later decisions.
It’s my not-so-noble goal to disseminate some information that will make Oracle users who are adopting Nehalem a little more aware of how the clock works…nothing more than that…
First of all, the article is, I think, a bit cleaner if you skip the reference to xterm and ftp. How about “two shells” and then you drop the caches and then you dd the file 2 times in each shell (where only the first dd will suck it into (local) cache).
But another thing, I don’t get the numa-interleave recommendation. As an application vendor who does not want to invest too much fine tuning on systems, I would recommend running 2 instances (with Java in my case you don’t want to have too big a heap anyway), each on its own NUMA set. As long as they don’t share files (buffers) and fit into local memory I don’t need much numactl interleave magic, right?
Gruss
Bernd
Feedback welcome, but let me point out that an application that can be expressed as “2 Instances” is a cake-walk to deploy nice and neatly in a 2-node NUMA environment. This blog is much more Oracle Database minded than that…
Well, sure, if your app or db does not fit in a zone (or if your guest virtual machine spans multiple zones), then you have to deal with it, and it might be a good idea to just stay away from the architecture if you want to save the additional work.
However, if you consolidate workloads, the more flexible (and perhaps cheaper) NUMA servers are less risky to choose.
Having said that, this specific article was less about Oracle and more about the basic characteristics of Nehalem.
Oh yes some Ora benchmarking will be interesting in the future. Intel asks a lot from their traditional “fast single thread, low concurrency, uniform access” minded software development partners (and Oracle might not be one of those).
Greetings
Bernd
Not necessarily so. I speak of what real life on NUMA means when you are an Oracle shop. All Oracle servers sustain an amount of “non-Oracle” processing. By “non-Oracle” I mean such administrative tasks as file transfers and system file backups and so forth. Oracle DBAs like to know about what sort of interesting things can happen on their Oracle servers when doing such things. If the article is not interesting to you because you focus more on a Java-based application then I understand your feedback.
As an Oracle DBA (oh, hold it, that used to be what I did; now I’m supposed to be part of pre-sales, or is that support, or consulting, very confusing at times),
we DBA-minded guys are generally very curious about anything and everything that runs on our servers and the impact it might have, so yes, Oracle load and non-Oracle load.
G
Is anyone aware of any benchmarks using the flock command in a NUMA environment?
flock? I don’t think a serialized/non-parallel or IO- or sleep-heavy benchmark has any NUMA relevance. Cache bounces might be the only possible stress effect of flock, but that’s a general multi-CPU problem, not a particularly NUMA one. Maybe you want to elaborate on what you expect or why you see a need for that?
Bernd
Bernd,
I agree.