Now I’m glad I did that series about Oracle on Linux, The NUMA Angle. In my post about the difference between NUMA and SUMA and “Cyclops”, I shared a lot of information about the dynamics of Oracle running with all the SGA allocated from one memory bank on a NUMA system. Déjà vu.
Well, we’re at it again. As I point out in Part I and Part II of this series, Oracle implements Automatic Memory Management in Oracle Database 11g with memory mapped files in /dev/shm. That got me curious.
Since I exclusively install my Oracle bits on NFS mounts, I thought I’d sling my 11g ORACLE_HOME over to a DL385 I have available in my lab setup. Oh boy am I going to miss that lab when I take on my new job September 4th. Sob, sob. See, when you install Oracle on NFS mounts, the installation is portable. I install 32bit Linux ports via 32bit server into an NFS mount and I can take it anywhere. In fact, since the database is on an NFS mount (HP EFS Clustered Gateway NAS) I can take ORACLE_HOME and the database mounts to any system running a RHEL4 OS, and that includes RHEL4 x86_64 servers even though the ORACLE_HOME is 32bit. That works fine, except 32bit Oracle cannot use libaio on 64bit RHEL4 (unless you invoke everything under the linux32 command environment, that is). I don’t care about that since I use either Oracle Disk Manager or, better yet, Oracle11g Direct NFS. Note, running 32bit Oracle on a 64bit Linux OS is not supported for production, but for my case it helps me check certain things out.

That brings us back to /dev/shm on AMD Opteron (NUMA) systems. It turns out the only Opteron system I could test 11g AMM on happens to have x86_64 RHEL4 installed, but again, no matter.
[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 3955 MB
[root@tmr6s5 ~]# dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024
1024+0 records in
1024+0 records out
[root@tmr6s5 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3585 MB
node 1 size: 4095 MB
node 1 free: 2927 MB
Uh, that’s not good. I dumped some zeros into a file on /dev/shm and all the memory was allocated from socket 1. Lest anyone forget from my NUMA series (you did read that didn’t you?), writing memory not connected to your processor is, uh, slower:
[root@tmr6s5 ~]# taskset -pc 0-1 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 0,1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real 0m1.116s
user 0m0.005s
sys 0m1.111s
[root@tmr6s5 ~]# taskset -pc 1-2 $$
pid 9453's current affinity list: 0,1
pid 9453's new affinity list: 1
[root@tmr6s5 ~]# time dd if=/dev/zero of=/dev/shm/foo bs=1024k count=1024 conv=notrunc
1024+0 records in
1024+0 records out

real 0m0.931s
user 0m0.006s
sys 0m0.923s
Yes, 20% slower.
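For anyone who wants to check my arithmetic, here is a minimal sketch of the calculation. The function name pct_slower is just something I made up for illustration; the two inputs are the "real" times from the dd runs above.

```shell
#!/bin/sh
# Percentage by which the slower run exceeds the faster run.
# Usage: pct_slower SLOW_SECONDS FAST_SECONDS
# (pct_slower is a hypothetical helper name, not a standard tool.)
pct_slower() {
    awk -v slow="$1" -v fast="$2" \
        'BEGIN { printf "%.1f%%\n", (slow / fast - 1) * 100 }'
}

# The two "real" times reported by time(1) for the dd runs
pct_slower 1.116 0.931
```

That works out to just under 20%. A single pair of runs is not a benchmark, so treat the figure as a rough indication.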
What About Oracle?
So, like I said, I mounted that ORACLE_HOME on this Opteron server. What does an AMM instance look like? Here goes:
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 3587 MB
node 1 size: 4095 MB
node 1 free: 3956 MB

SQL> startup pfile=./amm.ora
ORACLE instance started.

Total System Global Area 2276634624 bytes
Fixed Size 1300068 bytes
Variable Size 570427804 bytes
Database Buffers 1694498816 bytes
Redo Buffers 10407936 bytes
Database mounted.
Database opened.
SQL> !numactl --hardware
available: 2 nodes (0-1)
node 0 size: 5119 MB
node 0 free: 1331 MB
node 1 size: 4095 MB
node 1 free: 3951 MB
Ick. This means that Oracle11g AMM on Opteron servers is a Cyclops. Odd how this allocation came from memory attached to socket 0 when the file creation with dd(1) landed in socket 1’s memory. Hmm…
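Both experiments in this post come down to eyeballing two numactl --hardware listings and subtracting. Here is a rough sketch of how that could be scripted, assuming the "node N free: X MB" line format shown in the output above. The function name node_free_delta and the snapshot file names are my own inventions for illustration.

```shell
#!/bin/sh
# Print the change in free memory per NUMA node between two saved
# `numactl --hardware` snapshots.
# Usage: node_free_delta BEFORE_FILE AFTER_FILE
# (node_free_delta is a hypothetical helper name; it assumes lines of
# the form "node 1 free: 3955 MB" and a non-empty BEFORE_FILE.)
node_free_delta() {
    awk '
        $1 == "node" && $3 == "free:" {
            if (FNR == NR)
                before[$2] = $4          # still reading the first file
            else
                printf "node %s: %+d MB\n", $2, $4 - before[$2]
        }
    ' "$1" "$2"
}

# Intended use (file names are placeholders):
#   numactl --hardware > before.txt
#   ... start the instance, or run the dd test ...
#   numactl --hardware > after.txt
#   node_free_delta before.txt after.txt
```

Fed the two listings from the instance startup above, this reports a drop of 2256 MB on node 0 and 5 MB on node 1, which is the Cyclops in numbers.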
What to do? SUMA? Well, it seems as though I should be able to interleave tmpfs memory and use that for /dev/shm, at least according to the tmpfs documentation. And should is the operative word. I have been tweaking for a half hour trying to get the mpol=interleave mount option to work (with and without the -o remount technique), to no avail. Bummer!
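For the record, here is a sketch of the sort of thing I mean, plus a small check for whether the option actually took. The helper name shm_mpol is hypothetical, and the check assumes the kernel lists an active mpol= setting among the mount options in /proc/mounts, which may not be true of every kernel. The mount command is left commented out since it needs root and, on this box at least, it has not worked for me.

```shell
#!/bin/sh
# Report the NUMA policy option (mpol=...) recorded for a mount point.
# Usage: shm_mpol [MOUNT_POINT] [MOUNTS_FILE]
# (shm_mpol is a hypothetical helper name. It assumes an active mpol=
# setting shows up in the options field of /proc/mounts.)
shm_mpol() {
    awk -v mnt="${1:-/dev/shm}" '
        $2 == mnt {
            found = "none"                  # mounted, no mpol= seen yet
            n = split($4, opts, ",")        # field 4 holds mount options
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^mpol=/) found = opts[i]
        }
        END { print (found == "" ? "not mounted" : found) }
    ' "${2:-/proc/mounts}"
}

# The remount attempt itself (needs root), per the tmpfs documentation:
#   mount -o remount,mpol=interleave /dev/shm
#   shm_mpol /dev/shm
```

If the check prints none after the remount, the option was silently dropped. Even if it does print an mpol= value, I would still confirm with numactl --hardware that a fresh file in /dev/shm really spreads across both nodes.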
If AMD can’t get the Barcelona and/or Budapest Quad-core off the ground (and into high-quality servers from HP/IBM/DELL/Verari), none of this will matter. Actually, come to think of it, unless Barcelona is really, really fast, you won’t be sticking it into your existing Socket F motherboards because that doubles your Oracle license fee (unless you are on standard edition which is priced on socket count). That leaves AMD Quad-core adopters waiting for HyperTransport 3.0 as a remedy. I blogged all this AMD Barcelona stuff already.
Given the NUMA characteristics of /dev/shm, I think I’ll test AMM versus MMM on NUMA, and then test again on SUMA, if I can find the time.
If anyone can get /dev/shm mounted with the mpol option, please let me know because, at times, I can be quite a dolt and I’d love this to be one of them.