I spent the majority of my time in the Oracle Database 11g Beta program testing storage-related aspects of the new release. To be honest, I didn’t even take a short peek at the new Automatic Memory Management feature. As I pointed out the other day, Tanel Poder has started blogging about the feature.
If you read Tanel’s post you’ll see that he points out AMM-style shared memory does not use hugepages. This is because AMM memory segments are memory mapped files in /dev/shm. At this time, the major Linux distributions do not back memory mapped files with hugepages as they do System V-style IPC shared memory. The latter supports the SHM_HUGETLB flag passed to the shmget(2) call. It appears there was an effort to get hugepages support for memory mapped pages by adding MAP_HUGETLB flag support to the mmap(2) call, as suggested in this kernel developer email thread from 2004. I haven’t been able to find out just how far that proposed patch went, however. Nonetheless, I’m sure Wim’s group is more than aware of that proposed mmap(2) support and, if it is really important for Oracle Database 11g Automatic Memory Management, it seems likely there will be a 2.6 kernel patch for it someday. But that raises the question: just how important are hugepages? Is it blasphemy to even ask?
Memory Mapped Files and Oracle Ports
The concept of large page tables is a bit of a porting nightmare. It will be interesting to see how the other ports deal with OS-level support for the dynamic nature of Automatic Memory Management. Will the other ports also use memory mapped files instead of IPC Shared Memory? If so, they too will have spotty large page table support for memory mapped files. For instance, Solaris 9 supported large page tables for mmap(2) pages, but only if it was an anonymous mmap (i.e., a map without a file) or a map of /dev/zero, neither of which would work for AMM. I understand that Solaris 10 supports large page tables for mmap(2) regions that are MAP_SHARED mappings of files, which is most likely how AMM will look on Solaris, but I’m only guessing. Other OSes, like Tru64 (and I’m quite sure most others), don’t support large page tables for mmap(2)ed files. This will be interesting to watch.
Performance, Large Page Table, Etc
I remember back in the mid-90s when Sequent implemented shared large page tables for IPC Shared Memory on our Unix variant, DYNIX/ptx. It was a very significant performance enhancement. For instance, 1024 shadow processes attached to a 1GB SGA required 1GB of physical memory for the page tables alone! That was significant on systems that had very small L2 caches and only supported 4GB of physical memory. Fast forward to today. I know people with Oracle 10g workloads that absolutely seize up their Linux (2.6 kernel) systems unless they use hugepages. Now I should point out that the sites I know of have a significant mix of structured and unstructured data. That is, they call out to LOBs in the filesystem (give me SecureFiles please). So the pathology they generally suffered without hugepages was memory thrashing between Oracle and the OS page cache (filesystem buffer cache). The salve for those wounds was hugepages, since that essentially carves out and locks down the memory at boot time. Hugepages memory can never be nibbled up for page cache. Benefiting from hugepages in this way is actually a by-product, though. The true point of hugepages is not that the memory is reserved at boot time, but that CPUs don’t have to thrash to maintain the virtual-to-physical address translations (the TLB). In general, hugepages are a lot more polite to processor caches and they reduce RAM overhead for page tables. Compared to the mid-1990s, however, RAM is about the least of our worries these days. Manageability is the most important thing, and AMM aims to help on that front.
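The Sequent figure above is easy to reproduce with some back-of-the-envelope arithmetic. The sketch below is only an illustration: the 4 KB base page size, the 4-byte page-table entry, and the 4 MB large page size are assumptions of mine, not details from the DYNIX/ptx implementation, and the function name is made up for the example.

```python
# Rough page-table cost for many processes attached to one shared SGA.
# Assumptions (not from the post): 4 KB base pages, 4-byte page-table
# entries, and no sharing of page tables between processes.

GB = 1024 ** 3
MB = 1024 ** 2
KB = 1024


def page_table_bytes(sga_bytes, page_size, pte_size, processes):
    """Total bytes of leaf page-table entries across all attached processes."""
    entries_per_process = -(-sga_bytes // page_size)  # ceiling division
    return entries_per_process * pte_size * processes


# 1024 shadow processes, 1 GB SGA, ordinary 4 KB pages.
small = page_table_bytes(1 * GB, 4 * KB, 4, 1024)

# Same attach count with a hypothetical 4 MB large page.
large = page_table_bytes(1 * GB, 4 * MB, 4, 1024)

print(f"4 KB pages: {small / GB:.2f} GB of page tables")
print(f"4 MB pages: {large / MB:.2f} MB of page tables")
```

Under those assumptions each process needs 1 MB of page-table entries to map the SGA, so 1024 processes need 1 GB, which matches the number quoted. With larger pages the per-process cost collapses, and that is before any savings from sharing the page tables themselves.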
Confusion
Of all things Oracle and Linux, I think one of the topics that gets mangled the most is hugepages. The terms and knobs to twist run the gamut. There’s hugepages, hugetlb, hugetlbfs, hugetlbpool and so on. Then there are the differences from one Linux distribution and Linux kernel to the next. For instance, you can’t use hugepages on SuSE unless you turn off vm.disable_cap_mlock (need a few double negatives?). Then there is the question of whether to reserve the pages at boot time or later through /proc or sysctl(8). Finally, there is the fact that if you don’t have enough hugepages when you boot Oracle, Oracle will not complain. You just don’t get hugepages. I think Metalink 361323.1 does a decent job explaining hugepages with old and recent Linux in mind, but I never see it explained as succinctly as follows:
- Use OEL 4 or RHEL 4 with Oracle Database 10g or 11g
- Set oracle hard memlock N in /etc/security/limits.conf where N is a value large enough to cover your SGA needs
- Set vm.nr_hugepages in /etc/sysctl.conf to a value large enough to cover your SGA.
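As a rough sketch of how the two numbers in the list above relate, the following computes a page count and a matching memlock value for a given SGA size. It assumes a 2048 kB hugepage (the size that the /proc/meminfo listings in this post report) and relies on the memlock limit in limits.conf being expressed in kB. The function name is hypothetical, and the example SGA size is the Maximum SGA Size that v$sgainfo reports for the MMM configuration later in this post.

```python
# Size vm.nr_hugepages and the memlock limit from an SGA size.
# Assumption: 2048 kB hugepages; check Hugepagesize in /proc/meminfo.


def hugepage_settings(sga_bytes, hugepage_kb=2048):
    """Return (nr_hugepages, memlock_kb) just large enough to cover sga_bytes."""
    hugepage_bytes = hugepage_kb * 1024
    nr_hugepages = -(-sga_bytes // hugepage_bytes)  # ceiling division
    memlock_kb = nr_hugepages * hugepage_kb         # limits.conf memlock is in kB
    return nr_hugepages, memlock_kb


pages, memlock_kb = hugepage_settings(949989376)
print(f"vm.nr_hugepages = {pages}")
print(f"oracle hard memlock {memlock_kb}")
```

These are minimums for a single SGA. In practice you would probably round up a little, since an instance that cannot get enough hugepages silently falls back to regular pages.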
Further Confusion
Audited TPC results don’t help. For instance, on page 125 of this Full Disclosure Report from a recent Oracle10g TPC-C, there are listings of sysctl.conf and lilo showing the setting of the hugetlbpool parameter. That would be just fine if this were a RHEL3 benchmark, but vm.hugetlbpool doesn’t exist in RHEL4.
Performance
I admit I haven’t done a great deal of testing with AMM, but generally a quick I/O-intensive OLTP test on a system with 4 processor cores utilized at 100% speaks volumes to me. So I did just such a test.
Using an order-entry workload accessing the schema detailed in this Oracle Whitepaper about Direct NFS, I tested two configurations:
Automatic Memory Management (AMM). Just like it says, I configured the simplest set of initialization parameters I could:
UNDO_TABLESPACE=rb1
UNDO_MANAGEMENT = AUTO
compatible = 10.1.0.0
control_files = ( /u01/app/oracle/product/11/db_1/rw/DATA/cntlbench_1 )
db_block_size = 4096
MEMORY_TARGET=1500M
db_files = 100
db_writer_processes = 1
db_name = bench
processes = 200
sessions = 400
cursor_space_for_time = TRUE # pin the sql in cache
filesystemio_options=setall
Manual Memory Management (MMM). I did my best to tailor the important SGA regions to match what AMM produced. In my mind, for an OLTP workload the most important SGA regions are the block buffers and the shared pool.
UNDO_TABLESPACE=rb1
UNDO_MANAGEMENT = AUTO
compatible = 10.1.0.0
control_files = ( /u01/app/oracle/product/11/db_1/rw/DATA/cntlbench_1 )
db_block_size = 4096
#MEMORY_TARGET=1500M
db_cache_size = 624M
shared_pool_size=224M
db_files = 100
db_writer_processes = 1
db_name = bench
processes = 200
sessions = 400
cursor_space_for_time = TRUE # pin the sql in cache
filesystemio_options=setall
The following v$sgainfo output shows just how closely configured the AMM and MMM cases were.
AMM:
SQL> select * from v$sgainfo ;
NAME BYTES RES
-------------------------------- ---------- ---
Fixed SGA Size 1298916 No
Redo Buffers 11943936 No
Buffer Cache Size 654311424 Yes
Shared Pool Size 234881024 Yes
Large Pool Size 16777216 Yes
Java Pool Size 16777216 Yes
Streams Pool Size 0 Yes
Shared IO Pool Size 33554432 Yes
Granule Size 16777216 No
Maximum SGA Size 1573527552 No
Startup overhead in Shared Pool 83886080 No
NAME BYTES RES
-------------------------------- ---------- ---
Free SGA Memory Available 0
MMM:
SQL> select * from v$sgainfo ;
NAME BYTES RES
-------------------------------- ---------- ---
Fixed SGA Size 1302592 No
Redo Buffers 4964352 No
Buffer Cache Size 654311424 Yes
Shared Pool Size 234881024 Yes
Large Pool Size 0 Yes
Java Pool Size 25165824 Yes
Streams Pool Size 0 Yes
Shared IO Pool Size 29360128 Yes
Granule Size 4194304 No
Maximum SGA Size 949989376 No
Startup overhead in Shared Pool 75497472 No
NAME BYTES RES
-------------------------------- ---------- ---
Free SGA Memory Available 0
The server was a HP DL380 with 4 processor cores and the storage was an HP EFS Clustered Gateway NAS. Before each test I did the following:
- Restore Database
- Reboot Server
- Mount NFS filesystems
- Boot Oracle
Before the MMM case I set vm.nr_hugepages=600 and after the database was booted, hugepages utilization looked like this:
$ grep Huge /proc/meminfo
HugePages_Total: 600
HugePages_Free: 145
Hugepagesize: 2048 kB
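To turn that listing into something comparable with the SGA size, here is a small sketch that parses the three fields and reports how much memory the in-use hugepages represent. The parsing helper is hypothetical and only handles the fields shown above.

```python
# Work out how many hugepages are in use from /proc/meminfo-style text.
# The sample text is the listing from the MMM run in this post.

SAMPLE = """\
HugePages_Total: 600
HugePages_Free: 145
Hugepagesize: 2048 kB
"""


def hugepages_in_use(meminfo_text):
    """Return (pages_in_use, kb_in_use) from the Huge* lines of meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        if not line.startswith("Huge"):
            continue
        key, _, rest = line.partition(":")
        fields[key.strip()] = int(rest.split()[0])  # drop the trailing "kB"
    used = fields["HugePages_Total"] - fields["HugePages_Free"]
    return used, used * fields["Hugepagesize"]


used_pages, used_kb = hugepages_in_use(SAMPLE)
print(f"{used_pages} hugepages in use = {used_kb / 1024:.0f} MB")
```

For the listing above this gives 455 pages, or 910 MB, which lines up with the 949989376-byte (about 906 MB) Maximum SGA Size that v$sgainfo reports for the MMM case.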
So, given all these conditions, I believe I am making an apples-to-apples comparison of AMM to MMM, where AMM does not get hugepages support but MMM does. I think this is a pretty stressful workload, given the size of the server, since I am maxing out the processors and performing a significant amount of I/O.
Test Results
OK, so this is a very contained case and Oracle Database 11g is still only available on x86 Linux. I hope I can have the time to do a similar test with more substantial gear. For the time being, what I know is that losing hugepages support for the sake of gaining AMM should not make you lose sleep. The results measured in throughput (transactions per second) and server statistics are in:
Configuration | OLTP Transactions/sec | Logical IO/sec | Block Changes/sec | Physical Read/sec | Physical Write/sec |
AMM | 905 | 36,742 | 10,195 | 4,287 | 2,817 |
MMM | 872 | 36,411 | 10,101 | 4,864 | 2,928 |
Looks like 4% in favor of AMM to me, and that is likely attributable to the 13% more physical I/O per transaction the MMM case had to perform. That part of the results has me baffled for the moment, since both cases have the same buffering, as the v$sgainfo output above shows. Well, yes, there is a significant difference in the amount of Large Pool in the MMM case, but this workload really shouldn’t have any demand on Large Pool. I’m going to investigate that further. Perhaps an interesting test would be to reduce the amount of buffering the AMM case gets to force more physical I/O. That could bring it more in line. We’ll see.
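For anyone who wants to check the percentages, the sketch below recomputes them from the results table. It assumes that "physical I/O per transaction" means physical reads plus physical writes per second divided by transactions per second. That reading is my interpretation of the table, not something the test harness reported directly.

```python
# Recompute the AMM vs. MMM deltas from the results table.
# Assumption: physical I/O per transaction = (reads/sec + writes/sec) / TPS.

results = {
    "AMM": {"tps": 905, "reads": 4287, "writes": 2817},
    "MMM": {"tps": 872, "reads": 4864, "writes": 2928},
}


def physical_io_per_txn(row):
    """Physical reads plus writes performed per transaction."""
    return (row["reads"] + row["writes"]) / row["tps"]


tps_gain = results["AMM"]["tps"] / results["MMM"]["tps"] - 1
io_extra = (physical_io_per_txn(results["MMM"])
            / physical_io_per_txn(results["AMM"]) - 1)

print(f"AMM throughput advantage: {tps_gain:.1%}")
print(f"Extra physical I/O per transaction under MMM: {io_extra:.1%}")
```

Under that interpretation the throughput gap comes out just under 4% and the extra physical I/O per transaction just under 14%, consistent with the rounded figures quoted above.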
Summary
I’m not saying hugepages are no help across the board. What I am saying is that I would weigh heavily the benefits AMM offers, because losing hugepages might not make any difference for you at all. If it is, in fact, a huge problem across the board, then it looks like there has been work done in this area for the 2.6 kernel and it seems reasonable that such a feature (hugepages support for mmap(2)) could be implemented. We’ll see.
Hi Kevin
Out of curiosity… Why did you set compatible to 10.1.0.0?
Take care,
Chris
Christian,
Good question…too broad a stroke with the cut and paste I guess…would you wager a guess that it would change OLTP performance?
Right. Now fire off a workload which doesn’t know what a bind variable is but which needs to do a lot of large repeated queries differing only by the value of one or two variables.
In 10g, at any rate, AMM will start shovelling memory by the bucketload to the Shared Pool to deal with a near 100% miss ratio on the library cache, even though manually you know giving more memory to the buffer cache is the ‘right’ thing to do because the queries all hit one large table which could be cached quite effectively…
Yup, I know I should re-write the SQL so it uses bind variables. Still, given that I can’t, AMM is seriously bad news.
None of which refutes (or is intended to refute) one jot of what you wrote. Just that, AMM is seriously iffy unless you are right under the hump of the normal bell curve. Get even slightly weird and AMM is almost certainly the last thing you want working “for” you.
Howard,
Good feedback. I intend to torture it a bit more. I’m looking for good things beyond what 10g managed to offer… fingers crossed as they say.
Interesting as always, but do you have any feeling in how those results would vary with increasing size of SGA ? 624M seems a bit tiddly, if you are throwing 16, or 32 GB at your SGA, things may swing back in favour of using hugepages, no?
jason.
Howard:
You could still use AMM if you use “CURSOR_SHARING=[similar|force]” to get round your bind variable problem. I know there are issues associated with CURSOR_SHARING, but I’ve been forced to use it for a couple of 3rd party apps and it’s worked a treat.
Cheers
Tim…
Kevin
> would you wager a guess that it would change OLTP performance?
Honestly I don’t know. From one side I never really played with it. From the other side Oracle provides almost no information about the effects of COMPATIBLE. Some features are activated, some others not. For example you used MEMORY_TARGET… But, what about other performance features (like mutexes) introduced in later versions?
Therefore to do a performance comparison I would not specify it because the test may be biased. And since few people run databases with COMPATIBLE set… One may argue that is due to that.
Best,
Chris
Jason: This is 32bit Oracle. I can go larger but constrained by address space.
Tim: Thanks for the info
Chris: I’ll give it a purified whirl. No problem.
As you mentioned when you talked about the Sequent system with 1024 shadow processes connecting to a 1GB SGA using another 1GB of pagetable memory, hugepages have huge advantages with huge numbers of users. Think huge! I suspect that you didn’t max out the processes=200 parameter during your test, so you may not have noticed the advantage, despite wasting 300MB of memory on unused hugepages.
I know this is an old post but …
“For instance, you can’t use hugepages on SuSE unless you turn off vm.disable_cap_mlock”
Do you mean vm.disable_cap_mlock=0? Does it depend on oracle version? SuSE version?
With SuSE 9.3 and 9.2.0.8 and 10.2.0.4 we are running vm.disable_cap_mlock=1 and huge pages. I am wondering if your comment applies to a specific version or am I missing something else? Thanks.
Hi Gabe,
Are you sure you’re getting hugepages in use? If so, then the information I blogged would be relevant to versions older than SuSE 9. I haven’t touched a SuSE box in nearly 4 years. Oracle MOS notes are pretty clean on disable_cap_mlock, so I’d recommend trolling through MOS on the matter for up-to-date stuff.
cat /etc/SuSE-release
SUSE LINUX Enterprise Server 9 (x86_64)
VERSION = 9
PATCHLEVEL = 3
>cat /proc/sys/vm/nr_hugepages
2700
~>cat /proc/sys/vm/disable_cap_mlock
1
>grep Huge /proc/meminfo
HugePages_Total: 2700
HugePages_Free: 829
Hugepagesize: 2048 kB
The used huge pages match SGA. This is from a server with one DB. (I know it was really an overkill for this DB to setup huge pages.)
365607.1 lists disable_cap_mlock=1 as a requirement for 10g & SLES9 (not in relation to hugepages, just a general requirement). And there is a similar note for 9i and SLES9.
So, I am guessing you referred to an older version of SuSE (?)
Gabe,
To be honest, I cannot remember. That was quite some time ago and relevant at the time. Sorry if I’ve caused confusion.
It’s so helpful. In my case I have 64 GB of RAM, so I want to go for MMM. How should I configure my kernel parameters and everything?