I have been exploring the effect of process migration between CPUs on a multi-core Linux system while running long-duration Oracle jobs. While Linux schedules processes with L2 cache affinity in mind as best it can, I do see migrations on my HP DL585 Opteron 850 box. Cache affinity is important, and routine migrations can slow down long-running jobs. In fact, when a process gets scheduled to run on a CPU different from the one it last ran on, the CPU will stall immediately while the cache is loaded with the process's page tables, regardless of cache warmth. That is, the cache might have pages of text, data, stack and shared memory, but it won't have the right versions of the page tables. Bear in mind that we are talking really small stalls here, but on long-running jobs it can add up.
CPU_BIND
This Linux Journal webpage has the source for a program called cpu_bind that uses the Linux 2.6 sched_setaffinity(2) system call to establish hard affinity for a process to a specified CPU. I'll be covering more of this in my NUMA series, but I thought I'd make a quick blog entry about this tool to get the ball rolling.
After downloading the cpu_bind.c program, it is simple to compile and execute. The following session shows compilation and execution to set the PID of my current bash(1) shell to execute with hard affinity on CPU 3:
$ cc -o cpu_bind cpu_bind.c
$ cpu_bind $$ 3
$ while true
> do
> :
> done
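For the curious, the heart of cpu_bind amounts to just a few lines. Here is a minimal sketch of the idea (my own paraphrase for illustration, not the Linux Journal source):

```c
/* Minimal sketch of hard-affinitying a process to one CPU via
 * sched_setaffinity(2). A paraphrase, not the Linux Journal source. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* pid 0 means the calling process. Returns 0 on success, -1 on error. */
int bind_pid_to_cpu(pid_t pid, int cpu)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);      /* start with an empty CPU set */
    CPU_SET(cpu, &mask);  /* permit exactly one CPU      */
    return sched_setaffinity(pid, sizeof(mask), &mask);
}
```

With that in place, all the main program really has to do is parse the PID and CPU number from its arguments and make the one call.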
The following is a screenshot of top(1) with CPU 3 utilized 100% in user mode by my looping shell.
If you wanted to experiment with Oracle, you could start a long-running job and execute cpu_bind on its PID once it is running, or do what I did with $$ and then invoke sqlplus, for instance. Also, a SQL*Net listener process could be started with hard affinity to a certain CPU and you could connect through it when running a long CPU-bound job. Just a thought, but I'll be showing real numbers in my NUMA series soon.
Give it a try and see what you think.
The NUMA series links are:
I wonder how well this works with the new Intel Core 2 Duo processors, which share a single cache between the cores of each processor. Is there even any latency involved in a core switch other than a pipeline flush?
“In fact, when a process gets scheduled to run on a CPU different from the one it last ran on, the CPU will stall immediately while the cache is loaded with the process's page tables, regardless of cache warmth.”
Indeed. I’ve seen reasonably fast PIII CPUs grind down to 486 speeds because of this cache stalling! I recall alerting to this problem many years ago, in c.d.o.s. or oracle-l. There was something to cope with this in Sequent’s PTS scheduler, wasn’t there? Or was it Pyramid’s OSX? Dang, trying to go back a few years now!…
I’m just wondering here:
Oracle background processes, for example, would have common text areas: it’s the same oracle executable that is loaded for all of them, presumably into the same pages and shared across all processes, like all *n*x OSs seem to do with text.
Only the stack, bss and so on are separate for each process. Those would suffer most from cache affinity problems, rather than text?
I mean: if oracle process 25 is in CPU A and oracle process 56 in CPU B, then 56 re-schedules to A, the cached oracle executable pages in A from process 25 will be reused by 56? Of course the rest of the cached process 56 pages in B will need shuffling and the cache affinity problem creeps in.
IME, this cache affinity/CPU stall problem is one of the main causes of SMP Linux thrashing when seriously overloaded: seen it quite a few times at my previous job. Must try this C program out next time I get my mitts on a convenient Linux box. Thanks for pointing it out!
I wonder if Oracle in their Unbreakable Linux will ever take advantage of this? I can see where it would be of tremendous advantage, for example while sorting rows before the merge phase. Or during parsing and optimization of a given statement. Or indeed anywhere there is a high incidence of buffer cache access.
Maybe driven by the oracle code itself as it could decide when/where to bring affinity into play?
KT,
Good point. If the migration is between Woodcrest cores, the cost should be about nil. I’m working out Oracle workloads that fall prey to cross-socket migrations and will report what I find. In general, though, there is nothing wrong with nailing a long-running job down just to put the possibility out of mind. I’ll blog soon about how to set affinity to a socket as opposed to a single CPU (in my NUMA series).
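Until that post, here is roughly what socket-level binding looks like: the same sched_setaffinity(2) call, just with every core of the socket set in the mask. Which CPU numbers share a socket varies from box to box, so the CPU list passed in is an assumption you must verify on your own hardware:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* Restrict a process to a set of CPUs (e.g., the cores of one socket).
 * The process may still migrate among these CPUs, but nowhere else. */
int bind_pid_to_cpus(pid_t pid, const int *cpus, int ncpus)
{
    cpu_set_t mask;
    int i;

    CPU_ZERO(&mask);
    for (i = 0; i < ncpus; i++)
        CPU_SET(cpus[i], &mask);
    return sched_setaffinity(pid, sizeof(mask), &mask);
}
```

This keeps a job inside one socket's caches and local memory without pinning it to a single core.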
Interesting stuff, Kevin. I’m looking at the output of DTrace scripts on parallel execution jobs for an upcoming presentation and was wondering how important processor affinity is. I can’t say I’m any the wiser, but this is all good food for thought.
Cheers
Noons,
Dang it, I’m glad you are a reader. I spoke too generally. You are right about shared page tables, but Linux implements VM using multiple levels of page tables (as many as 4 depending on the hardware). The root, if you will, is the Page Global Directory (PGD), which is private to each process and pointed to by mm_struct->pgd. The PGD is a physical page frame containing an array of pgd_t’s. When plucking a process to run, the kernel loads mm_struct->pgd into the cr3 register, which flushes the TLB….stall time….all this before any real work gets done.
In the mid 90s we had a Sequent DYNIX/ptx cache affinity bug creep out of nowhere, and it was quickly fixed. It was weird. You could watch the processor LEDs flicker evenly even with a single process running—poison!
I’ll cover your bit about Unbreakable Linux and affinity in my NUMA series…
Thanks again for visiting this blog!
Doug,
Cache affinity is difficult with PQO…unless you’ve worked inside a port that knew how to align producer-consumer slave relationships with NUMA and cache in mind. Guess who did that? 🙂
If you just pick some PQO slaves and slap them onto CPUs with cpu_bind you might find improved performance, since the I/O blend of a PQO workload does tend to stomp on cache a bit. If you have, say, 32 slaves on an 8-core box, just write a script to grab them and bind them to CPUs with cpu_bind to see what it does…4 per core…
you can also use ‘taskset’ (on CentOS/Red Hat in the schedutils package) for binding processes to CPUs.
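For reference, typical taskset usage looks something like the following (the mask form is a hexadecimal bit mask, one bit per CPU; exact option support varies by taskset version):

```shell
# Pin the current shell to CPU 0 using a hex mask (0x1 = binary 0001):
taskset -p 0x1 $$

# Newer taskset versions also accept a CPU list instead of a mask:
taskset -pc 0 $$

# With no new mask argument, taskset reports the current affinity:
taskset -p $$
```

Like cpu_bind, it is just a front end for sched_setaffinity(2), so either tool gets you the same result.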
regards
Kevin, I can see an advantage in binding LGWR to one CPU, as it needs to react very fast. Note, however, that the gruesome “quality” of the Linux VM system, which has 3 different levels (global page tables, segment tables and process page tables), will cost you more CPU cycles than CPU binding will buy you. Most of the systems purchased today are vastly over-configured with regard to CPU capacity, so a production database at full steam will only consume 10% to 15% of the available CPU capacity. In other words, I don’t believe the gain is worth the effort.
Mladen,
Your disdain for Linux as a server operating system is well understood. We should probably look at numbers, though, before we chuck the baby out with the bathwater.
“When plucking a process to run, the kernel loads mm_struct->pgd into the cr3 register, which flushes the TLB….stall time….all this before any real work gets done”
ding-ding, light just went on! Yes, of course.
Thanks a lot for clarifying that!
Not at all, Noons, thanks for visiting!
I have noticed that for a CPU-bound Oracle process that was migrating between CPUs more than once per second, binding it to one CPU made a fairly substantial difference to performance.
Is there any setting in the Linux process scheduler that stops it being so ‘migration friendly’?