Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part I

In my recent blog post entitled Standard File Utilities with Direct I/O, I covered the concept of using direct I/O filesystems for storing files to eliminate the overhead of caching them. Consider files such as archived redo logs. It makes no sense to pollute memory with spooled archived redo logs. Likewise, if you compress archived redo, it makes little sense to have those disk blocks hanging out in main memory. However, that post also discusses the fact that most operating system file tools do their work by issuing ridiculously small I/O requests. If you haven’t read that blog entry yet, I encourage you to do so.

The blog entry seeded a lively thread of comments—some touched on theory, others were evidence of entirely missing the point. One excellent comment came in that refreshed some long-lost memories of Solaris. The reader wrote:

For what it’s worth, cp on Solaris 10 uses mmap64 in 8MB chunks to read in data and 8MB write()s to write it out.

I did in fact know that, but it had been a while since I had played around on Solaris.

The Smell Test
I wasn’t fortunate enough to have genius passed to me through genetics. Oh how it seems life would be simpler if that were the case. But it isn’t. So, my motto is, “99% Perspiration, 1% Inspiration.” To that end, I learned early on in my career to develop the skills needed to spot something that cannot be correct, before bothering myself with whether or not it is in fact correct. There is a subtle difference. As soon as the Solaris mmap()-enabled cp(1) thing cropped up, I spent next to no time at all pondering how that must certainly be better than normal reads (e.g., pread()), since it failed my smell test.

Ponder Before We Measure
What did I smell? There is just no way that walking through the input file by mapping, unmapping and remapping 8MB at a time could be faster than simply reusing a heap buffer. No way at all. After all, mmap() has to make VM adjustments that are not exactly cheap so taxing every trip to disk with a vigorous jolt of VM overhead makes little sense.

There must have been some point in time when a cp(1) implemented with mmap() was faster, but I suspect that was long ago. For instance, perhaps back in the day before pread(2)/pwrite(2). Before these calls, positioning and reading a file required 2 kernel dives (one to seek and the other to read). Uh, but hold it. We are talking about cp(1) here, not random reads, and each successful read on the input file automatically advances the file pointer. That is, the input work loop would never have been encumbered with a seek and read pair. Hmmm. Anyway, we can guess all day long why the Solaris folks chose to have cp(1) use mmap(2) as its input workhorse, but in the end we’ll likely never know.
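To make the “2 kernel dives” point concrete, here is a minimal Python sketch (Unix only) of the three ways to get bytes out of a file descriptor. The function names are made up for illustration; os.lseek, os.read and os.pread are thin wrappers over the system calls of the same names.

```python
import os

def read_at_with_seek(fd, offset, size):
    """Positioned read the old way: two system calls (lseek, then read)."""
    os.lseek(fd, offset, os.SEEK_SET)
    return os.read(fd, size)

def read_at_with_pread(fd, offset, size):
    """Positioned read in one system call; the file offset is not moved."""
    return os.pread(fd, size, offset)

def read_sequentially(fd, size):
    """Sequential read, as a copy loop does it: no seek is ever needed
    because every successful read() advances the file offset."""
    chunks = []
    while True:
        chunk = os.read(fd, size)
        if not chunk:          # empty result means EOF
            break
        chunks.append(chunk)
    return chunks
```

A copy utility only ever needs the third form, which is why the pread() theory does not hold up as a reason for mmap().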

A Closer Look
In the following truss output, the Solaris cp(1) is copying a 5.8MB file to an output file called “xxx.” After getting a file descriptor for the input file (fd 3), the output file is created (fd 4). Next, mmap64() maps the input file (all 6161922 bytes at once, since it is smaller than the 8MB operation limit). Then a single write call writes all 6161922 bytes from the mmap()ed region out to the output file, and the region is unmapped.

open64("SYSLOG-4", O_RDONLY) = 3
creat64("xxx", 0777) = 4
stat64("xxx", 0x00028640) = 0
fstat64(3, 0x000286D8) = 0
mmap64(0x00000000, 6161922, PROT_READ, MAP_SHARED, 3, 0) = 0xFEC00000
write(4, " F e b 5 1 0 : 3 6".., 6161922) = 6161922
munmap(0xFEC00000, 6161922) = 0

Of course if the file happened to be larger than 8MB, cp(1) would unmap and then remap the next chunk and on it would proceed in a loop until the input EOF is reached. That is a lot more “moving parts” than simply calling read(2) over and over clobbering the contents of a buffer allocated at the onset of the copy operation—without continual agitation of the VM subsystem with mmap().
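For readers who want to see the shape of that loop, here is a rough Python sketch (Unix only) of a map, write, unmap copy. It is not the Solaris cp(1) source, which I have not reproduced here; the function name and the chunk parameter are my own, and the only things taken from the truss output are the PROT_READ/MAP_SHARED mapping, the single write() per window and the 8MB window size.

```python
import mmap
import os

CHUNK = 8 * 1024 * 1024  # 8MB window, the size the reader reported

def copy_with_mmap(src_path, dst_path, chunk=CHUNK):
    """Copy src to dst by mapping one window of the input at a time.

    chunk must be a multiple of mmap.ALLOCATIONGRANULARITY so that every
    window offset is legal. Returns the number of bytes copied.
    """
    src = os.open(src_path, os.O_RDONLY)
    try:
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            size = os.fstat(src).st_size
            offset = 0
            while offset < size:
                length = min(chunk, size - offset)
                # Map the next window: this is the VM work paid on every pass.
                window = mmap.mmap(src, length, flags=mmap.MAP_SHARED,
                                   prot=mmap.PROT_READ, offset=offset)
                try:
                    with memoryview(window) as view:
                        written = 0
                        while written < length:
                            # write() straight out of the mapped region,
                            # retrying if the kernel accepts only part of it
                            written += os.write(dst, view[written:])
                finally:
                    window.close()  # the munmap() step
                offset += length
            return size
        finally:
            os.close(dst)
    finally:
        os.close(src)
```

Every trip around the loop creates and tears down a mapping, which is exactly the “moving parts” I am grumbling about.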

I couldn’t imagine how cp(1) using mmap() would be any faster than read(2)/write(2). But then, it actually only replaces the input read with an mmap() while still using write(2) on the output side. I couldn’t imagine how replacing just the input portion with mmap() would be faster than a cp(1) that uses a static heap buffer with read/write pairs. Moreover, I couldn’t picture how the mmap() approach would be easier on resources.
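By contrast, the read/write approach looks roughly like the following Python sketch. Again, this is an illustration and not the GNU cp(1) source; the function name and the 8MB default are assumptions on my part. The point is that the buffer is allocated once and then clobbered on every pass.

```python
BUFSIZE = 8 * 1024 * 1024  # assumed 8MB transfer size

def copy_with_buffer(src_path, dst_path, bufsize=BUFSIZE):
    """Copy src to dst with read/write pairs through one reusable buffer.

    Returns the number of bytes copied.
    """
    buf = bytearray(bufsize)   # allocated once, reused for every read
    view = memoryview(buf)
    total = 0
    # buffering=0 gives raw file objects, so each call maps onto one
    # read() or write() system call with no extra Python-level buffering.
    with open(src_path, "rb", buffering=0) as src, \
         open(dst_path, "wb", buffering=0) as dst:
        while True:
            n = src.readinto(buf)   # overwrite the same buffer
            if not n:               # 0 means EOF
                break
            written = 0
            while written < n:
                # retry if the kernel accepts only part of the buffer
                written += dst.write(view[written:n])
            total += n
    return total
```

No mapping is created or destroyed inside the loop; the only per-pass cost is the read and the write themselves.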

Measure Before Blogging
Well, not exactly me, since I don’t have much Solaris gear around here. I asked Padraig O’Sullivan to compare the stock cp(1) of Solaris 10 to a GNU cp(1) with the modification I discuss in this blog entry. The goal at hand was to test whether the stock cp(1), constantly mapping and unmapping the input file, is somehow faster or gentler on processor cycles than starting out with a heap buffer and reusing it. The latter is exactly what GNU cp(1) does, of course. Padraig asked:

One question about the benchmark you want me to run (I want to make sure I get all the data you want) - is this strategy ok?

1. Mount a UFS filesystem with the forcedirectio option
2. Create a 1 GB file on this filesystem
3. Copy the file using the standard cp(1) utility and record timing statistics
4. Copy the file using your modified cp8M utility and record timing statistics

Let me know if this is ok or if you want more information for the benchmark.

 

There was something else. I wanted a fresh system reboot just prior to each copy operation to make sure there were no variables. Padraig had the following to report:

[…] manage to run the benchmark in the manner you requested this morning […] Below is the results. I rebooted the machine before performing the copy each time.

# ls -l large_file
-rw-r--r-- 1 root root 1048576000 Mar 5 10:25 large_file

# time /usr/bin/cp large_file large_file.1

real 2m17.894s
user 0m0.001s
sys 0m10.853s

# time /usr/bin/cp8m large_file large_file.2

real 1m57.932s
user 0m0.002s
sys 0m8.057s

Look, I’m just an old Sequent hack and yet the results didn’t surprise me. The throughput increased roughly 17%, from 7.3MB/s to 8.5MB/s. What about resources? The tweaked GNU cp8M used roughly 26% less kernel mode processor time to do the same task. That’s not trivial since we didn’t actually eliminate any I/O. What’s that? Yes, cp8M reduces the wall clock time it takes to copy a 1000MB file by roughly 14% without eliminating any physical I/O!
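For anyone who wants to check the arithmetic, here is a small Python sketch that works from the unrounded timings Padraig reported. The variable names are mine, and MB here means 1024*1024 bytes. Note that a throughput gain and a wall clock saving are different ratios, so they do not come out to the same percentage.

```python
FILE_BYTES = 1048576000          # size of large_file from ls -l
MB = 1024 * 1024

def seconds(minutes, secs):
    """Convert a time(1) style 'XmY.YYYs' reading to seconds."""
    return 60 * minutes + secs

stock_real = seconds(2, 17.894)  # /usr/bin/cp, mmap() based
cp8m_real = seconds(1, 57.932)   # cp8m, read/write based
stock_sys = 10.853
cp8m_sys = 8.057

stock_rate = FILE_BYTES / MB / stock_real        # about 7.3 MB/s
cp8m_rate = FILE_BYTES / MB / cp8m_real          # about 8.5 MB/s

throughput_gain = cp8m_rate / stock_rate - 1     # about 0.17
wall_clock_saving = 1 - cp8m_real / stock_real   # about 0.14
sys_saving = 1 - cp8m_sys / stock_sys            # about 0.26

print(f"stock cp: {stock_rate:.2f} MB/s, cp8m: {cp8m_rate:.2f} MB/s")
print(f"throughput gain:   {throughput_gain:.1%}")
print(f"wall clock saving: {wall_clock_saving:.1%}")
print(f"sys time saving:   {sys_saving:.1%}")
```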

Controversy
Yes, blogging is supposed to be a bit controversial. It cultivates critical thought and the occasional gadfly, fly-by commentary I so dearly appreciate. Here we have a simple test that shows a “normal” cp(1) is slightly faster and significantly lighter on resources than the magical mmap()-enabled cp(1) that ships with Solaris. Does that mean there isn’t a case where the mmap() style is better? I don’t know. It could be that some really large Solaris box with dozens of concurrent cp(1) operations would show the value of the mmap() approach. If so, let us know.

What Next?
I hope someone will volunteer to test cp8M on a high-end Solaris system. Maybe the results will help us understand why cp(1) uses mmap() on Solaris for its input file. Maybe not. Guess which way I’d wager.

4 Responses to “Copying Files on Solaris. Slow or Fast, It’s Your Choice. Part I”


  1. Mike, March 16, 2007 at 10:22 am

    Perhaps a fairly obvious statement this, but notice the use of MAP_SHARED on the mmap call? - (I suspect you’ve spotted that already). This means that multiple processes can attach to the same memory mapped file simultaneously.

    That’s not to say that they all need to be “cp”’s - anything using mmap() on the same file at about the same time will yield a benefit - the 8MB chunk paged in by mmap should only be later reclaimed by the pagescanner (or when the last process detaches?).

    Also - I think the mmap() method has the potential for using less memory (each process doesn’t necessarily need to have a large heap buffer) - probably not a concern these days, but then, I suspect that “cp” was written an awful long time ago.

    Part of the motivation behind implementing memory mapped I/O was to avoid the double-buffering associated with read() calls.. (read from hardware into kernel space, then copy the data to the buffer to userspace) - it does, of course come with an associated cost (increased memory management overhead), but then, that’s the trade-off, isn’t it..?

    There may be some cases where mmap() is more efficient, but I think it would be difficult to suggest that there aren’t any cases where read() is best.

  2. kevinclosson, March 16, 2007 at 6:41 pm

    Mike,

    Your point about the kernel bcopy from UFS read buffers to the heap buffer in the address space of the cp(1) process is a good one, but this is a forcedirectio case. I’ve got some readers comparing on normal mounts and the results could indeed be a lot different in that case.

    Since this is an Oracle blog, I would naturally go with the forcedirectio comparison first. It will be interesting to see with a normal UFS mount.

    I’ve got a $2 bet that the MAP_SHARED is only there to facilitate copying an already mmapped file…the odds of a process jumping in and sharing an 8MB map that only lives for the duration of an I/O in and an I/O out seem pretty slim to me…but then that is 8MB twice…hmmm…I guess that 8MB mmap could exist for as much as 2-3 seconds if the I/O is headed for a single, simple drive. Sounds like a race just to share an 8MB map to me.

  3. Mike, March 18, 2007 at 8:03 pm

    > but this is a forcedirectio case

    Apologies if I’m misinterpreting your response here, but I don’t think this matters .. I wasn’t talking about the page cache (which would be effectively disabled on a forcedirectio filesystem), I meant the kernel read buffer - the kernel won’t/can’t write directly to the userspace buffer, so when it receives the data in from the VFS module (in this case, UFS), it needs to store it somewhere (directio_buf_cache kmem).

    As I understand it, when working *without* directio, it’ll also pass through the page cache, which will improve some kinds of I/O, but that’s an aside that I wasn’t really considering here.

    Using read() and directio implies a very long, single-threaded operation (user process calls read(), kernel requests data from vfs, waits for result, copyout()’s the results back to the userland process).

    Memory mapped I/O (even on a directio filesystem) suggests a more “multithreaded” approach, because the memory management system will be paging in the data (mmmm - would it prefetch on a directio filesystem?), while the user thread is getting on with its work (although this is a pretty simplistic example, because “cp” is immediately write()’ing the full buffer size out again).

    *disclaimer - I fully reserve the right to be mistaken on any of the above - this is all based on my understanding, and happy to be otherwise educated ;)

  4. aashok, July 4, 2008 at 11:26 am

    ls -l allcp-test.dat
    -rw-r--r-- 1 root root 5287847424 Jul 4 13:08 allcp-test.dat

    #time /opt/sfw/bin/cp allcp-test.dat /export/home/sunteam/

    real 3m20.497s
    user 0m1.099s
    sys 0m56.181s

    (cp is from gnu using read/write method)
    #time /usr/bin/cp allcp-test.dat /export/home/sunteam/

    real 2m11.940s
    user 0m0.004s
    sys 0m26.583s
    (cp is from solaris internal using mmap/munmap method)
