Oracle Database File System (DBFS). It’s Not An “Exadata Feature.” – Part I.

Last summer I posted a blog entry about the Oracle Database 11g feature known as Database File System (DBFS). The blog entry I made on the topic was about how suitable DBFS is as a staging file system for ELT/ETL operations. Since then I’ve also offered a webcast covering some DBFS related topics. With all that, one would naturally presume DBFS is an Exadata-only feature. It is not. I do, however,  get frequent requests for Exadata-specific tuning tips for DBFS. There are none.

Deploying DBFS in an Exadata environment is exactly the same as with non-Exadata storage–just faster.

I do feel that DBFS is much more than just a curiosity. Indeed, a file system stored in a database? Really? Yes! What a twist though given the fact that it has always been the other way around!

I aim to make a few blog entries soon to offer some compelling reasons why Oracle shops might do well to look into the future for areas in which DBFS can help solve problems. Indeed, others in the blogosphere have caught on to the fact that DBFS is not an Exadata-only feature as demonstrated by Ronny Egner’s excellent blog post on DBFS.

I’d also like to draw attention to this excellent Power Point presentation on DBFS as well.

So, I’m going to conclude with this as Part I in a series where I’ll cover some setup, diagnostic and performance information about DBFS.

More Words About Oracle Direct NFS On Sun Storage 7410 And Chip Multithreading Technology (CMT) For Oracle Database 11g Release 2.

In my recent blog entry entitled Words About Oracle Direct NFS On Sun Storage 7410 And Chip Multithreading Technology (CMT) For Oracle Database 11g Release 2 I discussed how my old friend Glenn Fawcett and I are studying OLTP performance characteristics of dual-socket Xeon 5500 (Nehalem EP) versus the Sun T5240 (also dual-socket). After I set Glenn up with a nice flexible OLTP workload, Glenn started collecting results and analyzing Oracle Direct NFS (DNFS) performance. Now that he has finished testing with DNFS over 10GbE, he posted a fresh blog entry on the matter that can be found at the following link:

The previous installments on that thread can be found here:

Glenn measured some 90,000 random 8KB physical IOPS over 10GbE and over 1GB/s when scanning disk!

As I’ve been saying for quite some time, “NFS, it ain’t just for kids no more!”

Now, just in case anyone thinks we are wasting our time with some sort of “trick” workload, take my word for it—this workload is grueling. Although I haven’t seen the AWR report from the 90,000+ IOPS 10GbE run, I know enough about the T5240 and this workload to guesstimate that the logical I/O rate for SGA buffer hits would have been on the order of 250,000/second and the CPUs were likely about 60% utilized. I know Glenn reads my blog and he’ll see the trackback from this post so lets see if he can chime in here and correct me on that LIO, CPU guesstimate…Glenn?

Pardon Me, Where Is That Flash Cache? Part II.

In Part I of this series about Database Flash Cache I laid a bit of groundwork on the topic of Database Flash Cache and how it differs from Exadata Smart Flash Cache. I’d like to continue this series by discussing some of the mechanisms associated with Database Flash Cache. But first, I’d like to offer a little more background in case the architecture of Database Flash Cache is still confusing.

Fast Forward To The Past
During the mid-to-late 1990s, and earlier part of this decade,  DBAs would routinely find themselves managing 32-bit Oracle databases on systems with very large physical memory and a 64-bit Operating System. These systems would support a system buffer cache limited only by physical memory. In order to help reduce physical I/O administrators would sometimes use buffered file system files for Oracle. Oracle I/Os were therefore DMAed in the system buffer cache and then copied in kernel mode into the address space of the calling Oracle process. Re-reads from the system buffer cache eliminated physical I/Os. This model came with significant penalties such as the fact that all data in the SGA buffer cache was also present in the system buffer cache (waste of memory). The memory copy operations from the system buffer cache into the SGA buffers (for every physical I/O) was quite a load on systems of the time period as well.

While working in Sequent Database Engineering in the 1990s I was one of the folks pushing for a model quite similar to how Database Flash Cache works to address Oracle caching needs.  Instead of files on Flash, large “buffer cache files” were to reside in main memory. The proposal was such that when a buffer aged from the SGA buffer pool it would be written with a non-synced write to the “buffer cache file.” Disk reads would occur directly into the SGA. If data happened to be in the “buffer cache file” it would be read from RAM thus eliminating a physical I/O. That would have been much better than the double-buffering effect of using buffered file system files. As it turns out, it would have also been quite similar to Database Flash Cache architecture.

None of that ever came to pass, however, because we devised and implemented Indirect Data Buffers as a much better approach than leveraging system buffer cache. However, that didn’t keep me from pursuing NUMA-optimized disk caching technology as my NUMA-optimized external cache patent sort of suggests.

What does this have to do with Database Flash Cache?

Is that a Logical Read or a Physcial Read? Yes!
The real goal of Database Flash Cache is to reduce physical disk reads. It can’t do anything for writes. Most OLTP/ERP applications do more reading than writing so I think this is OK. The interesting turn on this is that physical disk reads from Database Flash Cache occur in the logical I/O code path.

A logical I/O that hits a buffer previously aged to the Flash Cache requires a physical read to bring the buffer back into the address space of the process. Say what? This technology puts a physical read in the logical read code path? Yes.

Think of it this way. If a process doesn’t find the block it needs in the buffer cache, in a non-Database Flash Cache scenario, it has to perform a read from disk. In the case of Database Flash Cache there is a miss in the first level SGA buffer pool followed by a hit in the second level SGA buffer pool and at that point there is a physical disk read fromFlash. The read from Flash will be quite fast and in the case of a PCI Flash card there won’t even be interaction with a central disk controller since each PCI Flash card has its own built-in disk controller.

In the next installment in this series of blog posts I will cover a performance comparison of Database Flash Cache. However, I need to end this installment with a quick look into how the instance accesses the Database Flash Cache file.

It’s Cache, But It’s A File
In the following box I’ll show the initialization parameters and DBWR open files for an instance that is using a Database Flash Cache file.

NOTE: If you hover over the upper right hand side of the box you’ll see a widget that will allow you to view the box in wide screen

SQL> show parameter db_flash

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_flash_cache_file                  string      /home/cache/scratchfile
db_flash_cache_size                  big integer 31G

# ps -ef | grep dbw
oracle    6466     1  0 Dec07 ?        00:00:35 ora_dbw0_OLTP1
oracle    6468     1  0 Dec07 ?        00:00:39 ora_dbw1_OLTP1
oracle   10428     1  0 Oct20 ?        00:02:59 asm_dbw0_+ASM1
oracle   11733 26379  0 08:17 pts/12   00:00:00 grep dbw
# ls -l /proc/6466/fd | grep file
lrwx------ 1 oracle oinstall 64 Dec  7 19:47 22 -> /home/cache/scratchfile
lrwx------ 1 oracle oinstall 64 Dec  7 19:47 33 -> /home/cache/scratchfile

OK, so that showed us the the DBWR process for the OLTP1 instance has two file descriptors open on the Database Flash Cache file. After starting an OLTP workload I straced DBWR to see how it was interacting with the Flash Cache file:


strace -p 6466 2>&1 | head -10
Process 6466 attached - interrupt to quit
pwrite(33, "\6\242\0\0\345]\0\0HP\316\252\0\0\2\4\371\376\0\0\1\0\0\0\345E\1\0\226L\316\252"..., 8192, 29339877376) = 8192
pwrite(33, "\6\242\0\0004o\0\0\247l\316\252\0\0\2\4\2404\0\0\2\0\0\0\351E\1\0005l\316\252"..., 8192, 29339811840) = 8192
pwrite(33, "\6\242\0\0\344\27\0\0_M\316\252\0\0\2\4t\262\0\0\1\0\0\0\345E\1\0\226L\316\252"..., 8192, 29339795456) = 8192
pwrite(33, "\6\242\0\0\2202\3\0\221R\316\252\0\0\2\4\330\244\0\0\1\0\0\0\340E\1\0_L\316\252"..., 8192, 29338681344) = 8192
pwrite(33, "\6\242\0\0\310S\0\0\17O\316\252\0\0\2\4\262z\0\0\1\0\0\0\342E\1\0\345K\316\252"..., 8192, 29338583040) = 8192
pwrite(33, "\6\242\0\0[\236\7\0GU\316\252\0\0\2\4.\301\0\0\1\0\0\0\340E\1\0_L\316\252"..., 8192, 29338492928) = 8192
pwrite(33, "\6\242\0\0\371\277\2\0\364]\316\252\0\0\2\4\371\315\0\0\1\0\0\0\342E\1\0\345K\316\252"..., 8192, 29336969216) = 8192
pwrite(33, "\6\242\0\0\271\353\0\0\332e\316\252\0\0\2\4\220\262\0\0\2\0\0\0\350E\1\0\247e\316\252"..., 8192, 29336190976) = 8192
pwrite(33, "\6\242\0\0003F\v\0\227[\316\252\0\0\2\4~*\0\0\1\0\0\0\340E\1\0_L\316\252"..., 8192, 31513272320) = 8192

Well, that's not good! DBWR is performing synchronous writes to the Flash Cache file. No mystery, I forgot to set filesystemio_options=setall. After doing so, DBWR uses libaio calls to spill into the Flash Cache file (the 5th word in each iocb is the file descriptor). The following snippet shows a blast of 165 writes to the Flash Cache file:


io_submit(47959748861952, 165, {{0x2b9e7fb02630, 0, 1, 0, 33}, {0x2b9e7faa3c70, 0, 1, 0, 33}, {0x2b9e7fa545f8, 0, 1, 0, 33}, {0x2b9e7fb91a70, 0, 1, 0, 33},

{0x2b9e7fa276f0, 0, 1, 0, 33}, {0x2b9e7faf68f8, 0, 1, 0, 33}, {0x2b9e7fafcfb0, 0, 1, 0, 33}, {0x2b9e7fc09f10, 0, 1, 0, 33}, {0x2b9e7fb8b920, 0, 1, 0, 33},

{0x2b9e7fa91880, 0, 1, 0, 33}, {0x2b9e7fb05170, 0, 1, 0, 33}, {0x2b9e7fb80c20, 0, 1, 0, 33}, {0x2b9e7faa6d18, 0, 1, 0, 33}, {0x2b9e7faba140, 0, 1, 0, 33},

[ .. many lines deleted ...]

{0x2b9e7fbb6d20, 0, 1, 0, 33}, {0x2b9e7faa1c00, 0, 1, 0, 33}, {0x2b9e7faeabc0, 0, 1, 0, 33}, {0x2b9e7fa7f490, 0, 1, 0, 33}, {0x2b9e7fb8a380, 0, 1, 0, 33},

{0x2b9e7fad5c90, 0, 1, 0, 33}}) = 165

So we’ve seen the mechanisms used by DBWR to spill into the Flash Cache file. What about foreground processes? The following shows a shadow process performing 8KB synchronous reads from the Flash Cache file. These pread() callsphysical I/O and all, are actually a form of logical I/O. We are talking about cache hits after all.

$ ls -l /proc/32719/fd | grep scratch
lrwx------ 1 oracle oinstall 64 Dec 10 09:19 14 -> /home/cache/scratchfile
[oracle@dscgif05 dbs]$ strace -p 32719 2>&1 | grep pread | head -20
pread(14, "\6\242\0\0\256\215\1\0\337e\316\252\0\0\2\4\304\34\0\0\2\0\0\0\350E\1\0\316e\316\252"..., 8192, 23248191488) = 8192
pread(14, "\6\242\0\0\204\31\6\0\366\367\\\253\0\0\1\4z\321\0\0\1\0\0\0\340E\1\0\366\367\\\253"..., 8192, 17962737664) = 8192
pread(14, "\6\242\0\0\202:\1\0\366S\316\252\0\0\2\4\25\301\0\0\1\0\0\0\342E\1\0\345K\316\252"..., 8192, 32084803584) = 8192
pread(14, "\6\242\0\0\204V\5\0\346\272\\\253\0\0\1\4\207\247\0\0\1\0\0\0\340E\1\0\346\272\\\253"..., 8192, 27747041280) = 8192
pread(14, "\6\242\0\0\337\305\0\0\337e\316\252\0\0\2\4ey\0\0\2\0\0\0\350E\1\0ne\316\252"..., 8192, 30495244288) = 8192
pread(14, "\6\242\0\0= \5\0\2035h\253\0\0\1\4)\223\0\0\1\0\0\0\340E\1\0\2035h\253"..., 8192, 20019929088) = 8192
pread(14, "\6\242\0\0\223\v\1\0,T\337\252\0\0\1\6\316\207\0\0\1\0\0\0\342E\1\0+T\337\252"..., 8192, 28805251072) = 8192
pread(14, "\6\242\0\0\214\256\0\0\332e\316\252\0\0\2\4\5a\0\0\2\0\0\0\350E\1\0ne\316\252"..., 8192, 31322808320) = 8192
pread(14, "\6\242\0\0\241w\4\0#\232m\253\0\0\1\4\334A\0\0\1\0\0\0\340E\1\0#\232m\253"..., 8192, 31416205312) = 8192
pread(14, "\6\242\0\0\304<\1\0\374S\316\252\0\0\2\4>\257\0\0\1\0\0\0\342E\1\0\345K\316\252"..., 8192, 24838955008) = 8192
pread(14, "\6\242\0\0`\201\1\0\332e\316\252\0\0\2\4=*\0\0\2\0\0\0\350E\1\0\316e\316\252"..., 8192, 21697290240) = 8192
pread(14, "\6\242\0\0z\300\5\0\341Tl\253\0\0\1\4\365\315\0\0\1\0\0\0\340E\1\0\341Tl\253"..., 8192, 21481119744) = 8192
pread(14, "\6\242\0\0O\243\1\0\365e\316\252\0\0\2\4\261!\0\0\2\0\0\0\350E\1\0Ue\316\252"..., 8192, 22630391808) = 8192
pread(14, "\6\242\0\0\221Z\r\0\240B]\253\0\0\1\4zK\0\0\1\0\0\0\340E\1\0\240B]\253"..., 8192, 32498204672) = 8192
pread(14, "\6\242\0\0\222Z\r\0\t\335g\253\0\0\1\4$\316\0\0\1\0\0\0\340E\1\0\t\335g\253"..., 8192, 25871138816) = 8192
pread(14, "\6\242\0\0o\245\1\0\365e\316\252\0\0\2\4s\272\0\0\2\0\0\0\350E\1\0Ue\316\252"..., 8192, 30909734912) = 8192
pread(14, "\6\242\0\0\373i\r\0\211\10b\253\0\0\1\4.q\0\0\1\0\0\0\30E\1\0\211\10b\253"..., 8192, 27650859008) = 8192
pread(14, "\6\242\0\0\374i\r\0kTc\253\0\0\1\4v0\0\0\1\0\0\0\340E\1\0kTc\253"..., 8192, 18883878912) = 8192
pread(14, "\6\242\0\0t\357\2\0\2678\344\252\0\0\1\6R3\0\0\1\0\0\0\342E\1\0\345K\316\252"..., 8192, 23501406208) = 8192
pread(14, "\6\242\0\0003\247\1\0\370e\316\252\0\0\2\4\314\35\0\0\2\0\0\0\350E\1\0Ue\316\252"..., 8192, 27493490688) = 8192
$ ps -ef | grep 32719 | grep -v grep
oracle   32719 32700  4 09:17 ?        00:00:05 oracleOLTP1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

Summary
This post in the series aimed to show some of the mechanisms involved when processes interact with the Database Flash Cache file. In Part III I’ll be sharing some performance numbers.

Next Page »


Disclaimer

The views expressed on this blog are my own and do not reflect the views of Oracle Corporation. The views and opinions expressed by visitors on this blog are theirs, not mine.

All information and materials provided here are provided "as-is"; Oracle disclaims all express and implied warranties, including, the implied warranties of merchantability or fitness for a particular use. Oracle shall not be liable for any damages, including, direct, indirect, incidental, special or consequential damages for loss of profits, revenue, data or data use, incurred by you or any third party in connection with the use of this information or these materials.
website metrics