Using SSD for a temp tablespace on Exadata

I seem to be getting a lot of surprising performance results lately on our X-2 quarter rack Exadata system, which is good – the result you don’t expect is the one that teaches you something new.

This time, I was looking at using a temporary tablespace based on flash disks rather than spinning disks.  In the past – using Fusion IO PCI cards, I found that using flash for temp tablespace was very effective in reducing the overhead of multi-pass sorts:

image

See (http://guyharrison.squarespace.com/ssdguide/04-evaluating-the-options-for-exploiting-ssd.html)

However, when I repeated these tests for Exadata, I got very disappointing results.  SSD based temp tablespace actually lead to marginally worse performance:

image

Looking in depth at a particular point (the 500K SORT_AREA_SIZE point), we can see that although the SSD based temp tablespace has marginally better read times, it involves a significantly higher write overhead:

image

I can understand the higher read overhead (at least partially).  It’s Yet Another time when sequential write operations to an SSD device have provided disappointing performance.  However, it’s strange to see such poor read performance.  How can a spinning disk serve blocks up at effectively the same latency an SSD?

So I dumped all the direct path read waits from a 10046 trace and plotted them logarithmically:

image

We can see in this chart, that the SDD based tablespace suffers from a small “spike” of high latencies between 600-1000 us (eg .6-1 ms).  These are extremely high latencies for an SSD !  What could be causing them?  Garbage collection being caused by the almost writes to the temp tablespaces?  There was negliglbe concurrent activity on the system and the table concerned had flash cache disabled so for now that is my #1 theory. 

For that matter, why are the HDD reads times so low?  An average disk read latency of 500 us for a spinning disk is unreasonably low, is the storage cell somehow buffering temporary tablespace IO?  

As always I’m wondering if there’s someone with more expertise in Exadata internals who could shed some light on all of this!

Can the Exadata Smart Flash Cache slow smart scans?

I’ve been doing some work on the Exadata Smart Flash Cache recently and came across a situation in which setting CELL_FLASH_CACHE to KEEP will significantly slow down smart scans on a table.

If we create a table with default settings, then the Exadata Smart Flash Cache (ESFC) will not be involved in smart scans, since by default only small IOs get cached.  If we want the ESFC to be involved, we need to set the CELL_FLASH_CACHE to KEEP.  Of course, we don’t expect immediate improvements, since we expect that the next smart scan will need to populate the cache before subsequent scans can benefit. 

HOWEVER, what I’m seeing in practice is that the next smart scan following an ALTER TABLE … STORAGE(CELL_FLASH_CACHE KEEP) is significantly degraded, while subsequent scans get a performance boost.  Here’s an example of what I observe:

image

 

The big increase in CELL IO time is in an increase in both the number and latency of cell smart table scans.  The wait stats for the first scan with a default setting looked like this:

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  gc cr disk read                                 1        0.00          0.00
  cell single block physical read                 2        0.01          0.01
  row cache lock                                  2        0.00          0.00
  gc cr grant 2-way                               1        0.00          0.00
  SQL*Net message to client                    1021        0.00          0.00
  reliable message                                1        0.00          0.00
  enq: KO - fast object checkpoint                2        0.00          0.00
  cell smart table scan                        9322        0.14          7.60
  SQL*Net message from client                  1021        0.00          0.02

For the first scan with KEEP cache it looked like this:

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                    1021        0.00          0.00
  reliable message                                1        0.00          0.00
  enq: KO - fast object checkpoint                2        0.00          0.00
  cell smart table scan                       14904        1.21         33.37
  SQL*Net message from client                  1021        0.00          0.02

Looking at the raw trace file didn’t help – it just shows a bunch of lines like this, with only a small number (3 in this case) of unique cellhash values… I couldn’t see a pattern:

WAIT #… : nam='cell smart table scan' ela= 678 cellhash#=398250101 p2=0 p3=0 obj#=139207 tim= …

I’m at a loss to understand why there would be such a high penalty for the initial smart scan with CELL_FLASH_CACHE KEEP setting.  You expect some overhead from constructing and storing the result set blocks in the cache, but an IO penalty of 200=300% seems way too high.   Anybody seen anything like this or have a clear explanation?

Test script is here, and formatted tkprof here

Redo on SSD: effect of redo size (Exadata)

Of all the claims I make about SSD for Oracle databases, the one that generates the most debate is that placing redo logs on SSD is not likely to be effective.  I’ve published data to that effect in particular see  Using SSD for redo on Exadata - pt 2 and 04 Evaluating the options for Exploiting SSD.

I get a lot of push back on these findings – often on theoretical grounds from Flash vendors (“our SSD use advanced caching and garbage collection that support high rates of sequential IO”) or from people who say that they’ve used flash for redo and it “worked fine”.

Unfortunately, every single test I do comparing performance of redo on flash and HDD shows redo with little or no advantage and in some cases with a clear disadvantage.    

One argument for flash SSD that I’ve heard is that while for the small transactions I use for testing  flash might not have the advantage but for “big” redo writes – such as those associated with LOB updates – flash SSD would work better.  The idea is that the overhead of garbage collection and free page pool processing is less with big writes since you don’t hit the same flash SSD pages in rapid succession as you would with smaller writes.    On the other hand a reader who knows more about flash than I do (flashdba.com) recently commented:  “in foreground garbage collection a larger write will require more pages to be erased, so actually will suffer from even more performance issues.”

It’s taken me a while to get around to testing this, but I tried on our Exadata X-2 recently with a test that generates a variable amount of redo and then commits.  The relationship between the size of the redo and redo log sync time is shown below

image

 

I’m now putting on my flame retardant underwear in anticipation of some dispute over this data….   but,  this suggests that while SSD and HDD (at least on Exadata) are about at parity for small writes, flash degrades much more steeply than HDD as the size of the redo entry increases.  Regardless of whether the redo is on flash or HDD, there’s a break at the 1MB point which corresponds to log buffer flush threshold.  When a redo entry is only slightly bigger than 1MB then the chances are high that some of  it will have been flushed already – see Redo log sync time vs redo size for a discussion of this phenomenon.

The SSD redo files were on an ASM disk group carved out of the Exadata flash disks - see Configuring Exadata flash as grid disk to see how I created these.  Also the redo logs were created with 4K blocksize as outlined in Using SSD for redo on Exadata - pt 2.   The database was in NoarchiveLog mode.  Smart flash logging was disabled.  As far as I can determine, there was no other significant activity on the flash disks (the grid disks were supporting all the database tablespaces, so if anything the SSD had the advantage). 

Why are we seeing such a sharp dropoff in performance for the SSD as the redo write increases in size?   Well one explanation was given by  flashdba in this comment thread.  It has to do with understanding what happens when a write IO which modifies an existing block hits a flash SSD.   I tried to communicate my limited understanding of this process in Fundamentals of Flash SSD Technology.   Instead of erasing the existing page, the flash controller will pull a page off a “free list” of pages and mark the old page as invalid.   Later on, the garbage collection routines will reorganize the data and free up invalid pages.  In this case, it’s possible that no free blocks were available because garbage collection fell behind during the write intensive workload.  The more blocks written by LGWR, the more SSD pages had to be erased during these un-optimized writes and therefore the larger the redo log write the worse the performance of the SSD.

Any other theories and/or observations?  

I hope soon to have a Dell system with Dell express flash so as I can repeat these tests on a non-exadata system.  The F20 cards used in my X-2 are not state of the art, so it’s possible that different results could be obtained with a more recent flash card, or with a less contrived workload.

However, yet again I’m gathering data that suggests that using flash for redo logs is not worthwhile.  I’d love to argue the point but even better than argument would be some hard data in either direction….

Exadata Write-back cache and free buffer waits

Prior to storage server software version 11.2.3.2.0 (associated with Exadata X3), Exadata Smart Flash Cache was a “write-through” cache, meaning that write operations are applied both to the cache and to the underlying disk devices, but are not signalled as complete until the IO to the disk has completed.

Starting with 11.2.3.2.0 of the Exadata storage software[1], Exadata Smart Flash Cache may act as a write-back cache. This means that a write operation is made to the cache initially and de-staged to grid disks at a later time. This can be effective in improving the performance of an Exadata system that is subject to IO write bottlenecks on the Oracle datafiles.

Writes to datafile generally happen as a background task in Oracle, and most of the time we don’t actually “wait” on these IOs. That being the case, what advantage can we expect if these writes are optimized? To understand the possible advantages of the write-back cache let’s review the nature of datafile write IO in Oracle and the symptoms that occur when write IO becomes the bottleneck.

When a block in the buffer cache is modified, it is the responsibility of the database writer (DBWR) to write these “dirty” blocks to disk. The DBWR does this continuously and uses asynchronous IO processing, so generally sessions do not have to wait for the IO to occur – the only time sessions wait directly on write IO is when a redo log sync occurs following a COMMIT.

However, should all the buffers in the buffer cache become dirty then a session may wait when it wants to bring a block into the cache – resulting in a “free buffer wait”.

image

Free buffer waits can occur in update-intensive workloads when the IO bandwidth of the Oracle sessions reading into the cache exceeds the IO bandwidth of the database writer. Because the database writer uses asynchronous parallelized write IO, and because all processes concerned are accessing the same files, free buffer waits usually happen when the IO subsystem can service reads faster than it can service writes.

There exists just such an imbalance between reads and write latency in Exadata X2 – the Exadata Smart Flash Cache accelerates reads by a factor of perhaps 4-10 times, while offering no comparable advantage for writes. As a result, a very busy Exadata X2 system could become bottlenecked on free buffer waits. The Exadata Smart Flash Cache write-back cache provides acceleration to datafile writes as well as reads and therefore reduces the chance of free buffer wait bottlenecks.

The figure below  illustrates the effectiveness of the write-back cache for workloads that encounter free buffer waits. The workload used to generate this data was heavily write-intensive with very little read IO overhead (all the necessary read data was in cache). As a result, it experienced a very high degree of free buffer waits and some associated buffer busy waits. Enabling the write-back cache completely eliminated the free buffer waits by effectively accelerating the write IO bandwidth of the database writer. As a result, throughput increased four fold.

image

However, don’t be misled into thinking that the write-back cache will be a silver bullet for any workload.  Workloads that are experiencing free buffer waits are likely to see this sort of performance gain. Workloads where the dominant waits are for CPU, read IO, global cache co-ordination, log writes and so on will be unlikely to see any substantial benefit from the write-back cache.   


[1] 11.2.3.2.1 is recommended as the minimum version for this feature as it contains fixes to significant issues discovered in the initial release.

Exadata Smart Flash Logging–Outliers

In my last post, I looked at the effect of the Exadata smart flash logging.  Overall,  there seemed to be a slight negative effect on median redo log sync times.  This chart (slightly different from the last post because of different load and configuration of the system), shows how there’s a “hump” of redo log syncs that take slightly longer when the flash logging is enabled:

image

But of course, the flash logging feature was designed to improve performance not of the “average” redo log sync, but of the “outliers”. 

In my tests, I had 40 concurrent processes writing redo as fast as they could.  Occasionally this would result in some really long wait times.  For instance, in this trace you see an outlier of 291,780 microseconds (the biggest outlier in my tests BTW) within an otherwise unremarkable set of waits:

WAIT #47124064145648: nam='log file sync' ela= 1043 buffer#=101808 sync scn=1266588527 p3=0 obj#=-1 tim=1347583167588250
WAIT #47124064145648: nam='log file sync' ela= 2394 buffer#=130714 sync scn=1266588560 p3=0 obj#=-1 tim=1347583167590888
WAIT #47124064145648: nam='log file sync' ela= 932 buffer#=101989 sync scn=1266588598 p3=0 obj#=-1 tim=1347583167592057
WAIT #47124064145648: nam='log file sync' ela= 291780 buffer#=102074 sync scn=1266588637 p3=0 obj#=-1 tim=1347583167884090
WAIT #47124064145648: nam='log file sync' ela= 671 buffer#=102196 sync scn=1266588697 p3=0 obj#=-1 tim=1347583167885294
WAIT #47124064145648: nam='log file sync' ela= 957 buffer#=102294 sync scn=1266588730 p3=0 obj#=-1 tim=1347583167886575

To see if the flash logging feature was successful in removing these outliers, I extracted the top 10,000 waits from each of the roughly 8,000,000 waits I recorded in each category.  Here’s a plot (non-logarithmic) of those waits:

image

So – the flash log feature was effective in eliminating or at least reducing very extreme outlying redo log sync times.    Most redo log sync operations will experience no improvement or maybe even a slight degradation. But for the small number of log syncs that would have experienced a really excessive delay, the feature works as advertised – it reduces the chance of really excessive log file syncs. 

In my opinion, this effect doesn't imply that the flash can process a redo log write faster than the magnetic disks - in fact probably the opposite is true.  But given two desitinations to choose from, we avoid really long delays that occur when one of the destinations only is overloaded. 

Exadata smart flash logging

Exadata storage software 11.2.2.4 introduced the Smart flash logging feature.  The intent of this is to reduce overall redo log sync times - especially outliers - by allowing the exadata flash storage to serve as a secondary destination for redo log writes.  During a redo log sync, Oracle will write to the disk and flash simultaneously and allow the redo log sync operation to complete when the first device completes. 

Jason Arneil reports some initial observations here, and Luis Moreno Campos summarized it here.

I’ve reported in the past on using SSD for redo including on Exadata and generally I’ve found that SSD is a poor fit for redo log style sequential write IO.  But this architecture should at least do now harm and on the assumption that the SSD will at least occasionally complete faster than a spinning disk I tried it out. 

My approach involved the same workload I’ve used in similar tests.  I ran 20 concurrent processes each of which performed 200,000 updates and commits – a total of 4,000,000 redo log sync operations.  I captured every redo log sync wait from 10046 traces and loaded them in R for analysis.

I turned flash logging on or off by using an ALTER IORMPLAN command like this (my DB is called SPOT):

ALTER IORMPLAN dbplan=((name='SPOT', flashLog=$1),(name=other,flashlog=on))'

And I ran “list metriccurrent where objectType='FLASHLOG'” before and after each run so I could be sure that flash logging was on or off.

When flash logging was on, I saw data like this:

Before:

     FL_DISK_FIRST                     FLASHLOG     32,669,310 IO requests
     FL_FLASH_FIRST                    FLASHLOG     7,318,741 IO requests
     FL_PREVENTED_OUTLIERS             FLASHLOG     774,146 IO requests

After:

      FL_DISK_FIRST                     FLASHLOG     33,201,462 IO requests
     FL_FLASH_FIRST                    FLASHLOG     7,337,931 IO requests
     FL_PREVENTED_OUTLIERS             FLASHLOG     774,146 IO requests

 

So for this particular cell the flash disk “won” only 3.8% of times (7,337,931-7,318,741)*100/(7,337,931-7,318,741+33,201,462-32,669,310) and prevented no “outliers”.  Outliers are defined as being redo log syncs that would have taken longer than 500 ms to complete. 

Looking at my 4 million redo log sync times,  I saw that the average and median times where statistically significantly higher when the smart flash logging was involved:

> summary(flashon.data$synctime_us) #Smart flash logging ON
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0   452.0   500.0   542.4   567.0  3999.0
> summary(flashoff.data$synctime_us) #Smart flash logging OFF
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   29.0   435.0   481.0   508.7   535.0  3998.0
> t.test(flashon.data$synctime_us,flashoff.data$synctime_us,paired=FALSE)

    Welch Two Sample t-test

data:  flashon.data$synctime_us and flashoff.data$synctime_us
t = 263.2139, df = 7977922, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
33.43124 33.93285
sample estimates:
mean of x mean of y
542.3583  508.6763

Plotting the distribution of redo log sync times we can pretty easily see that there’s actually a small “hump” in times when flash logging is on (note logarithmic scale):

image

This is of course the exact opposite of what we expect, and I checked my data very carefully to make sure that I had not somehow switched samples.  And I repeated the test many times and always saw the same pattern.  

It may be that there is a slight overhead to running the race between disk and flash, and that that overhead makes redo log sync times slightly higher.  That overhead may become more negligible on a busy system.  But for now I personally can’t confirm that smart flash logging provides the intended optimization and in fact I observed a small but statistically significant and noticeable degradation in redo log sync times when it is enabled.

Using SSD for redo on Exadata - pt 2

In my previous post on this topic, I presented data showing that redo logs placed on an ASM diskgroup created from exadata griddisks created from flash performed far worse than redo logs placed on ASM created from spinning SAS disks.

Of course, theory predicts that flash will not outperform spinning magnetic disk for the sequential write IOs experienced by redo logs, but on Exadata, flash disk performed much worse than seemed reasonable and worse than experience on regular Oracle with FusionIO SSD would predict (see this post).

Greg Rahn and Kevin Closson were both kind enough to help explain this phenomenon.  In particular, they pointed out that the flash cards might be performing poorly because of the default 512 byte redo block size and that I should try a 4K blocksize.   Unfortunately, at least on my patch level (11.2.2.3.2), there appears to be a problem with setting a 4K blocksize

ALTER DATABASE add logfile thread 1 group 9 ('+DATA_SSD') size 4096M blocksize 4096
*
ERROR at line 1:
ORA-01378: The logical block size (4096) of file +DATA_SSD is not compatible with the disk sector size (media sector size is 512 and host sector size is 512)

According to Greg, the F20 SSD cards are incorrectly reporting their physical characteristics and this is fixed in the current patch level.   Luckily, you can override the check by setting

ALTER SYSTEM SET "_disk_sector_size_override"=TRUE SCOPE=BOTH;

Greg and Kevin really know their stuff:  setting a 4k redo log block size resulted in dramatic improvements to redo log throughput – elapsed time reduced by 70%:

image

As expected,  redo log performance for SSD still slightly lags that of SAS spinning disks.     It’s clear that you can’t expect a performance improvement by placing redo on SSD, but at least the 4K blocksize fix makes the response time comparable.  Of course, with the price of SSD being what it is, and the far higher benefits provided for other workloads – especially random reads – it’s hard to see an economic rationale for SSD-based redo.    But at least with a 4K blocksize it’s tolerable.

When our Exadata system is updated to the latest storage cell software, I’ll try comparing workloads with the Exadata smart flash logging feature.

Using flash disk for Redo on Exadata

In this Quest white paper and on my SSD blog here,  I report on how using a FusionIO flash SSD compares with SAS disk for various configurations – datafile, flash cache, temp tablespace and redo log.  Of all the options I found that using flash for redo was the least suitable, with virtually no performance benefit:

image

That being the case,  I was surprised to see that Oracle had decided to place Redo logs on flash disk within the database appliance, and also that the latest release of the exadata storage cell software used flash disk to cache redo log writes (Greg Rahn explains it here).   I asked around at OOW hoping someone could explain the thinking behind this, but generally I got very little insight.

I thought I better repeat my comparisons between spinning and solid state disk on our Exadata system here at Quest.  Maybe the “super capacitor” backed 64M DRAM on each flash chip would provide enough buffering to improve performance.  Or maybe I was just completely wrong in my previous tests (though I REALLY don’t think so :-/).

Our Exadata 1/4 rack has a 237GB disk group constructed on top of storage cell flash disk.  I described how that is created in this post.   I chose 96GB per storage cell in order to allow the software to isolate the grid disks created on flash to 4 24GB FMODs (each cell has 16 FMODs).    Our Exadata system has fast SAS spinning disks – 12 per storage cell for a total of 36 disks.  Both the SAS and SSD disk groups had normal redundancy.

I ran an identical redo-intensive workload on the system using SAS or SSD diskgroups for the redo logs.  Redo logs were 3 groups of 4GB per instance.   I ran the workload on it’s own, and as10 separate concurrent sessions.  

The results shocked me:

image

When running at a single level of concurrency,  the SSD based ASM redo seemed to be around 4 times slower than the default SAS-based ASM redo.  Things got substantially worse as I upped the level of concurrency with SSD being almost 20 times slower.  Wow.

I had expected the SAS based redo to win – the SAS ASM disk group has 36 spindles to write to, while the SSD group is (probably) only writing to 12 FMODs.  And we know that we don’t expect flash to be as good as SAS disks for sequential writes.  But still, the performance delta is remarkable. 

Conclusion

 

I’m yet to see any evidence that putting redo logs on SSD is a good idea, and I keep observing data from my own tests indicating that it is neutral at best and A Very Bad Idea at worse.  Is anybody seeing any similar?  Does anybody think there’s a valid scenario for flash-based redo?

Configuring Exadata flash as grid disk

The default – or at least a very common - configuration for Exadata is to configure all the flash as Exadata Smart Flash Cache (ESFC).   This is a simple and generally performant configuration, but won’t be the best choice for all cases.  In particular, if you have table which is performance critical, and it could fit in the flash storage you have available, you might be better off configuring some of your flash as grid disk, creating an ASM disk group from that, and putting the table there.

Here’s the procedure:

1. Drop the flash cache, create a new flashcache of a smaller size, then create the griddisks from the unallocated space.  These CELLCLI commands do that:

CellCLI> drop flashcache
Flash cache exa1cel01_FLASHCACHE successfully dropped
CellCLI> create flashcache all size=288g
Flash cache exa1cel01_FLASHCACHE successfully created
CellCLI> create griddisk all flashdisk prefix=ssddisk

There’s 384G of flash on each storage cell, so the above commands create about 96G of SSD grid disk.   Run those commands on each cell node, perhaps by using the CCLI command (see this post for an example).

2. The above procedure will create disks in the format o/cellIpAddress/ssddisk_FD_*_cellnode.  Log into an ASM instance, and issue the following command to create a diskgroup from those disks:

SQL> 
  1  create diskgroup DATA_SSD normal redundancy disk 'o/*/ssddisk*'
  2  attribute 'compatible.rdbms'='11.2.0.0.0',
  3  'compatible.asm'='11.2.0.0.0',
  4  'cell.smart_scan_capable'='TRUE',
  5* 'au_size'='4M'

Alternatively you can use the database control for the ASM instance to create the new diskgroup.  Your new flash disks should show up as candidate disks.

The relative performance of flash disks, vs flash cache is similar in Exadata to what I’ve seen using the Database flash cache.  Placing an object directly on flash is faster than using the cache, although the cache is very effective.  Here’s the results for 200,000 primary key lookups across  1,000,000 possible primary keys:

image

Clearing the Exadata smart flash cache using dcli

I’ve been doing some performance benchmarks on our exadata box specifically focusing on the performance of the smart flash cache.  I found that even if I switched the CELL_FLASH_CACHE storage setting to NONE,  the flash cache will still keep cached blocks in flash and would therefore give me artificially high values for “cell flash cache read hits” statistic when I set CELL_FLASH_CACHE back to DEFAULT or KEEP.  What I needed was a way to flush the Exadata flash cache.

Unfortunately there doesn’t seem to be a good way to flush the flash cache – no obvious CELLCLI command.   Maybe I’ve missed something, but for now I’m dropping and recreating the flash cache before each run.

Luckily the dcli command lets me drop and recreate on each cell directly from the database node and even sets up passwordless connections.  Here’s how to do it.

Firstly, create a script that will drop and recreate the flash cache for a single cell:

$ cat flushcache.sh
cellcli <<!
drop flashcache;
create flashcache all;

Now, use ccli to execute that on each cell node (I have three named exa1cel01,exa1cel02,exa1cel03:

$ dcli -c exa1cel01,exa1cel02,exa1cel03 --serial -k -l userid -x flushcache.sh

The “-k” option copies the ssh key to the cell nodes which means that after the first execution you’ll be able to do this without typing in the password for each cell node.   The “—serial” option makes each command happen one after another rather than all at once – you probably don’t need this…

Anyone know a better way to flush the Exadata flash cache?

Best Practices for Optimizing Oracle RDBMS with Solid State Disk

I've been doing research onto the best use of flash SSD with the Oracle RDBMS over the past year and Quest has produced a whitepaper summarizing the findings.  You can get it here

Understanding the performance dynamics of SSD is critical in modern Oracle performance management.  I'm organizing my work on SSD Oracle performance into an online reference that I hope to continually update as I learn more. 

Finally, I'll be speaking on this topic at Oracle Open World in just a few weeks - here are the  session details:

Session: 3841
Title: Making the Most of Solid-State Disk in Oracle Database 11g
Time Tuesday, 05:00 PM, InterContinental - InterContinental Ballroom A-

 

 

 

 

Accelerating Oracle database performance with SSD

I was recently asked to provide advice on how best to accelerate Oracle database performance with flash SSD.   In this post I’ll attempt to outline some thoughts on that.

There’s some thinking that SSD will soon replace spinning disk altogether - we just have to wait for the price to come down.  However, I’m not so sure that this complete replacement is likely to occur in the immediate future.   There are two competing forces at work:

  • The economies of “big data”, which drive us to reduce the cost per unit of storage ($$/TB):   magnetic disk costs only about 5% of the cost of a high end SSD per terabyte.
  • The economies of high IO throughput, in which we are motivated to reduce the cost of providing a given IO rate.  SSD (PCi flash) can generate IO rates at 1/20th of the cost of magnetic disks.

These two competing economies are not likely to change in the immediate future:  magnetic disk can store large amounts of data much more cheaply than SSD;  SSD can deliver very high IO rates more cheaply than magnetic disk.  For most databases, it will not be cost effective to place the entire database on SSD – the best outcome will be obtained when we place parts of the database on SSD and some on magnetic disk.

SSD Performance basics

I’ve posted in the past on how SSD works with the Oracle flash cache,  and I’ve presented on SSD performance at OOW.  You might like to review those items.  However,  here’s a quick review of the basics:

  • All flash drives offer pretty good read performance – say 25 microseconds for a single page read
  • When inserting data into an empty page, performance is slower, but not awful – around 250 microseconds
  • In order to update an existing page, a block erase is required – much, much slower – maybe 2000 microseconds

image

Enterprise SSD vendors - Fusion IO, Virident, etc - all have sophisticated algorithms to avoid the write penalty associated with SSD block erase operations.  Amongst other techniques, they will maintain a pool of unused blocks.  When a page needs to be updated it will be marked as invalid in the original location and moved to one of these blocks.  Later on, garbage collection routines will clear up the invalid entries.  The result is that updates don't always have to incur the high overhead of block erase operations.

Nevertheless, you want to avoid placing write intensive files on a flash disk, because as the disk fills up with modified blocks - and especially if the write rate exceeds the garbage collection capabilities - you might see the disk slow down dramatically.

To summarize:

Flash-based disks perform reads much faster than writes, and can suffer from performance degradation if subjected to intensive sustained write operations.

 

Options for SSD deployment on Oracle

 

Given the performance characteristics of SSD,  how best to use SSD to boost Oracle performance?  There are a few options:

  • Put the entire database on SSD
  • Place selected segments on SSD
  • Use the 11GR2 flash cache
  • Put temporary tablespace on SSD
  • Put the redo logs on SSD

Let’s look at each of these in turn:

Put the entire Database on SSD

This works great – if you can afford it.  For most databases the cost of putting it all on SSD is too high.  Placing data that is rarely accessed on SSD is not very cost effective, because you are paying a high cost per GB, but not getting any benefit from the relatively cheap IO rates. 

Place selected segments on SSD

This is probably the best option if you don’t have enough SSD for the entire database, but it does require a bit of configuration.   We identify the segments with the most IO,  perhaps using a query on V$SEGMENT_STATISTICS:

 

We are looking for object that have a high read/write ratio, contribute to a significant number of reads, and are small enough to fit on our SSD.   We create a tablespace on SSD, then move these objects to that tablespace.

If our most read intensive tables are massive, then they might not fit on the SSD.  In this case we could consider partitioning them and placing the hottest partitions on SSD.  

Use the 11GR2 flash cache

If you have 11GR2 and are on a compatible platform (Solaris or Oracle Enterprise Linux), then you can setup the Oracle DB flash cache.  This is absolutely the easiest option – all you need to do is set a few parameters and bounce the database.  The flash cache automatically caches the hottest blocks and usually provides an excellent performance improvement.

The improvement is not as significant as moving objects directly onto flash storage, since there still has to be magnetic disk IO getting things on and off.  Also, the architecture tends to create a lot of writes to the SSD, which can challenge garbage collection.  The chart below sums up a typical outcome – a huge boost from the flash cache, but not as big a boost as we could get by putting objects directly on SSD:

image

Put temporary tablespace on SSD

 

Temporary tablespace IO (from sorts and hash operations) can sometimes be the most significant form of database IO, and often the entire temporary tablespace will fit on the SSD.  So should we consider putting temporary tablespace on the SSD?

I’m a bit reluctant to do this:  by definition temporary tablespace IO is equal parts reads and writes, and since we are motivated to avoid write IO to the SSD, this seems like a dubious option.  However, if your performance is absolutely dominated by temp tablespace IO and your SSD is tolerant of high write rates (eg has excellent garbage collection,etc) then it might be worth trying.  After all,  the SSD will definitely speed up the read side of temporary tablespace IO, even if the write IO does not get a significant boost. 

Put redo logs on SSD

Although redo logs are very IO intensive, it’s almost all write IO and therefore this is probably not the best option.   During sustained writes the Garbage Collection will probably break down and the disk might not end up performing that much better than a spinning disk.  I'd not recommend this. 

Conclusions

 

You can use SSD to boost database performance even if you don’t have enough for the entire database.  Best options are:

  1. Place segments with high read but low write rates directly on SSD
  2. Use the SSD with the 11GR2 database flash cache (if you are on OEL or Solaris)
  3. If temporary segment IO is absolutely your bottleneck – but only then – consider placing the temporary tablespace on SSD.

I don’t think using SSD for redo logs is a good idea.  The high rates of sustained sequential write IO is not the ideal type of IO for SSD.

Disabling the Intel X-25 E write cache

In my last posting on using SSD with Oracle, I said how impressed I was with the X-25 E SSD write performance.  However, at the OReilly MySQL conference last month, I attended a talk by Vadim Tkachenko and Morgan Tocker from Percona on An Overview of Flash Storage for Databases.  It was a great talk overall, but one important thing I learned is that the X-25 E has a volatile 64MB write cache.  What this means is that the X-25 can report that a block is written to disk when it is still within a RAM buffer within the device.  If the disk failed between the write to RAM and the write to flash then the data could be lost.
Read More

Flash tablespace vs. DB Flash Cache

In this post I'm going to report on some performance comparisons I've recently conducted on using SSD flash storage for datafiles vs. using the new Oracle 11GR2 database flash cache.  

It sometimes seems like I’ve been waiting for the the end of the spinning magnetic disk all my professional  life.   The technology is so ancient, so clearly limited and (argh) mechanical.  So the fact that Solid State Disk (SSD) is becoming more an more practical as a database storage medium - and directly supported by Oracle in 11GR2 - is very exciting.  

Using SSD as part of database storage can certainly yield big results, but it's important to understand the performance characteristics of flash SSD and make sure we don't use it inappropriately. 

Read More

More on the database flash cache

I'm eagerly awaiting my high-performance flash SSD (an Intel X-25 E), but in the meantime I've played a bit more with the database flash cache on the cheap hardware setup described in my last post.   Sometimes it can be useful to test new features on slow hardware, since you see phenomenon that don't occur when everything is running full speed.

I originally naively imagined that blocks would be copied into the flash cache by the Oracle server process .  Eg, that if I read from disk, I deposit the block in both the buffer cache or the flash cache.  However,  upon investigation it appears that blocks are moved from the buffer cache to the flash cache by the DBWR as they are about to be flushed from the buffer cache.  

This is of course, a much better approach.  The DBWR can write to the flash cache asynchronously, so that user sessions get the benefit - less time reading from magnetic disk - without having to wait while blocks are inserted into the flash cache. 

 

So the lifecycle of a block looks something like this:

Read More

Using the Oracle 11GR2 database flash cache

Oracle just released a patch which allows you to use the database flash cache on Oracle Enterprise Linux even if you don't have exadata storage.  The patch is the obscurely named:

  •    8974084:META BUG FOR FLASH CACHE 11.2PL BUGS TO BACKPORT TO 11.2.0.1 OEL

Once you install the patch you can use any old flash device as a database flash cache.  Below I've documented some initial dabbling on a very old server and a cheap usb flash device.   The results are not representative of the performance you'd get on quality hardware, but are still interesting, I think.

Read More