Correlation probability in Oracle
Being at the OReilly Strata conference has re-energized my inner statistic geek, so I thought I’d share a PL/SQL package I some years ago to calculate the probability associated with correlation coefficients calculated by Oracle.
Correlation is a statistical measure of the association between two variables. Oracle introduced direct support for correlation in the SQL language from at least 10g (may have even been 9i). So we can write SQL statements that calculate correlation. So for instance, this statement generates the correlation between the number of sorts and the elapsed times of SQL statements still cached in the SQL area:
The correlation coefficient “r” will be 1 if the two variables always increase or decrease together and –1 if one variable always increases as the other decreases. A correlation coefficient of .33 probably represents some relationship, but anyone who has done statistics knows that you should calculate the statistical significance – chance that the result did not arise from chance – before making assertions.
Oracle doesn’t report correlation probabilities, so I implemented the formula to calculate the probability in a PLSQL package. It’s here.
If you pass the correlation coefficient and the number of samples (rows input to the CORR function) into the CORR_PROG_PKG.PROB_R function then it will show you the probability associated with that correlation. So below we see the probability that this correlation is due to random fluctuation is effectively 0 (down to 4 significant figures):
Generally if the probability is less than .05 or .01 (one in 20 or 1 in 100 respectively) , then we would refer to the correlation as being statistically significant and that the two variables are related. So we would be justified in claiming there was a relationship between sorts and elapsed time.
Just for kicks, let’s use this technique to test the hypothesis that the Buffer Cache Hit ratio is a reliable indicator of performance. Here’s the result:
The correlation is very low and the probability that there is no association is very high – way above the .05 probability. In other words, in this database there is no statistically significant relationship between the elapsed time of an SQL and it’s buffer cache hit rate!