Optimizing GROUP and ORDER BY

    

Starting with Oracle 10.2,  you may notice a significant degradation in relative performance when you combine a GROUP BY with an ORDER BY on the same columns. 

Oracle introduced a hash-based GROUP BY in 10.2.  Previously, a GROUP BY operation involved sorting the data on the relevent columns, then accumulating aggregate results.   The hash GROUP BY method creates the aggregates without sorting, and is almost always faster than sort-based GROUP BY.  Unfortunately, when you include an ORDER BY clause on the same columns as those in the GROUP BY Oracle reverts to the older sort-based GROUP BY with a corresponding drop in performance.   However, you can reword your SQL to take advantage of the hash-based GROUP BY while still getting your data in the desired order.

For example, consider this simple statement:

Prior to 10.2, the statement would be executed using a SORT GROUP BY operation:

 

 

From 10.2 onwards, we can expect to see the HASH GROUP BY :

As Alex Gorbachev  noted,  the new GROUP BY can return bad results in early versions (prior to 11.1.0.7 or 10.2.0.4).  You can disable by setting the parameter _GBY_HASH_AGGREGATION_ENABLED to FALSE.  Below, we use the OPT_PARAM hint to set this for an individual SQL;  you could also use ALTER SESSION or ALTER SYSTEM to change the parameter at the session or instance level:

 

The other thing to remember is that you can't rely on GROUP BY returning rows in order;  prior to 10.2 GROUP BY  would usually return the rows in the GROUP BY order, and some of us may have consequently not bothered to add an ORDER BY clause.  On upgrading to 10.2, you might have been surprised to see queries suddenly return data in apparently random order.  Tom Kyte talks about this here:  as a general rule you should never  rely on a side-effect to get rows in a particular order.  If you want them in order, you should always specify the ORDER BY clause.

Generally speaking, the new hash GROUP BY is much more efficient than the older sort method.   Below we see the relative performance for the two GROUP BY algorithms when grouping a sample table of 2.5 million rows into about 200,000 aggregate rows:

Your results may vary of course, but I've not seen a case where a SORT GROUP BY outperformed a HASH GROUP BY. 

Unfortunately,  Oracle declines to use the HASH GROUP BY in some of the circumstances in which it might be useful.  It's fairly common to have a GROUP BY and an ORDER BY on the same columns.  After all, you  usually don't want an aggregate report to be in random order. 

When Oracle uses the SORT GROUP BY, the rows are returned in the grouping order as a side effect of the sort.  So in the example below, there is just one SORT - it supports both the GROUP BY and ORDER BY (note that i've turned off the HASH GROUP BY using the OPT_PARAM hint):

 

The above plan is what you'd expect to see prior to 10.2 - since the HASH GROUP BY is not available in that release. 

However, when we examine the execution plan in 11g or 10.2  we find that Oracle still chooses the SORT GROUP BY:

 

Here's the important point:

When you combine a GROUP BY and an ORDER BY on the same column list,  Oracle will not use the HASH GROUP BY option.

 

Presumably, the optimizer "thinks" that since the SORT GROUP BY allows Oracle to get the rows in sorted order while performing the aggregation, it's best to use SORT GROUP BY when the SQL requests an ORDER BY as well as the GROUP BY.  However, this logic is seriously flawed.  The inputs to the ORDER BY will normally be far fewer rows than the inputs in to the GROUP BY.  In our example above, the GROUP BY processes about 2.5 million rows, while the ORDER BY sorts only about 200,000 rows:  it really doesn't make sense to de-optimize the expensive GROUP BY to optimize a relatively cheap ORDER BY. 

Is there a way to force Oracle to use the HASH GROUP BY even if we have the ORDER BY?  I'm not aware of an optimizer parameter or hint, but I was able to persuade Oracle to use the HASH GROUP BY by putting the GROUP BY in a subquery, the ORDER BY in the outer query, and using the NO_MERGE hint to avoid having the subquery merged into the outer query.  Here's my query and execution plan showing that I get a HASH GROUP BY together with SORT ORDER BY:

 

 

You might think that doing a single SORT GROUP BY is better than doing both a HASH GROUP BY and a SORT ORDER BY.  But remember,the SORT ORDER BY only has to sort the grouped rows - about 200,000 in my example - while the GROUP BY has to process the entire contents of the table - about 2.5 million in my sample table.  So optimizing the GROUP BY is often more important than avoiding a small second sort.

Here's a comparison of performance between the two approaches:

The rewrite reduced elapsed time by about 2/3rds.

Conclusion

When a GROUP BY is associated with an ORDER BY on the same columns,  the Oracle optimizer may choose a SORT GROUP BY rather than the usually more efficient HASH GROUP BY.  Using the SORT GROUP BY avoids adding a SORT ORDER BY to the plan, but the overall result is usually disappointing.

To get a better result, you can perform the GROUP BY in an in-line view and perform the ORDER BY in the outer query.  Use the NO_MERGE hint to prevent the two operations from being combined.