Wednesday, September 2, 2015

statistics collection in Teradata 14.0

Teradata 14.0 offers some very helpful enhancements to the statistics collection process.   This posting discusses a few of the key ones, with an explanation of how these enhancements can be used to streamline your statistics collection process and help your statistics be more effective.
For more detail on these and other statistics collection enhancements, please read the orange book titled Teradata 14.0 Statistics Enhancements, authored by Rama Korlapati, Teradata Labs. 

New USING options add greater flexibility

In Teradata 14.0 you may optionally specify a USINGclause within the collect statistics statement.  As an example, here are the 3 new  USINGoptions that are available in 14.0 with parameters you might use:
1
2
3
4
5
. . . USING MAXINTERVALS 300
 
. . . USING MAXVALUELENGTH 50
 
. . . USING SAMPLE 10 PERCENT
MAXINTERVALS allows you to increase or decrease the number of intervals one statistic at a time in the new version 5 statistics histogram.  The default maximum number of intervals is 250.  The valid range is 0 to 500.  A larger number of intervals can be useful if you have widespread skew on a column or index  you are collecting statistics on, and you want more individiual high-row-count values to be represented in the histogram.   Each statistics interval highlights its single most popular value, which is designates as its “mode value” and lists the number of rows that carry that value.  By increasing the number of intervals,  you will be providing the optimizer an accurate row count for a greater number of popular values.
MAXVALUELENGTH lets you expand the length of the values contained in the histogram for that statistic.  The new default length is 25 bytes, when previously it was 16.  If needed, you can specify well over 1000 bytes for a maximum value length.  No padding is done to the values in the histogram, so only values that actually need that length will incur the space (which is why the parameter is named MAXVALUELENGTH instead of VALUELENGTH).   The 16-byte limit on value sizes in earlier releases was always padded to full size.  Even if you statistics value was one character, you used the full 16 bytes to represent it.     
Another improvement around value lengths has to do with multicolumn statistics.  In earlier releases the 16 byte limit for values in the intervals was taken from the beginning of the combined value string, and truncation took place at the end.  Truncation will still take place at the end of the combined string in 14.0, however in this more current release you can support a much longer combined value.   In addition, during the aggregation process that builds up the statistical histogram, each column within the statistic will be able to have its its first 32 bytes represented when determining the number of distinct values, ensuring that a significant  number of bytes in each column can contribute to understanding how distinct the combination of column values are.
SAMPLE n PERCENTallows you to specify sampling at the individual statistics collection level, rather than at the system level.  This allows you to easily apply different levels of statistics sampling to different columns and indexes.  
Here's an example of how this USINGsyntax might look:
COLLECT STATISTICS
        USING MAXVALUELENGTH 50
        COLUMN ( P_NAME )
ON CAB.product;

Combining multiple collections in one statement

Statistic collection statements for the same table that share the same USINGoptions, and that request full statistics (as opposed to sampled), can now be grouped syntactically.  In fact it is recommended that once you get on 14.0 that you collect all such statistics on a table as one group.  The optimizer will then look for opportunities to overlap the collections, wherever possible, reducing the time to perform the statistics collection and the resources it uses. 
Here is an example
The old way:
COLLECT STATISTICS COLUMN
(o_orderdatetime,o_orderID)
ON Orders;
COLLECT STATISTICS COLUMN
(o_orderdatetime)
ON Orders;
COLLECT STATISTICS COLUMN
(o_orderID)
ON Orders;
The new, recommended way: 
COLLECT STATISTICS
  COLUMN (o_orderdatetime,o_orderID)
, COLUMN (o_orderdatetime)
, COLUMN (o_orderID)
ON Orders;
This is particularly useful when the same column appears in single and also multicolumn statistics, as in the example above.   In those cases the optimizer will perform the most inclusive collection first (o_orderdatetime,o_orderID), and then re-use the spool built for that step to derive the statistics for the other two columns.  Only a single table scan is required, instead of 3 table scans using the old approach.  
Sometimes the optimizer will choose to perform separate collections (scans of the table) the first time it sees a set of bundled statistics. But based on demographics it has available from the first collection, it may come to understand that it can group future collections and use pre-aggregation and rollup enhancements to satisfy them all in one scan.   
But you have to remember to re-code your statistics collection statements when you get on 14.0 in order to experience this savings.

Summary statistics

New in Teradata 14.0, table-level statistics known as “summary statistics”  are collected alongside of the column or index statistics you request.  Summary statistics do not cause their own histogram to be built, but rather they create a short listing of facts about the table undergoing collection that are held in the newDBC.StatsTbl. It is a very fast operation.   Summary stats report on things such as the table’s row count, average block size, and some metrics around block level compression and (in the future) temperature.  An example of actual execution times that I ran are shown below, comparing regular column statistics collection against summary statistics collection for the same large table.  Time is reported in MM:SS:
1
2
COLLECT STATISTICS ON Items COLUMN I_ProductID;
 Elapsed time (mm:ss):  9:55 
1
2
COLLECT SUMMARY STATISTICS ON Items;
 Elapsed time (mm:ss):  00:01
You can request summary statistics for a table, but even if you never do that, each individual statistics collection statement causes summary stats to be gathered.  For this reason, it is recommended that you group your statistics collections against the same table into one statement, in order to avoid even the small overhead involved in building summary stats repetively for the same table within the same script. 
There are several benefits in having summary statistics.  One critical advantage is that the optimizer now  uses summary stats to get the most up-to-date row count from the table in order to provide more accurate extrapolations.  It no longer needs to depend on primary index or PARTITIONstats, as was the case in earlier releases,  to perform good extrapolations when it finds statistics on a table to be stale.
Here’s an example of what the most recent summary statistic for the Items table looks like:
SHOW SUMMARY STATISTICS VALUES ON Items;
 
COLLECT SUMMARY STATISTICS
            ON CAB.Items
            VALUES
(
 /** TableLevelSummary **/
 /* Version           */ 5,
 /* NumOfRecords      */ 50,
 /* Reserved1         */ 0.000000,
 /* Reserved2         */ 0.000000,
  
 /* SummaryRecord[1] */
 /* Temperature           */ 0,
 /* TimeStamp             */ TIMESTAMP '2011-12-29 13:30:46',
 /* NumOfAMPs             */ 160,
 /* OneAMPSampleEst       */ 5761783680,
 /* AllAMPSampleEst       */ 5759927040,
 /* RowCount              */ 5759985050,
 /* DelRowCount           */ 0,
 /* PhyRowCount           */ 5759927040,
 /* AvgRowsPerBlock       */ 81921.871617,
 /* AvgBlockSize          */ 65024.000000,
 /* BLCPctCompressed      */ 0.00,
 /* BLCBlkUcpuCost        */ 0.000000,
 /* BLCBlkURatio          */ 0.000000,
 /* RowSizeSampleEst      */ 148.000000,
 /* Reserved2             */ 0.000000,
 /* Reserved3             */ 0.000000,
 /* Reserved4             */ 0.000000
);