Tuesday, October 27, 2015

TD14.10 Statistics : “Threshold” option

When you submit a COLLECT STATISTICS statement in 14.10, it may or may not execute.  A decision is made at submission time as to whether there is value in recollecting these particular statistics.  That decision is only considered if threshold options are being used.

There are three levels of threshold options:

System threshold:   This is the default approach for applying thresholds on all 14.10 platforms.  It determines the appropriate threshold for each statistic by considering how much the underlying table has changed since the last collection.
The degree of change since the last collection is measured against the current state of the table, based on USECOUNT tracking of the inserts, deletes and updates performed.   If the change threshold has not been reached, and enough history has been collected for these statistics that the optimizer can perceive a pattern in the data and confidently extrapolate, the collection is skipped.
       The system threshold is controlled by a DBS Control record parameter called SysChangeThresholdOption.

DBA-defined global thresholds:  These thresholds override the system default, and rely on DBA-defined fixed percentages as thresholds.   Once set, all statistics collection statements use these global threshold values, unless overridden by the third level of threshold options at the statement level.
Two DBS Control parameters are involved:
DefaultUserChangeThreshold
DefaultTimeThreshold
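
For illustration, a DBA who wants a uniform 10 percent change threshold and a 30-day time threshold across the system would set these DBS Control fields to something like the following (the values are illustrative only):
DefaultUserChangeThreshold = 10
DefaultTimeThreshold = 30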

Thresholds on individual statements:  Optional USING clauses attached to COLLECT STATISTICS statements can override the system default or any global DBA-defined thresholds.  For example:

COLLECT STATISTICS USING THRESHOLD 8 PERCENT AND THRESHOLD 10 DAYS
COLUMN column1 ON Table1;


Whichever threshold level is in effect, if the optimizer determines that the threshold has not been met, no statistics will be collected, even though the statement was submitted.  When a collection has been requested but not executed, the StatsSkipCount column in the DBC.StatsV row that represents those statistics is incremented.
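
To see which collections are being skipped, you can query the skip counter directly.  A minimal sketch against DBC.StatsV (column names as documented for 14.10; verify them on your system):

SELECT DatabaseName,
       TableName,
       ColumnName,
       LastCollectTimeStamp,
       StatsSkipCount
FROM DBC.StatsV
WHERE StatsSkipCount > 0
ORDER BY StatsSkipCount DESC;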


DBQL USECOUNT and Thresholds

The recommended percent of change thresholds rely on having DBQL USECOUNT logging turned on. USECOUNT tracks inserts, deletes and updates made to tables within a database, and as a result, can provide highly accurate information to the optimizer about how the table has changed since the last statistics collection.

The default system threshold functionality can be applied to a statistics collection only if USECOUNT logging has been enabled for the database that the table belongs to.   In the absence of USECOUNT data, the default threshold behaviour is ignored. However, both DBA-defined global thresholds and statement-level thresholds can use percent-of-change thresholds even without USECOUNT logging, at the risk of less accuracy.

In cases where USECOUNT logging is not enabled, percent-of-change values are less reliable because the optimizer must rely on random-AMP sample comparisons.
Percent of change is the recommended way to express thresholds.


Example :
Pick a small, non-critical database.
Enable DBQL USECOUNT logging on that database:
BEGIN QUERY LOGGING WITH USECOUNT ON TestDB;
Disable the system threshold by setting the DBS Control parameter:
SysChangeThresholdOption = 3

Leave the global parameters disabled as they are by default:
DefaultUserChangeThreshold = 0

DefaultTimeThreshold = 0

Add USING THRESHOLD 10 PERCENT to the statistics collection statements just for the tables within your selected database:
COLLECT STATISTICS USING THRESHOLD 10 PERCENT COLUMN TestID ON TestDB.TestTable;   -- TestTable is an illustrative table name in TestDB
Insert a few rows into the table (fewer than 10% of the table's rows), then run an EXPLAIN of the statistics collection statement itself; it will tell you whether or not skipping is taking place.
EXPLAIN COLLECT STATISTICS COLUMN TestID ON TestDB.TestTable;
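
If the EXPLAIN shows skipping but you want to force a refresh anyway, 14.10 offers a one-time override.  A sketch, assuming the NO THRESHOLD FOR CURRENT option described for 14.10 (verify the exact syntax on your release); TestTable remains the illustrative table name:

-- collect now, ignoring thresholds, without changing any recorded threshold settings
COLLECT STATISTICS USING NO THRESHOLD FOR CURRENT COLUMN TestID ON TestDB.TestTable;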

Monday, October 26, 2015

Monitor Flow Control

Monitor flow control state:

Query the table DBC.ResUsageSAWT to monitor the flow-control state, and keep the snapshot interval in mind (10 minutes by default).


Column Definitions of DBC.ResUsageSAWT

TheDate: Returns the calendar date on which the event was logged.

NodeID: Returns Node ID.

TheTime: Returns the time of the event in the form hh:mm:ss.

VprId: Identifies the AMP vproc.

Active: Controls whether the rows are logged to the resource usage tables when Active Row Filter Mode is enabled.

MailBoxDepth: Current depth of the AMP work mailbox at the end of the period.
High AWT usage does not necessarily mean that the system is in flow control; the flow-control state is initiated only when all AMP worker tasks are in use and the message queues are full.
MailBoxDepth is often greater than 0 while the AMP is in flow control, but not always: this measure captures only the size of the message queue at the end of each snapshot interval. The AMP may have been in flow control during the interval even though the queue is empty at the end of it. Keep this in mind.

FlowControlled: Specifies if an AMP is in flow control. If the value is non-zero, then the AMP is in flow control.

FlowCtlCnt: Returns the number of times during this log period that the AMP entered the flow control state from a non-flow-controlled state.

FlowCtlTime: Returns total time, in milliseconds, that an AMP is in flow control.

InuseMax: Returns the maximum number of AMP Worker Tasks in use at any one time during the log period. By checking the InUseMax measure, you can see how the total system workload is consuming AMP worker tasks, i.e., how loaded your system is.

Available: Returns the number of unreserved AMP Worker Tasks from the pool that are not being used at the end of the interval.
Available shows the number of unused AMP worker tasks at the end of the snapshot interval. Several consecutive snapshots with Available = 0 mean that your system was out of AMP worker tasks over a longer time frame; see the query sketch below.
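
A minimal sketch of such a check, using only the columns described here (adjust or add a date filter to taste):

-- count snapshot intervals per AMP in which no unreserved AWTs were free
SELECT TheDate,
       NodeID,
       VprId AS TheAMP,
       COUNT(*) AS ExhaustedIntervals
FROM DBC.ResUsageSAWT
WHERE Available = 0
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;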

AvailableMin: Returns the minimum number of unreserved AMP Worker Tasks available in the pool for each AMP during the logged period. In other words, it marks the point during the snapshot interval when the largest number of AMP worker tasks was in use.

AwtLimit: Returns the current AMP Worker Task limit (for example, 80, 100, and so on) from the DBS Control MaxLoadAWT field.

WorkTypeInuse00 - 15: Returns the current number of AMP Worker Tasks in use for each work type.

WorkTypeMax00 - 15: Returns the maximum number of AMP Worker Tasks in use at one time during the log period for each work type for the VprId vproc.

-- AWT usage and flow-control detail per AMP, for intervals that saw flow control
SELECT
TheDate
,TheTime
,Secs
,VPRID as TheAMP
,MailBoxDepth
,FlowControlled
,FlowCtlCnt
,InUseMax
,Available
,AvailableMin
,AWTLimit
,WorkTypeMax00 AS DispatcherStep
,WorkTypeMax01 AS Spawned_Level1
,WorkTypeMax02 AS Spawned_Level2
,WorkTypeMax03 AS InternalWork
,WorkTypeMax04 AS Recovery
,WorkTypeMax08 AS ExpeditedDispatcherStep
,WorkTypeMax09 AS ExpeditedSpawned_Level1
,WorkTypeMax10 AS ExpeditedSpawned_Level2
,WorkTypeMax12 AS AbortStep
,WorkTypeMax13 AS SpawnedWorkAbort
,WorkTypeMax14 AS UrgentInternalWork
,WorkTypeMax15 AS MostUrgentInternalWork
FROM DBC.ResUsageSAWT
WHERE TheDate = DATE - 2
AND FlowCtlTime > 0;
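
For a day-level picture rather than individual snapshots, the same columns can be rolled up. A sketch:

-- daily flow-control totals per AMP, limited to AMPs that spent time in flow control
SELECT TheDate,
       VprId AS TheAMP,
       SUM(FlowCtlCnt) AS FlowCtlEvents,
       SUM(FlowCtlTime) AS FlowCtlMsecs
FROM DBC.ResUsageSAWT
GROUP BY 1, 2
HAVING SUM(FlowCtlTime) > 0
ORDER BY 1, 2;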


Wednesday, September 2, 2015

Statistics collection in Teradata 14.0

Teradata 14.0 offers some very helpful enhancements to the statistics collection process.   This posting discusses a few of the key ones, with an explanation of how these enhancements can be used to streamline your statistics collection process and help your statistics be more effective.
For more detail on these and other statistics collection enhancements, please read the orange book titled Teradata 14.0 Statistics Enhancements, authored by Rama Korlapati, Teradata Labs. 

New USING options add greater flexibility

In Teradata 14.0 you may optionally specify a USING clause within the COLLECT STATISTICS statement.  As an example, here are the three new USING options available in 14.0, with parameters you might use:
. . . USING MAXINTERVALS 300

. . . USING MAXVALUELENGTH 50

. . . USING SAMPLE 10 PERCENT
MAXINTERVALS allows you to increase or decrease the number of intervals, one statistic at a time, in the new version 5 statistics histogram.  The default maximum number of intervals is 250; the valid range is 0 to 500.  A larger number of intervals can be useful if you have widespread skew on a column or index you are collecting statistics on, and you want more individual high-row-count values to be represented in the histogram.   Each statistics interval highlights its single most popular value, which it designates as its “mode value”, and lists the number of rows that carry that value.  By increasing the number of intervals, you provide the optimizer an accurate row count for a greater number of popular values.
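For example, a statement that widens the histogram for a heavily skewed column might look like this (the table and column names are illustrative):
COLLECT STATISTICS
        USING MAXINTERVALS 500
        COLUMN ( O_CustomerID )
ON CAB.orders;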
MAXVALUELENGTH lets you expand the length of the values contained in the histogram for that statistic.  The new default length is 25 bytes, where previously it was 16.  If needed, you can specify well over 1000 bytes for a maximum value length.  No padding is done to the values in the histogram, so only values that actually need that length will incur the space (which is why the parameter is named MAXVALUELENGTH instead of VALUELENGTH).   The 16-byte limit on value sizes in earlier releases was always padded to full size: even if your statistics value was one character, it used the full 16 bytes to represent it.
Another improvement around value lengths has to do with multicolumn statistics.  In earlier releases the 16-byte limit for values in the intervals was taken from the beginning of the combined value string, and truncation took place at the end.  Truncation still takes place at the end of the combined string in 14.0, but this release can support a much longer combined value.   In addition, during the aggregation process that builds up the statistical histogram, each column within the statistic has its first 32 bytes represented when determining the number of distinct values, ensuring that a significant number of bytes in each column contributes to understanding how distinct the combination of column values is.
SAMPLE n PERCENT allows you to specify sampling at the individual statistics collection level, rather than at the system level.  This makes it easy to apply different levels of statistics sampling to different columns and indexes.
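For instance, to sample ten percent of the rows for one statistic while leaving other statistics at full collection (names again illustrative):
COLLECT STATISTICS
        USING SAMPLE 10 PERCENT
        COLUMN ( O_OrderID )
ON CAB.orders;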
Here's another example of how this USING syntax might look:
COLLECT STATISTICS
        USING MAXVALUELENGTH 50
        COLUMN ( P_NAME )
ON CAB.product;

Combining multiple collections in one statement

Statistics collection statements for the same table that share the same USING options, and that request full statistics (as opposed to sampled), can now be grouped syntactically.  In fact, it is recommended that once you are on 14.0 you collect all such statistics on a table as one group.  The optimizer will then look for opportunities to overlap the collections wherever possible, reducing the time and resources needed to perform the statistics collection.
Here is an example.
The old way:
COLLECT STATISTICS COLUMN
(o_orderdatetime,o_orderID)
ON Orders;
COLLECT STATISTICS COLUMN
(o_orderdatetime)
ON Orders;
COLLECT STATISTICS COLUMN
(o_orderID)
ON Orders;
The new, recommended way: 
COLLECT STATISTICS
  COLUMN (o_orderdatetime,o_orderID)
, COLUMN (o_orderdatetime)
, COLUMN (o_orderID)
ON Orders;
This is particularly useful when the same column appears in single and also multicolumn statistics, as in the example above.   In those cases the optimizer will perform the most inclusive collection first (o_orderdatetime,o_orderID), and then re-use the spool built for that step to derive the statistics for the other two columns.  Only a single table scan is required, instead of 3 table scans using the old approach.  
Sometimes the optimizer will choose to perform separate collections (scans of the table) the first time it sees a set of bundled statistics. But based on demographics it has available from the first collection, it may come to understand that it can group future collections and use pre-aggregation and rollup enhancements to satisfy them all in one scan.   
But you have to remember to re-code your statistics collection statements when you get to 14.0 in order to realize these savings.

Summary statistics

New in Teradata 14.0, table-level statistics known as “summary statistics” are collected alongside the column or index statistics you request.  Summary statistics do not cause their own histogram to be built; rather, they create a short listing of facts about the table undergoing collection, held in the new DBC.StatsTbl. It is a very fast operation.   Summary stats report on things such as the table’s row count, average block size, and some metrics around block-level compression and (in the future) temperature.  An example of actual execution times that I ran is shown below, comparing regular column statistics collection against summary statistics collection for the same large table.  Time is reported in MM:SS:
COLLECT STATISTICS ON Items COLUMN I_ProductID;
 Elapsed time (mm:ss):  9:55

COLLECT SUMMARY STATISTICS ON Items;
 Elapsed time (mm:ss):  00:01
You can request summary statistics for a table explicitly, but even if you never do, each individual statistics collection statement causes summary stats to be gathered.  For this reason, it is recommended that you group your statistics collections against the same table into one statement, in order to avoid even the small overhead of building summary stats repeatedly for the same table within the same script.
There are several benefits to having summary statistics.  One critical advantage is that the optimizer now uses summary stats to get the most up-to-date row count for the table, providing more accurate extrapolations.  It no longer needs to depend on primary index or PARTITION stats, as was the case in earlier releases, to perform good extrapolations when it finds statistics on a table to be stale.
Here’s an example of what the most recent summary statistic for the Items table looks like:
SHOW SUMMARY STATISTICS VALUES ON Items;
 
COLLECT SUMMARY STATISTICS
            ON CAB.Items
            VALUES
(
 /** TableLevelSummary **/
 /* Version           */ 5,
 /* NumOfRecords      */ 50,
 /* Reserved1         */ 0.000000,
 /* Reserved2         */ 0.000000,
  
 /* SummaryRecord[1] */
 /* Temperature           */ 0,
 /* TimeStamp             */ TIMESTAMP '2011-12-29 13:30:46',
 /* NumOfAMPs             */ 160,
 /* OneAMPSampleEst       */ 5761783680,
 /* AllAMPSampleEst       */ 5759927040,
 /* RowCount              */ 5759985050,
 /* DelRowCount           */ 0,
 /* PhyRowCount           */ 5759927040,
 /* AvgRowsPerBlock       */ 81921.871617,
 /* AvgBlockSize          */ 65024.000000,
 /* BLCPctCompressed      */ 0.00,
 /* BLCBlkUcpuCost        */ 0.000000,
 /* BLCBlkURatio          */ 0.000000,
 /* RowSizeSampleEst      */ 148.000000,
 /* Reserved2             */ 0.000000,
 /* Reserved3             */ 0.000000,
 /* Reserved4             */ 0.000000
);