Wednesday, October 28, 2015

AVG: What does it Mean?

AVG is an aggregate function in SQL to compute the "average" of a set of values. More precisely, it computes the mean of those values. And even more precisely, AVG computes what is known as the arithmetic mean. The underlying math is so deceptively simple, it's easy to believe one understands the result and how to apply it. But keep reading! Learning about "average" is like uncovering an iceberg. There's more to it than meets the eye.
SQL Statistic Series: You are reading part of an ongoing series covering the statistical functions available in the Oracle Database implementation of SQL.

Measure of Central Tendency

The arithmetic mean is considered a measure of central tendency. Computing such a measure is an attempt to come up with a single value through which to conveniently characterize, discuss, and compare different data sets.
Figure 1 shows the math. It's very simple: Just sum the values and divide by their quantity. The first three examples show higher values dragging the mean upward. The final example piles on some low values to drag the mean downward. What you see is characteristic of the arithmetic mean. Any value added above or below the mean will move the mean in the direction of the new value. The move may be slight, or it may be large, but it will be non-zero.
Figure 1. Examples of computing the arithmetic mean
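If you'd like to see that equivalence in SQL rather than on paper, here's a minimal sketch. The values are arbitrary (not the ones from Figure 1); the point is only that AVG returns exactly the sum of the values divided by their count:
SELECT AVG(n)            AS "Mean via AVG",
       SUM(n) / COUNT(n) AS "Sum over count"
FROM (SELECT 1 AS n FROM DUAL UNION ALL
      SELECT 4 FROM DUAL UNION ALL
      SELECT 10 FROM DUAL);
-- both columns return 5: (1 + 4 + 10) / 3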

Mean as Representative Value

An ideal use of the arithmetic mean is to provide a single representative value to stand in for a collection of values when you're using that value in some calculation. You want the math to work out as if you had looked at each value individually, but without the tedium of working through each separate computation.
Example Data: Download and execute avgExamples.sql to create the example tables and data used in this article.
Pretend you're managing a combination gas station and convenience store. You're open 24 hours a day, seven days per week. It is January 2013. Your target revenue goal for the month is $500,000. Divide by 31 days in January, and you must push at least $16,129 in sales per day. The following table tracks your daily sales progress:
SELECT *
FROM daily_sales
WHERE daily_date >= DATE '2013-01-01'
  AND daily_date < DATE '2013-02-01'
ORDER BY daily_date;

DAILY_DATE          SALES_AMT
------------------ ----------
01-JAN-13               17032
02-JAN-13               15043
03-JAN-13               16783
04-JAN-13               23343
05-JAN-13               25703
06-JAN-13               19538
07-JAN-13               18001
08-JAN-13               17154
09-JAN-13               26235
10-JAN-13               14983
11-JAN-13               16010
12-JAN-13                4300
13-JAN-13                2874
Your numbers look good until the 12th, when they fall off a cliff. Freezing rain struck your area that day. Ice has wreaked havoc on tree limbs and power lines, leaving you without power for most of two days. Gasoline has stayed in the underground tanks, unpumped. Coffee pots have been cold. For two days all you've sold are potato chips and soda pop. You are worried. Your early numbers are good, but have these two days without power been enough to put you below where you need to be?
Knowing you need to average $16,129 per day for the month, you sit down at your computer and type in a query:
SELECT ROUND(AVG(sales_amt))
FROM daily_sales
WHERE daily_date >= DATE '2013-01-01'
  AND daily_date < DATE '2013-02-01';

ROUND(AVG(SALES_AMT))
---------------------
16692
Whew! This is good news. Even with those two very bad days, you're still trending above your $16,129 target. Your hard work at exceeding plan is paying off by providing you a buffer in these difficult circumstances.
Looking for even more reassurance, you query your prior two months' sales:
SELECT EXTRACT (MONTH FROM daily_date) AS "Month", 
       ROUND(AVG(sales_amt)) AS "Avg Sales"
FROM daily_sales
WHERE daily_date >= DATE '2012-11-01'
  AND daily_date < DATE '2013-01-01'
GROUP BY EXTRACT (MONTH FROM daily_date)
ORDER BY EXTRACT (MONTH FROM daily_date);

     Month Avg Sales
---------- ----------
        11      18817
        12      18803
These numbers are great news. Your current mean daily sales is just barely above the line at $16,692, but your track record from the prior two months indicates significant upside potential.
For one last bit of reassurance, you decide to query a running average for the month of December. You want to see how the mean daily sales changed as the month progressed. Figure 2 illustrates. Notice how each day's mean sales amount takes into account the sales history from all the prior days.
Figure 2. The concept of a running average
SQL window functions make running (and moving) averages fairly easy. Following is a query making use of AVG...OVER to compute the running mean daily sales for December 2012. See my article You've Got Framing! if you're unfamiliar with the OVER clause and want more detail on how it works in this scenario.
SELECT daily_date, 
       sales_amt, 
       ROUND(AVG(sales_amt) OVER (ORDER BY daily_date)) AS "Running Avg"
FROM daily_sales
WHERE daily_date >= DATE '2012-12-01'
  AND daily_date < DATE '2013-01-01'
ORDER BY daily_date;

DAILY_DATE          SALES_AMT Running Avg
------------------ ---------- -----------
01-DEC-12               14530       14530
02-DEC-12               14230       14380
03-DEC-12               15830       14863
...
22-DEC-12               19235       17610
23-DEC-12               20133       17720
24-DEC-12               22839       17933
25-DEC-12               21833       18089
26-DEC-12               20593       18185
27-DEC-12               20001       18252
28-DEC-12               19848       18309
29-DEC-12               23134       18476
30-DEC-12               22783       18619
31-DEC-12               24323       18803
These results are excellent news! They show a clear and sharp movement upwards in the running average toward the end of the month. Querying for November sales, you see the very same pattern in that month. There's every reason to believe you can repeat the pattern for January. You confidently report to your regional manager that the power loss is a minor blip and nothing to be concerned about. Your daily mean sales remain above the mark; you are on track to end the month ahead of plan.
Note: Means can be useful in communicating the big picture swiftly. Recite a list of daily sales numbers and a busy manager's eyes will glaze over. Give him a single number instead, such as the mean daily sales so far in the month, and you've conveyed the big picture without burying the manager in the details.
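Before moving on: the same AVG...OVER technique also produces moving averages. The following sketch (shown without output; I haven't run it against the example data here) limits each day's mean to a trailing seven-day window instead of the whole month-to-date:
SELECT daily_date,
       sales_amt,
       ROUND(AVG(sales_amt) OVER (ORDER BY daily_date
                                  ROWS BETWEEN 6 PRECEDING AND CURRENT ROW))
         AS "7-Day Moving Avg"
FROM daily_sales
WHERE daily_date >= DATE '2012-12-01'
  AND daily_date < DATE '2013-01-01'
ORDER BY daily_date;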

Nulls and their Effect

SQL's AVG function ignores nulls. This is by design, and it's a good decision. If you must, you can apply a function such as COALESCE to coerce a null to some other value such as zero. Here's a trivial example to illustrate:
SELECT AVG(sales_amt) AS "Normal",
       AVG(COALESCE(sales_amt, 0)) AS "Coerced"
FROM null_demo;

    Normal    Coerced
---------- ----------
         3        1.2
And here is the underlying data:
SET NULL "NULL"
SELECT * FROM null_demo;

DAILY_DATE          SALES_AMT
------------------ ----------
01-FEB-13                   2
02-FEB-13                   4
03-FEB-13                NULL
04-FEB-13                NULL
05-FEB-13                NULL
The normal invocation of AVG takes only the non-null values 2 and 4 into account. It returns the mean as (2 + 4) / 2 = 3. The second invocation treats nulls as if they were zeros. The rather odd effect is to return a mean that does not even fall between the values actually present in the table. Such a mean has no meaning. What possible business meaning can the value 1.2 convey in this scenario? None at all. Think about the following two statements:
  • Average sales are $3.00 dollars per day, but we are missing data for three of the five days.
  • Average sales are $1.20 dollars per day, and I'm going to leave you with the impression that no data is missing.
Which of these would help you make a better business decision? Would you rather hear the truth? Or bury your head in the proverbial sand?
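If you'd rather tell the truth, report the mean and the amount of missing data side by side. COUNT(*) counts every row, while COUNT(sales_amt) counts only the non-null values, so their difference is the number of missing days. A minimal sketch against the null_demo table:
SELECT AVG(sales_amt)              AS "Mean (nulls ignored)",
       COUNT(*) - COUNT(sales_amt) AS "Days missing"
FROM null_demo;
-- with the null_demo data above, the mean is 3 and three days are missing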

Weighted Means

Sometimes it's not enough to simply take the average of a single column. Picture yourself a fleet manager responsible for a small fleet of delivery trucks. You're responsible for managing fuel costs to an average of $3.50 or less per gallon for the month. Fail, and you lose your bonus pay. Wanting to get a quick sense of how things are going, you execute the following query:
SELECT ROUND(AVG(PRICE),2) AS "Avg Per Gal"
FROM fleet_fuel
WHERE buy_date >= DATE '2013-01-01'
  AND buy_date < DATE '2013-02-01';

Avg Per Gal
-----------
       3.46
This result looks great. Doesn't it? You've applied the AVG function. You've gotten a result. You're well under the mark. You feel your bonus is secure. Except you're wrong.
You've forgotten to factor in the number of gallons purchased at each price. Each per-gallon purchase price needs to be weighted by the number of gallons purchased at that price. The AVG function doesn't help here. You can apply SUM instead, and do the weighting math yourself. Here's how:
  1. Begin with gallons * price to weight each price by the number of gallons sold.
  2. Apply the SUM function, as in SUM(gallons * price), to give a total of all you've spent on fuel so far during the month.
  3. Divide by the number of gallons to get the mean price per gallon.
Figure 3 illustrates this logic and why it works. Callout #1 shows the data for two fuel purchases. Callout #2 shows a naive and incorrect approach to computing the mean. You do get a mean of two values, but that mean does not mean what it should mean. Callout #3 highlights the correct approach. Multiplying each price by the number of gallons purchased at that price gives the effect of having a record for each individual gallon purchased. The resulting mean is then truly the mean price per gallon.
Figure 3. Weighting the mean
Fortunately, the needed SQL expression is far easier to write than Figure 3 was to draw. Here's a query implementing the logic from Figure 3 for a weighted mean:
SELECT ROUND(SUM(gallons * price)/SUM(gallons),2) AS "Avg Per Gal"
FROM fleet_fuel
WHERE buy_date >= DATE '2013-01-01'
  AND buy_date < DATE '2013-02-01';

Avg Per Gal
-----------
       3.52
This result's more worrisome, isn't it? Your target is $3.50 per gallon. Your current, weighted mean is $3.52 per gallon. That bonus isn't so secure after all. Fortunately, you have time left in the month to act. Good thing you knew about weighted means. You may be able to salvage that bonus yet.

When Median and Mode are Better

Say that you have a friend planning a visit to Michigan's Upper Peninsula. Your friend plans to drive across the region, and asks you for some intelligence on fuel prices. Here is a situation in which the goal is not really to make the math work out accounting-style as in the examples so far. Instead, you're trying to distill a complex picture into just a few numbers that you and your friend can get a grip on. The arithmetic mean might not be your best choice here.
It may in fact be reasonable to provide the mean per-gallon fuel price. After all, gasoline is a competitive market and prices tend to cluster tightly around a center. There aren't likely to be many outliers. You could tell your friend:
"The mean fuel price in the region is $3.55 per gallon."
However, an even more meaningful statement is:
"Half the gas stations charge less than $3.50 per gallon, and half charge more."
Here you are giving what is known as the median: $3.50 per gallon in this case. This is useful information, and I plan a follow-up article soon on how to compute the median and apply it. Your friend can reasonably decide to treat $3.50 as the maximum he'll pay per gallon. Knowing that half the fuel stations charge less, and presuming a reasonably even geographic distribution, your friend should be able to make the trip and hold out for the less-expensive gas stations.
But wait! There's more. Perhaps you can provide the mode as well, also a planned future topic. The mode refers to the most commonly observed value. You investigate and find the fuel prices in your region to be bimodal. That means two prices are tied for being most commonly observed. The two prices are $3.46 and $3.52. Now you can paint a more detailed picture for your friend, and with just a very few numbers:
"Half the gas stations charge less than $3.50 per gallon, and half charge more."
"The two most common prices are $3.46 per gallon, and $3.52 per gallon."
With this information, your friend might decide to hold out for $3.46 per gallon or less. Again, that's presuming a not-too-uneven geographic distribution. (There is only so much you can convey, after all, with just three numbers).
Choosing the correct statistic isn't just about the data you're working with. You can't look at a data set and say arbitrarily that the median applies, or that the mean is a better choice. Choose your statistics based upon the intelligence you're trying to derive, the decisions you're hoping to make, the picture you want to paint.
Note: Median is in many cases more useful or desirable than the arithmetic mean. Yet SQL has a history of poor support for computing the median. (Oracle Database does currently support a MEDIAN function.) You'll find plenty of convoluted techniques in older books on SQL for computing the median, some that work well and others that work mostly, except for some corner cases. It's always been trivial to compute the mean, though, and sometimes the mean has been chosen only for that reason. It's the "when all you have is a hammer" effect.
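For the curious, here is roughly what that looks like today. The regional_fuel table below is hypothetical (there is no fuel-price table for this scenario in the example script), but MEDIAN and STATS_MODE are real Oracle aggregates. Note that STATS_MODE returns only a single value, so it can't by itself reveal that the prices are bimodal:
SELECT MEDIAN(price)     AS "Median Per Gal",
       STATS_MODE(price) AS "One Mode"
FROM regional_fuel;
-- MEDIAN needs no tricks; STATS_MODE picks just one mode, so listing both
-- modes of a bimodal distribution takes a GROUP BY and a ranking instead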

Circular Data

Watch out for circular number systems when computing the mean, or any other measure of central tendency. Look at Figure 4 showing several times clustered around midnight. A naive approach to computing the mean would put it at 12:02:15, twelve hours opposite where it ought to be. In this particular example, the mean as a measure of central tendency is correctly placed at 00:02:15, two minutes and 15 seconds past midnight.
Figure 4. Mean in a circular number system

Wall-clock time and compass direction are two examples of circular systems in which care is needed when computing the mean. Various methods exist. The United States Environmental Protection Agency employs the Yamartino method, developed in 1984 by Robert J. Yamartino, in studies involving wind direction. That's just the tip of the iceberg. At least one book, Statistical Analysis of Circular Data by N. I. Fisher, has been written on the topic.
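If you want to experiment, one common approach (simpler than the Yamartino method) treats each time as a point on a circle: convert it to an angle, average the sines and cosines, and convert the result back with ATAN2. The times below are illustrative, not the ones from Figure 4:
SELECT MOD(ROUND(ATAN2(AVG(SIN(theta)), AVG(COS(theta)))
                 * 86400 / (2 * ACOS(-1))) + 86400, 86400) AS "Circular Mean (secs)"
FROM (SELECT secs * 2 * ACOS(-1) / 86400 AS theta   -- seconds past midnight as an angle; ACOS(-1) is pi
      FROM (SELECT 85800 AS secs FROM DUAL UNION ALL   -- 23:50:00
            SELECT  1200 AS secs FROM DUAL));          -- 00:20:00
-- returns 300 (00:05:00); a naive AVG(secs) over the same rows gives 43500 (12:05:00)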

Non-Additive Data

The arithmetic mean is based around addition. Remember the daily sales example early in this article? Each day of the month you sell a certain amount, and you add those amounts together for a monthly total. For example:
14530 + 14230 + 15830 ... 23134 + 22783 + 24323
Two other common patterns you'll encounter involve rates and products. It's useful to know how to spot them so that you don't apply AVG when it shouldn't be applied.
Rates often involve the word "per", as in "miles per gallon" and "kilometers per hour". Imagine if I made the same 60 km trip twice in the same day. First I drove it at 60 kph. Then I returned home at a slower 40 kph. Here's how that looks:
60km at 60kph + 60km at 40kph
This is not the simple, additive pattern to which arithmetic mean applies. The correct mean speed is not 50 kph like you might expect. It's actually 48 kph, because you spend more time (more hours) driving the slower trip. Another type of mean, the harmonic mean, will give the correct answer and should generally be applied to situations like this involving rates. The math behind the harmonic mean, and especially why that math works, is really interesting and I plan a future article on the topic.
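In SQL the fix is to weight by distance and time rather than averaging the rates directly: divide total distance by total time. Here is a small sketch of the two legs above (for equal distances this works out to the harmonic mean of the two speeds):
SELECT SUM(km) / SUM(km / kph) AS "Mean kph"   -- total distance / total hours
FROM (SELECT 60 AS km, 60 AS kph FROM DUAL UNION ALL
      SELECT 60 AS km, 40 AS kph FROM DUAL);
-- returns 48; a plain AVG(kph) would give the misleading 50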
The other commonly encountered pattern involves products – values multiplied together instead of being added. Remember during the runup to the year 2000 how all of us in the computing field would get big raises each year as the industry frantically worked to save the world from six-digit date meltdown? Those were good times. Imagine three yearly raises in pay of 10%, 20%, and 30%. Let your starting salary be N, and here's what you have:
 N * 1.10 * 1.20 * 1.30
Once again, this is not an additive sequence. Rather, it's a series of products. The mean pay raise is not 20%. The geometric mean applies here, and the mean increase in pay works out to 19.72% per year. You can prove that by working through some math:
N * 1.10 * 1.20 * 1.30 = N * 1.1972 * 1.1972 * 1.1972
1.10 * 1.20 * 1.30 = 1.7160, 
and 1.1972 * 1.1972 * 1.1972 = 1.7159,
and finally...N * 1.7160 ≈ N * 1.7159
Rounding accounts for the difference of 0.0001 in the result. Geometric mean is also a topic I plan for a future article.
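If you'd like to see the geometric mean in SQL without waiting for that article, one well-known approach averages in logarithm space: take LN of each factor, AVG the logs, then EXP the result back out. A sketch using the three raises:
SELECT ROUND(EXP(AVG(LN(factor))), 4) AS "Geometric Mean"
FROM (SELECT 1.10 AS factor FROM DUAL UNION ALL
      SELECT 1.20 FROM DUAL UNION ALL
      SELECT 1.30 FROM DUAL);
-- returns 1.1972, matching the 19.72% mean yearly raise worked out above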

The Sum of it All

Choose your average carefully depending upon what it is you're trying to convey. Arithmetic mean is ideal when you need the math to work out accounting-style, as in the daily sales and fuel cost examples. Watch for non-additive cases that might require other types of mean, such as the harmonic and geometric.
Arithmetic mean is prone to what is termed skew. A small number of outlying values can give a false impression of central tendency. Figure 1 hints at how this may be so. A future article on standard deviation will delve into this topic more closely.
The potential for skew is one reason the median is often favored over the mean. This is why the U.S. Census Bureau tracks and reports a median annual household income rather than a mean income. Median is often more appropriate than the mean in cases where you are working to promote understanding by painting a picture of the current state of things.
Finally, it isn't possible to look at a data set and declare authoritatively that just one statistic applies, or that only one type of average should be used. It may be perfectly reasonable to look at multiple statistics together, such as mean and mode in the example of an automobile trip across the Upper Peninsula. The bottom line is to choose the statistic that best conveys the message, and that best helps you manage to your desired result.
