/*
Psuedo Median Aggregrate
In an item I previously blogged, YAMC, I showed how to use the ROW_NUMBER() function in SQL Server 2005 to do a pretty efficient median calculation. Being able to calculate the median in an expression can be useful, but a median aggregate function would be much more useful.
Well T-SQL, even in SQL Server 2005, doesn't provide a median aggregate function and the is no way to write one using user defined aggregates written in a CLR language either.
Remember that an aggregate is set in, scalar out. YAMC was an aggregate calculation, but not an aggregate function. One of the useful things about aggregate function, like SUM(), is that they can be used in conjunction with GROUP BY phrases, so you can get an aggregate per some kind of value.
We are going to look at how to extend the YAMC to use a GROUP BY phrase so that it can be used, efficiently, to produce the same results as a MEDIAN() aggregate function would.
We will use the sales table that follows as example data. A script that will initialize it with some sample data is at the end of this blog item.
*/
Create table Sales
(
PONumber int Primary Key,
Value Money,
State NCHAR(2)
)
/*
A simple aggregate calculation could be the SUM of the values by state.
*/
SELECT State, SUM(value) FROM Sales GROUP BY State
/*
Here is a little bit of the output...
NB 891760.7953
WI 1001671.6347
PA 962849.0728
IL 1055766.3988
What we would like to do thougth is:
SELECT State, MEDIAN(Value) FROM Sales GROUP BY State
but that won't work because T-SQL has no MEDIAN aggregate function.
We could calculate the median of the whole table or a specific state using YAMC by modifying it a bit to just use rows from the state we are interest in.
*/
with positions
as
(
select (1 + Count(*)) / 2 as mid,
1-(count(*) % 2) as even
from Sales where State = 'NB'
),
rows
as
(
select value,
ROW_NUMBER() over (order by value) as rn
from Sales where State = 'NB'
)
select 'NB' as State, AVG(value) from rows JOIN positions on
rn in (positions.mid, positions.mid + positions.even)
/*
And we get...
NB 4218.464
It turns out ROW_NUMBER() has a friend named PARTITION BY, whose purpose in spirit is the same as GROUP BY. In fact by using PARTITION BY and GROUP BY together we can do an aggregate calculation grouped by a value, just as you can using the built in aggregate functions.
Here is a "psuedo" median aggregate that calculates the median sales value by state.
*/
-- psuedo median aggregate
WITH
positions
AS
(
SELECT (1 + count(*)) / 2 AS mid,
1-(count(*) % 2) AS even,
state
FROM Sales GROUP BY state
),
rows AS
(
SELECT value, state, row_number() OVER
(PARTITION BY state order by value) AS rn FROM sales
)
SELECT rows.state, AVG(rows.value) AS median
FROM rows JOIN positions
ON rows.state = positions.state
AND rows.rn IN
(positions.mid, positions.mid + positions.even)
GROUP BY rows.state
/*
This explaination assumes you have already read the YAMC blog item.
Similar as in YAMC, positions calculates the mid and even values, but this time groups the values by state.
Again, similar to YAMC, rows associates a row number with each value, but this time partitions the row numbers by state. This means that each state will have its own set of row numbers starting with one.
The SELECT that follows the CTE's is a bit more complex than the one in YAMC. It joins together positions and rows just as YAMC did. However it adds to the predicate a clause that makes the row and positions have the same state. And at the end it groups the AVG() aggregate by state.
Here is a little bit of what it produces...
AK 5158.6132
AL 5222.3433
AR 5347.351
AZ 5190.8084
CA 5015.9836
NB 4218.464
It would be much easier, and a lot more flexible, if T-SQL just gave us a MEDIAN() aggregate function, but for now at least we can do a real Median grouped by value if we want to.
Since this idiom accomplises pretty much what an aggregate function does, I wonder if the SQL team could cook up a syntax that used PARTITION BY and GROUP BY so that we could write our own user defined aggregates in T-SQL? One can dream...
*/
-- genrates test data
DECLARE @index int
SET @index = 10000
WHILE @index > 0
BEGIN
DECLARE @state NVARCHAR(2)
DECLARE @rand float
SET @rand =RAND() * 50
SET @state = CASE
WHEN @rand < 1 THEN N'AL'
WHEN @rand < 2 THEN N'AK'
WHEN @rand < 3 THEN N'AZ'
WHEN @rand < 4 THEN N'CA'
WHEN @rand < 5 THEN N'CO'
WHEN @rand < 6 THEN N'CT'
WHEN @rand < 7 THEN N'DE'
WHEN @rand < 8 THEN N'FL'
WHEN @rand < 9 THEN N'GA'
WHEN @rand < 10 THEN N'HI'
WHEN @rand < 11 THEN N'ID'
WHEN @rand < 12 THEN N'IL'
WHEN @rand < 13 THEN N'IA'
WHEN @rand < 14 THEN N'IN'
WHEN @rand < 15 THEN N'KS'
WHEN @rand < 16 THEN N'KY'
WHEN @rand < 17 THEN N'LA'
WHEN @rand < 18 THEN N'MA'
WHEN @rand < 19 THEN N'MD'
WHEN @rand < 20 THEN N'MN'
WHEN @rand < 21 THEN N'MI'
WHEN @rand < 22 THEN N'MS'
WHEN @rand < 23 THEN N'MO'
WHEN @rand < 24 THEN N'MT'
WHEN @rand < 25 THEN N'NB'
WHEN @rand < 26 THEN N'NV'
WHEN @rand < 27 THEN N'NH'
WHEN @rand < 28 THEN N'NJ'
WHEN @rand < 29 THEN N'NM'
WHEN @rand < 30 THEN N'NY'
WHEN @rand < 31 THEN N'NC'
WHEN @rand < 32 THEN N'ND'
WHEN @rand < 33 THEN N'OH'
WHEN @rand < 34 THEN N'OK'
WHEN @rand < 35 THEN N'OR'
WHEN @rand < 36 THEN N'PA'
WHEN @rand < 37 THEN N'RI'
WHEN @rand < 38 THEN N'SC'
WHEN @rand < 39 THEN N'SD'
WHEN @rand < 40 THEN N'TN'
WHEN @rand < 41 THEN N'TX'
WHEN @rand < 42 THEN N'UT'
WHEN @rand < 43 THEN N'VT'
WHEN @rand < 44 THEN N'VA'
WHEN @rand < 45 THEN N'WA'
WHEN @rand < 46 THEN N'WV'
WHEN @rand < 47 THEN N'WI'
WHEN @rand < 48 THEN N'WY'
WHEN @rand < 49 THEN N'AR'
WHEN @rand < 50 THEN N'AL'
END
DECLARE @value MONEY
SET @value = RAND() * 10000
INSERT INTO SALES VALUES (@index, @value, @state)
SET @index = @index - 1
END
Posted
Sep 29 2005, 08:26 AM
by
dan-sullivan