Psuedo Median Aggregate

/*

Psuedo Median Aggregrate

 

In an item I previously blogged, YAMC, I showed how to use the ROW_NUMBER() function in SQL Server 2005 to do a pretty efficient median calculation. Being able to calculate the median in an expression can be useful, but a median aggregate function would be much more useful.

 

Well T-SQL, even in SQL Server 2005, doesn't provide a median aggregate function and the is no way to write one using user defined aggregates written in a CLR language either.

 

Remember that an aggregate is set in, scalar out. YAMC was an aggregate calculation, but not an aggregate function. One of the useful things about aggregate function, like SUM(), is that they can be used in conjunction with GROUP BY phrases, so you can get an aggregate per some kind of value.

 

We are going to look at how to extend the YAMC to use a GROUP BY phrase so that it can be used, efficiently, to produce the same results as a MEDIAN() aggregate function would.

 

We will use the sales table that follows as example data. A script that will initialize it with some sample data is at the end of this blog item.

*/

 

Create table Sales

(

PONumber int Primary Key,

Value Money,

State NCHAR(2)

)

/*

A simple aggregate calculation could be the SUM of the values by state.

*/

 

SELECT State, SUM(value) FROM Sales GROUP BY State

 

/*

Here is a little bit of the output...

 

NB 891760.7953

WI 1001671.6347

PA 962849.0728

IL 1055766.3988

 

What we would like to do thougth is:

 

SELECT State, MEDIAN(Value) FROM Sales GROUP BY State

 

but that won't work because T-SQL has no  MEDIAN aggregate function.

 

We could calculate the median of the whole table or a specific state using YAMC by modifying it a bit to just use rows from the state we are interest in.

 

*/

 

with positions

as

(

select (1 + Count(*)) / 2 as mid,

1-(count(*) % 2) as even

from Sales where State = 'NB'

),

rows

as

(

select value,

ROW_NUMBER() over (order by value) as rn

from Sales where State = 'NB'

)

select 'NB' as State, AVG(value) from rows JOIN positions on

rn in (positions.mid, positions.mid + positions.even)

/*

 

And we get...

 

NB 4218.464

 

It turns out ROW_NUMBER() has a friend named PARTITION BY, whose purpose in spirit is the same as GROUP BY. In fact by using PARTITION  BY and GROUP BY together we can do an aggregate calculation grouped by a value, just as you can using the built in aggregate functions.

 

Here is a "psuedo" median aggregate that calculates the median sales value by state.

 

*/

 

-- psuedo median aggregate

WITH

positions

AS

(

SELECT (1 + count(*)) / 2 AS mid,

1-(count(*) % 2) AS even,

state

FROM Sales GROUP BY state

),

rows AS

(

SELECT value, state, row_number() OVER

(PARTITION BY state order by value) AS rn FROM sales

)

SELECT rows.state, AVG(rows.value) AS median

FROM rows JOIN positions

ON rows.state = positions.state

AND rows.rn IN

(positions.mid, positions.mid + positions.even)

GROUP BY rows.state

/*

This explaination assumes you have already read the YAMC blog item.

 

Similar as in YAMC, positions calculates the mid and even values, but this time groups the values by state.

 

Again, similar to YAMC, rows associates a row number with each value, but this time partitions the row numbers by state. This means that each state will have its own set of row numbers starting with one.

 

The SELECT that follows the CTE's is a bit more complex than the one in YAMC. It joins together positions and rows just as YAMC did. However it adds to the predicate a clause that makes the row and positions have the same state. And at the end it groups the AVG() aggregate by state.

 

Here is a little bit of what it produces...

 

AK 5158.6132

AL 5222.3433

AR 5347.351

AZ 5190.8084

CA 5015.9836

NB 4218.464

 

It would be much easier, and a lot more flexible, if T-SQL just gave us a MEDIAN() aggregate function, but for now at least we can do a real Median grouped by value if we want to.

 

 

Since this idiom accomplises pretty much what an aggregate function does, I wonder if the SQL team could cook up a syntax that used PARTITION BY and GROUP BY so that we could write our own user defined aggregates in T-SQL? One can dream...

 

*/

 

-- genrates test data

DECLARE @index int

SET @index = 10000

WHILE @index > 0

BEGIN

DECLARE @state NVARCHAR(2)

DECLARE @rand float

SET @rand =RAND() * 50

SET @state = CASE

WHEN @rand < 1 THEN N'AL'

WHEN @rand < 2 THEN N'AK'

WHEN @rand < 3 THEN N'AZ'

WHEN @rand < 4 THEN N'CA'

WHEN @rand < 5 THEN N'CO'

WHEN @rand < 6 THEN N'CT'

WHEN @rand < 7 THEN N'DE'

WHEN @rand < 8 THEN N'FL'

WHEN @rand < 9 THEN N'GA'

WHEN @rand < 10 THEN N'HI'

WHEN @rand < 11 THEN N'ID'

WHEN @rand < 12 THEN N'IL'

WHEN @rand < 13 THEN N'IA'

WHEN @rand < 14 THEN N'IN'

WHEN @rand < 15 THEN N'KS'

WHEN @rand < 16 THEN N'KY'

WHEN @rand < 17 THEN N'LA'

WHEN @rand < 18 THEN N'MA'

WHEN @rand < 19 THEN N'MD'

WHEN @rand < 20 THEN N'MN'

WHEN @rand < 21 THEN N'MI'

WHEN @rand < 22 THEN N'MS'

WHEN @rand < 23 THEN N'MO'

WHEN @rand < 24 THEN N'MT'

WHEN @rand < 25 THEN N'NB'

WHEN @rand < 26 THEN N'NV'

WHEN @rand < 27 THEN N'NH'

WHEN @rand < 28 THEN N'NJ'

WHEN @rand < 29 THEN N'NM'

WHEN @rand < 30 THEN N'NY'

WHEN @rand < 31 THEN N'NC'

WHEN @rand < 32 THEN N'ND'

WHEN @rand < 33 THEN N'OH'

WHEN @rand < 34 THEN N'OK'

WHEN @rand < 35 THEN N'OR'

WHEN @rand < 36 THEN N'PA'

WHEN @rand < 37 THEN N'RI'

WHEN @rand < 38 THEN N'SC'

WHEN @rand < 39 THEN N'SD'

WHEN @rand < 40 THEN N'TN'

WHEN @rand < 41 THEN N'TX'

WHEN @rand < 42 THEN N'UT'

WHEN @rand < 43 THEN N'VT'

WHEN @rand < 44 THEN N'VA'

WHEN @rand < 45 THEN N'WA'

WHEN @rand < 46 THEN N'WV'

WHEN @rand < 47 THEN N'WI'

WHEN @rand < 48 THEN N'WY'

WHEN @rand < 49 THEN N'AR'

WHEN @rand < 50 THEN N'AL'

END

DECLARE @value MONEY

SET @value = RAND() * 10000

INSERT INTO SALES VALUES (@index, @value, @state)

SET @index = @index - 1

END

 


Posted Sep 29 2005, 08:26 AM by dan-sullivan
Filed under:

Comments

Security Briefs wrote Welcome Dan Sullivan!
on 09-29-2005 7:34 AM
Megan wrote re: Psuedo Median Aggregate
on 03-10-2006 3:18 PM
Dan-
This is great - for when I finally get SQL 2005, any suggestions for SQL 2000?
-Megan
Deferred Processing wrote CLR Based Histogram Functions and SQL Server 2005
on 08-26-2006 9:54 AM

Add a Comment

(required)  
(optional)
(required)  
Remember Me?