How much space does xml cost in SQL Server 2005? This question seems to come up a lot because xml has a lot of meta data in it and to top it off SQL Server saves the xml data type using UTF-16 encoding. This is, at least for xml that is mostly ascii type characters, up twice as much space as the encoding you typically see for xml, UTF-8. So if you just want to save xml in SQL Server then return it later you're better off storing as VARCHAR(MAX) or maybe VARBINARY(MAX), right? Maybe not, but that is what this article is going to look at...
Even before we look at the space issues involved I'd like to mention that I think it is really being "too clever by half" to store xml as anything other than xml. Imagine if someone gave you a CSV file that looked like:
pipe, 100
rope, 250
and and told you the first column was the name of an item and the second was the price in whole dollars, then asked you to make a table and import it into SQL Server. You could make a table that looked like:
CREATE TABLE Stuff
(
name VARCHAR(MAX),
price VARCHAR(MAX)
)
Later when people when to use the table they could issue a query like this:
SELECT name, CAST(price AS INT) FROM Stuff
Of couse if one of the files you imported had a line like:
hose, XVII
the query would get an error at runtime when it ran. The problem here was that the data was stored as something that it wasn't. Storing xml as a string or binary is just like store an INT as text. When you insert something into an xml column SQL Server makes sure that is really is xml or rejects it, so you will not have any errors if you happen to make use of some of the xml functionality SQL Server 2005 provides. Even if you don't do make any use of the xml processing functionality in SQL Server, the client that reads the xml as text you've stored will get the error. It's usually best to catch errors as early as possible, especially if the client thinks your server made the error :-).
OK, now lets look at how much space things really take up. To do this we are going to make four tables that store xml each in a different way, fill them up with a bunch of xml then see how big they are. You can down load the code for trying these experiments yourself from http://www.pluralsight.com/dan/samples/whatcostxml.zip.
-- save as binary
CREATE TABLE XmlBinSizeTest
(
data VARBINARY(MAX)
)
--save as varchar
CREATE TABLE XmlTextSizeTest
(
data VARCHAR(MAX)
)
--save save as xml with a schema
CREATE TABLE XmlSchemaSizeTest
(
data xml(XmlSizeSchema)
)
-- save as xml without a schema
CREATE TABLE XmlSizeTest
(
data xml
)
Now we want to fill these tables up with a big walop of xml. When I need some static string or xml for testing I usually wrap it in a function so I don't have to put the literal text in my SQL expressions.
CREATE
--ALTER
FUNCTION XmlSizeTestData4()
RETURNS VARCHAR(MAX)
AS
BEGIN
RETURN
'http://www.w3.org/2001/XMLSchema-instance">
guy1@adventure-works.com" EmailPromotion="0" Phone="320-555-0195" PasswordHash="1196EB7D5425B281CACDCDD2F60F52D9689D9E49" PasswordSalt="Lanmhoo=" rowguid="D366A33A-8EDE-42BD-BF79-3E7FB9713FE1" ModifiedDate="1996-07-24T00:00:00"/>
-- a few hundred more E's here -->
'
END
Then a little batch to fill the tables:
DECLARE @t VARCHAR(MAX)
SET @t = dbo.XmlSizeTestData4();
DECLARE @index INT
SET @index = 100
WHILE @index > 0
BEGIN
SET @index = @index - 1
INSERT INTO XmlTextSizeTest VALUES (@t);
INSERT INTO XmlSizeTest VALUES (@t);
INSERT INTO XmlSchemaSizeTest VALUES (@t);
INSERT INTO XmlBinSizeTest VALUES (CAST(@t AS VARBINARY(MAX)));
END
We can use sys.allocation_units to find out how much space the tables take up. It takes a few joins to make this work off of table names...
SELECT o.name AS table_name, au.type_desc, au.used_pages
FROM sys.allocation_units AS au
JOIN sys.partitions AS p ON au.container_id = p.partition_id
JOIN sys.objects AS o ON p.object_id = o.object_id
WHERE o.name in (N'XmlTextSizeTest',
N'XmlSizeTest',
N'XmlSchemaSizeTest',
N'XmlBinSizeTest')
In SQL Server 2005 large objects, i.e. things like xml and VARCHAR(MAX) overflow in to supplementary pages outside of the pages for the table itself. This query tells us about both the pages used in the table and the extra overflow pages. Note that DBCC CHECKTABLE won't give you the right answer, because it won't tell you about the overflow pages. Running this query produces:
table name type_description used pages
XmlBinSizeTest IN_ROW_DATA 2
XmlBinSizeTest LOB_DATA 2802
XmlTextSizeTest IN_ROW_DATA 2
XmlTextSizeTest LOB_DATA 2802
XmlSchemaSizeTest IN_ROW_DATA 2
XmlSchemaSizeTest LOB_DATA 4226
XmlSizeTest IN_ROW_DATA 2
XmlSizeTest LOB_DATA 2907
The numbers tell an interesting story. First of all no suprise, storing text as binary takes the same amount of space as storing it as VARCHAR(MAX). However storing xml without a schema takes only about 5% more space than storing it as VARCHAR(MAX) even though the xml is being stored as UTF-16, not single byte characters as VARCHAR does. How can this be?
Let's use different xml data that really exacerbates the difference between storing xml as text and storing xml as xml. Here is another function that cobbles up some xml on the fly:
CREATE
FUNCTION XmlTestData1()
RETURNS VARCHAR(MAX)
AS
BEGIN
DECLARE @t VARCHAR(MAX)
SET @t = CAST('+ REPLICATE(CAST(' ' AS VARCHAR(MAX)), 10000)
+ '>abc'
RETURN @t
END
It's basically inserting a lot of ignorable space into the the xml. A fragment of the xml it generates looks like:
Actually it has a lot more spaces than that. To try this next you should drop all the tables and recreate them, then use the batch that fill the tables, but change the source of the xml to dbo.XmlTestData1(). After you run that batch that finds the sizes you see something like:
table name type_description used pages
XmlBinSizeTest IN_ROW_DATA 2
XmlBinSizeTest LOB_DATA 252
XmlTextSizeTest IN_ROW_DATA 2
XmlTextSizeTest LOB_DATA 252
XmlSchemaSizeTest IN_ROW_DATA 3
XmlSchemaSizeTest LOB_DATA 0
XmlSizeTest IN_ROW_DATA 3
XmlSizeTest LOB_DATA 0
It looks like saving this xml as text or binary takes up about two orders of magnitude more space than just saving it as xml!
When SQL Server saves xml it need not save it as literal text, and it doen't. It doesn't have to save the the "pointies" and it doesn't have to save whitespace that doesn't matter. And most of the the time the whitespace between consecutive opening tags doesn't matter. In other words SQL Server 2005 goes out of its way to efficiently store xml.
So it not a slam dunk to figure out which way of storing xml takes up the least amount of space. The point here isn't that xml will always take about the same or less as a corresponding VARCHAR(MAX), because sometimes the VARCHAR(MAX) or VARBINARY(MAX) will take up less space. But you will probably find that the difference isn't as great as you might suspect. In any case this is a database we are talking about, you should store xml as xml.
Dan
Posted
Aug 06 2006, 03:36 PM
by
dan-sullivan