There are a couple of powerful technologies for processing native XML, XPath and XSLT. People often avoid processing native XML but instead convert the XML to an object model in a language they are used to and do “conventional” programming on that model. Even PowerShell itself does this with its fairly straightforward dotted syntax for accessing parts of an XML document and of course .NET, web service technologies, and SQL Server have their own ways to morph XML into a familiar object model.
There are probably a number of reasons for this, not the least of which is syntactic comfort… with some practice you actually can drive nails with a screwdriver and then you only need to learn how to use one tool to build a house. XSLT itself is often criticized as being too verbose but that is not really the case. And lastly the programming models for XPath and XSLT are different from those used in languages like C# or VB.NET; they are much more like SQL in that you don’t actually write a program but instead write a set of rules and throw data at them.
However if you are going to bump into XML in your travels, and you can be pretty sure that you will, it is really worth your while to become comfortable with at least the basics of XPath and XSLT because that knowledge will make a lot of programming jobs a lot easier. Let’s take a look at a simple example to see this. Here is a grocery list, XML-style, in the file groceries.xml
<GroceryList>
<Item>
<Dept>Produce</Dept><Name>Orange</Name><Price>3.20</Price>
</Item>
<Item>
<Dept>Meat</Dept><Name>Steak</Name><Price>13.20</Price>
</Item>
<Item>
<Dept>Produce</Dept><Name>Lettuce</Name><Price>1.34</Price>
</Item>
<Item>
<Dept>Meat</Dept><Name>Ham</Name><Price>11.41</Price>
</Item>
</GroceryList>
We can calculate the total of all of the groceries using the PowerShell object model of XML with the following script:
PS C:\Demos> [xml]$list = get-content .\groceries.xml
PS C:\Demos> $list.GroceryList.Item | &{begin {$sum=0}
process{$sum += $_.Price} end {$sum}}
29.15
PS C:\Demos>
There are other ways to do this in PowerShell, but all involve iterating through the items to produce a sum. It turns out there is a simple XPath expression that calculates the sum of the prices of the items in the list:
sum(GroceryList/Item/Price)
In fact it would be kind of nice if we had a way to “execute” an XPath expression easily in PowerShell. How about this?
PS C:\Demos> xeval groceries.xml "sum(GroceryList/Item/Price)"
29.15
PS C:\Demos>
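For readers who want to try the expression before loading XSLT.ps1, here is a minimal sketch that evaluates the same XPath with the .NET XPathNavigator directly. It is not the xeval implementation (that comes later in the article); the grocery list is pasted inline as an assumption so the snippet needs no file on disk.

```powershell
# Inline copy of the grocery list from groceries.xml, so no file is needed.
[xml]$doc = @'
<GroceryList>
  <Item><Dept>Produce</Dept><Name>Orange</Name><Price>3.20</Price></Item>
  <Item><Dept>Meat</Dept><Name>Steak</Name><Price>13.20</Price></Item>
  <Item><Dept>Produce</Dept><Name>Lettuce</Name><Price>1.34</Price></Item>
  <Item><Dept>Meat</Dept><Name>Ham</Name><Price>11.41</Price></Item>
</GroceryList>
'@

# CreateNavigator() returns an XPathNavigator positioned at the document root.
$nav = $doc.CreateNavigator()

# Evaluate() returns a [double] for numeric XPath expressions such as sum() and count().
$total = $nav.Evaluate('sum(GroceryList/Item/Price)')
$count = $nav.Evaluate('count(GroceryList/Item)')

# Format the two results; N2 rounds the total to two decimal places for display.
"{0} items, total {1:N2}" -f $count, $total
```

No loop over the items appears anywhere in the script; the iteration is entirely inside the XPath expression.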
This blog article is about processing XML using PowerShell and the typical sorts of things you run into when you do this. It uses some extension functions, xeval and xnav are their aliases, to do this processing. The xeval function is used to process an XML file using XPath expressions. The xnav function is used to turn literal XML into an XPathNavigator. Later blog articles will cover other ways to process XML using PowerShell.
A script to build these functions and their associated aliases is in a file named XSLT.ps1. This file and the examples in this blog article are available at http://www.pluralsight.com/dan/samples/ProcessingXMLPowershell.zip. These extension functions are not really any harder to use than the XML support built into PowerShell but are quite a bit more capable in what they can accomplish. After we look at using these extension functions we will look inside of XSLT.ps1 and see how it works.
The XSLT.ps1 file actually has some other extension functions that are not discussed in this blog article but will be in a future one.
In the first example of using xeval we just looked at, the first argument to the xeval function is the file path for the XML file you want to process. The second argument is the XPath expression you want evaluated. Of course to make good use of xeval you will have to be familiar with XPath. XPath is a W3C recommendation and is at http://www.w3.org/TR/xpath.
The XPath recommendation is certainly worth reading and contains many examples of XPath expressions. Another good source to have at your side is “Essential XML Quick Reference” published by Addison-Wesley and written by Aaron Skonnard and Martin Gudgin.
Let’s start by looking at one of the issues you run into when working with XML. XML is often treated as though it were text and that is how PowerShell treats it. But XML is not plain ol’ text and the following examples will show that. We have another version of our xml grocery list in a file named GroceriesUC.xml. Let’s use our PowerShell script to process it.
PS C:\Demos> [xml]$list = get-content groceriesuc.xml
Cannot convert value "System.Object[]" to type
"System.Xml.XmlDocument".
Error: "Root element is missing."
At line:1 char:11
+ [xml]$list <<<< = get-content GroceriesUC.xml
PS C:\Demos>
Hmmmm, that generated an error. What’s going on here?
We often think of files as containing text, that is, the characters we see on the printed page. But files don’t contain text; they are just a sequence of bytes. When someone gives you a “text” file you must know how that text was encoded into a sequence of bytes in order to be able to read it. PowerShell gives us a little help here in that the get-content cmdlet lets you specify the encoding of the file if you know it, or “unknown” if you don’t. Well, we don’t know the encoding of the file so let’s tell PowerShell that the encoding is unknown and see what happens.
PS C:\Demos> [xml]$list = get-content GroceriesUC.xml -encoding unknown
Cannot convert value "????????????????????????????????????????????
?????????????????????????????????????????????????????????????????????
to type "System.Xml.XmlDocument". Error:
"Data at the root level is invalid. Line 1, position 1."
At line:1 char:11
+ [xml]$list <<<< = get-content GroceriesUC.xml -encoding unknown
PS C:\Demos>
Looks like we are out of luck here too. It turns out the encoding of the file is UTF-16BE. That’s a standard encoding used for XML files in which each 16-bit unit is stored with its high-order byte first. You might see it in XML that is generated on processors that are not Intel-compatible. Now that we know the actual encoding we can pass that information on to PowerShell.
PS C:\Demos> [xml]$list = get-content GroceriesUC.xml
-encoding BigEndianUnicode
PS C:\Demos> $list.GroceryList.Item |
&{begin {$sum=0} process{$sum += $_.Price} end {$sum}}
29.15
PS C:\Demos>
Bottom line is, when it comes to text, unless you know the actual encoding you can’t depend on being able to read it. Earlier we said that XML wasn’t really text. To see what this means let’s try that xeval function again on the GroceriesUC.xml file.
PS C:\Demos> xeval GroceriesUC.xml "sum(GroceryList/Item/Price)"
29.15
PS C:\Demos>
It works just fine and we don’t have to tell it what the encoding is. The reason is that every XML processor, i.e. a support library for XML such as the one in .NET that xeval uses, is required to be able to figure out the encoding used in an XML file unambiguously, without any outside help. This is thought by many to be the key feature of XML and certainly is one of the reasons for its wide use today. It works so well that most people don’t even know it is a feature!
The built-in processing in PowerShell using get-content makes for a non-compliant XML processor. In some cases this isn’t that important, but you should keep in mind that in the general case it is not useful for processing XML. If you want to know the details of how this “self-encoding” in XML works there is an explanation of it in Appendix F of the W3C Extensible Markup Language (XML) recommendation at http://www.w3.org/TR/xml/.
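To see the self-describing encoding at work without the xeval helper, the sketch below writes a small document as UTF-16BE and lets the .NET XML parser load it from the raw bytes. The temporary file name and the two-item list are assumptions made up for the illustration; they are not files from the sample download.

```powershell
# A tiny stand-in for GroceriesUC.xml (hypothetical content, two items only).
$xml = '<GroceryList><Item><Price>3.20</Price></Item><Item><Price>13.20</Price></Item></GroceryList>'

# Write the text as UTF-16BE. WriteAllText also emits the byte order mark (FE FF).
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'GroceriesUC-demo.xml'
[System.IO.File]::WriteAllText($path, $xml, [System.Text.Encoding]::BigEndianUnicode)

# XmlDocument.Load(string) reads the bytes itself and works out the encoding;
# nothing here tells it that the file is big-endian UTF-16.
$doc = New-Object System.Xml.XmlDocument
$doc.Load($path)

$total = $doc.CreateNavigator().Evaluate('sum(GroceryList/Item/Price)')
$total

# Clean up the temporary file.
Remove-Item $path
```

The important difference from the get-content approach is that the parser, not PowerShell, turns the bytes into characters.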
There is another issue that comes up when you deal with XML, namespaces. There are some who feel that namespaces are an unnecessary complication to XML, but they are important enough to have their own specification, Namespaces in XML which is at http://www.w3.org/TR/xml-names/. For those with an interest in such things the Extensible Markup Language XML is really just a grammar with a little over 80 productions with lots of comments in it, and Namespaces in XML just adds a few productions to that grammar. Regardless of how you feel about namespaces you will have to deal with them. Here is a different version of our grocery list. It is in the file named GroceriesNS.xml.
<GroceryList xmlns="urn:foo"
xmlns:r="urn:retail"
xmlns:w="urn:wholesale" >
<Stock>
<Dept>Produce</Dept><Name>Orange</Name>
<w:Price>3.20</w:Price><r:Price>4.20</r:Price>
</Stock>
<Stock>
<Dept>Meat</Dept><Name>Steak</Name>
<r:Price >14.20</r:Price><w:Price >13.20</w:Price>
</Stock>
<Stock>
<Dept>Produce</Dept><Name>Lettuce</Name>
<w:Price>1.34</w:Price><r:Price>2.34</r:Price>
</Stock>
<Stock>
<Dept>Meat</Dept><Name>Ham</Name>
<w:Price>11.41</w:Price><r:Price>14.41</r:Price>
</Stock>
</GroceryList>
This file is different from the Groceries.xml file in three ways. One is that it uses namespaces. Another is that it contains both a wholesale and a retail price for each item. It also uses Stock elements instead of Item elements; we will see why in a second. The Price elements are distinguished by their namespace; the ones prefixed with “r” are retail prices. The prices in the r:Price elements are the same as the corresponding Price elements in the Groceries.xml file. Let’s use PowerShell’s object model of XML to calculate the sum of the retail prices. PowerShell sees two Price elements under the Stock element, so it makes an array out of them. We will have to pick which one to sum up.
PS C:\Demos> [xml]$list = get-content GroceriesNS.xml
PS C:\Demos> $list.GroceryList.Stock |
&{begin {$sum=0} process{$sum += $_.Price[1]} end {$sum}}
30.15
PS C:\Demos>
The retail prices in the GroceriesNS.xml file are the same as the unqualified prices in the Groceries.xml file so we should get the same answer as before, but we don’t. The problem we have run into is that Price elements are distinguished only by their namespace and not by their position in the file. Note that in the second Stock element in the file the wholesale price comes after the retail price. So we have to make sure that we pick the correct price element.
To distinguish a Price element we have to use a ParameterizedProperty named Item that PowerShell adds to an XML element. In many cases you will find it difficult to process XML using the PowerShell object model if the XML contains any Item elements because PowerShell uses this name for the ParameterizedProperty it adds to XML elements. This is why we changed the name of the Item element to Stock. If we had not made this change we would not have been able to process this XML file using the PowerShell object model of XML.
In any case the Item property allows us to specify both the name and the namespace of the element we want.
PS C:\Demos> [xml]$list = get-content GroceriesNS.xml
PS C:\Demos> $list.GroceryList.Stock |
%{$_.Item("Price", "urn:retail")} |
&{begin {$sum = 0} process {$sum += $_.get_InnerText()} end {$sum}}
29.15
PS C:\Demos>
Now we get the 29.15 just as we did when we processed the Groceries.xml file.
Now let’s do the same thing using the xeval function.
PS C:\Demos> xeval GroceriesNS.xml
"sum(a:GroceryList/a:Stock/r:Price)" @{r="urn:retail";a="urn:foo"}
29.15
PS C:\Demos>
In this example the xeval function has a third argument that is a dictionary that maps the prefixes used in the XPath expression to the namespaces they represent. You can see that GroceryList and Stock end up in the “urn:foo” namespace because of the “a” prefix and likewise Price ends up in the “urn:retail” namespace. There is no requirement that the prefix used in an XPath expression be the same as the one in the source XML file being processed; the key thing is that it specifies the proper namespace. Note that in the GroceriesNS.xml file the GroceryList and Stock elements had no prefix but that the default namespace for the file was “urn:foo”.
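The prefix-to-namespace mapping can be tried out directly with an XmlNamespaceManager. This is a sketch under stated assumptions: the two-item document and its prices are invented for the illustration, and the prefixes “g” and “shop” are deliberately different from anything in the XML to show that only the namespace URI has to match.

```powershell
# Hypothetical two-item list in the same shape as GroceriesNS.xml.
[xml]$doc = @'
<GroceryList xmlns="urn:foo" xmlns:r="urn:retail" xmlns:w="urn:wholesale">
  <Stock><w:Price>3.00</w:Price><r:Price>4.50</r:Price></Stock>
  <Stock><r:Price>2.50</r:Price><w:Price>1.75</w:Price></Stock>
</GroceryList>
'@

$nav = $doc.CreateNavigator()

# The namespace manager holds the prefix mappings used by the XPath expression.
$ns = New-Object System.Xml.XmlNamespaceManager -ArgumentList $nav.NameTable
$ns.AddNamespace('g', 'urn:foo')        # the document's default namespace needs a prefix in XPath
$ns.AddNamespace('shop', 'urn:retail')  # not "r" as in the document; only the URI matters

# Retail prices are found by namespace, regardless of their position inside Stock.
$retail = $nav.Evaluate('sum(g:GroceryList/g:Stock/shop:Price)', $ns)

# Unprefixed names in XPath 1.0 mean "no namespace", so this matches nothing.
$unprefixed = $nav.Evaluate('count(GroceryList/Stock)', $ns)

"retail sum = $retail, unprefixed matches = $unprefixed"
```

The second expression is the common stumbling block: a default namespace in the document does not carry over to the XPath expression.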
Let’s look at some more things we can do with xeval. The second parameter of xeval may be an array of XPath expressions. xeval will evaluate each of these expressions.
PS C:\Demos> xeval GroceriesNS.xml "sum(a:GroceryList/a:Stock/r:Price)",
"count(a:GroceryList/a:Stock)" @{a="urn:foo";r="urn:retail"}
29.15
4
PS C:\Demos>
Here we calculated the sum of the retail prices and number of Stock items. Note that this example makes use of the fact that in PowerShell the “,” operator makes an array of the arguments it joins. Let’s carry this one step further.
PS C:\Demos> xeval GroceriesNS.xml "sum(a:GroceryList/a:Stock/r:Price)",
"count(a:GroceryList/a:Stock)",
"sum(a:GroceryList/a:Stock/r:Price)
div count(a:GroceryList/a:Stock)" @{a="urn:foo";r="urn:retail"}
29.15
4
7.2875
PS C:\Demos>
Here, besides the sum of the prices and the number of stock items, we calculate the average price of the stock items. The important point of these last few examples is that it is very common to calculate some value based on the content of an XML file. These calculations can be embedded in an XPath expression and you never have to “read”, i.e. pull out and interpret parts of, the XML file to do this.
You might think that all the repeated a:GroceryList etc. might be inefficient or at least tedious. First of all it’s not really inefficient at all to calculate a path multiple times in an XPath expression because the XPath engine that is evaluating these expressions caches paths and reuses them when they appear again. As far as the tedium of typing them multiple times, you can leverage PowerShell itself to simplify that.
PS C:\Demos> $s = "a:GroceryList/a:Stock"
PS C:\Demos> $p = "$s/r:Price"
PS C:\Demos> xeval GroceriesNS.xml "sum($p)",
"count($s)",
"sum($p) div count($s)" @{a="urn:foo";r="urn:retail"}
29.15
4
7.2875
PS C:\Demos>
Here we have made use of the fact that PowerShell will build a string out of a combination of literal text and variables. If the format of the XML file is pretty regular you can make the XPath expression used for the evaluation even more simple.
PS C:\Demos> xeval GroceriesNS.xml "sum(//r:Price)" @{r="urn:retail"}
29.15
PS C:\Demos>
Of course here again you need some knowledge of XPath to simplify things. The “//” part of the XPath expression in this case really means “Find all the r:Price elements in the file.”
There is a hidden value in using XPath expressions to do calculations on an XML file: that expression can be used by anyone using any technology that implements XML support to do the same calculation on that file. In other words the XPath expression is a platform-independent way of specifying how a calculation is done; it is not limited to PowerShell.
Sometimes you will have a literal string for your xml instead of a file. You can’t pass this directly to the xeval function because it will interpret that string as a file path and attempt to load a file.
The implementation of xeval internally uses an XPathNavigator to process the XML that is passed to it. This blog article isn’t going to discuss the details of how XPathNavigator works, but xnav is an alias for a function that converts literal XML into an XPathNavigator. If the first parameter passed into xeval is an XPathNavigator it will use that navigator instead of interpreting it as a file path.
Here is an example of processing literal XML.
PS C:\Demos> $nav = xnav "<Stock><sku>ee-44</sku></Stock>"
PS C:\Demos> xeval $nav "string(//sku)"
ee-44
PS C:\Demos>
This example begins by using the xnav function to make an XPathNavigator out of some literal XML. This XPathNavigator is passed into the xeval function. The XPath expression passed to xeval pulls out the SKU (stock keeping unit) from the literal XML.
Using the pipeline in PowerShell is a great way to process XML. There are a number of grocery files with names like GroceriesNS1.xml, GroceriesNS2.xml and so on that we would like to process. We would like to calculate the value of each of these files. This is what the GroceriesNS1.xml file looks like.
<GroceryList xmlns="urn:foo"
xmlns:r="urn:retail"
xmlns:w="urn:wholesale"
ID = "A-24"
>
<Stock>
<Dept>Produce</Dept><Name>Orange</Name>
<w:Price>114.20</w:Price><r:Price>3.41</r:Price>
</Stock>
<Stock>
<Dept>Meat</Dept><Name>Steak</Name>
<r:Price >13.20</r:Price><w:Price >14.20</w:Price>
</Stock>
<Stock>
<Dept>Produce</Dept><Name>Lettuce</Name>
<w:Price>21.34</w:Price><r:Price>1.36</r:Price>
</Stock>
</GroceryList>
First of all it’s straightforward to get the names of these files.
PS C:\Demos> get-childitem C:\Demos\* |
?{$_.Name -match "GroceriesNS\d+.xml"}
Directory: Microsoft.PowerShell.Core\FileSystem::C:\Demos
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 11/25/2006 10:15 AM 438 GroceriesNS1.xml
-a--- 11/25/2006 10:14 AM 324 GroceriesNS2.xml
-a--- 11/25/2006 10:16 AM 438 GroceriesNS3.xml
PS C:\Demos>
Note that the GroceryList element has an attribute named ID that identifies that list. We want to include that ID in our results.
PS C:\Demos> get-childitem C:\Demos\* |
?{$_.Name -match "GroceriesNS\d+.xml"} |
%{xeval "$_" "string(f:GroceryList/@ID)", "sum(//r:Price)" `
@{f="urn:foo";r="urn:retail"} }
A-24
17.97
31
54.4
109
57.97
PS C:\Demos>
In this example we pipe the file names into a script block that uses the xeval function. This uses XPath expressions to get both the ID of the GroceryList and the sum of its Price elements. Note the backtick at the end of the third line to ensure the continuation of the command line.
The output we get is ID followed by sum. We might like something that produces a single line per GroceryList. We could pipe these results into another script block that aggregated these results by the pair… or we could use XPath to do the same thing.
PS C:\Demos> get-childitem C:\Demos\* |
?{$_.Name -match "GroceriesNS\d+.xml"} |
%{xeval "$_" "concat(string(f:GroceryList/@ID), ' : ', sum(//r:Price))" `
@{f="urn:foo";r="urn:retail"} }
A-24 : 17.97
31 : 54.400000000000006
109 : 57.97
Here we use the XPath concat function to produce a line-per-GroceryList report of the sum of the prices of each grocery list. You can produce some pretty fancy reports using just XPath expressions, but if they are much more complicated than the one in this example you will find it somewhat tedious to code them up. For more complicated reports XSLT is really a better choice and we will be looking at that in a later blog article. In any case this example has defined a report in terms of an XPath expression, so anyone on any platform that implements XML support can use that expression to produce the same report. This example didn’t “code up” a report; it made a rule that defined how the report was to be produced.
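The “rule, not code” idea can be tried without the sample files. In this sketch the report rule is kept in a single string and applied to each document coming down the pipeline; the two inline grocery lists and their prices are invented for the illustration, and the navigator calls stand in for what xeval does internally.

```powershell
# Two hypothetical grocery lists, inline so that no files are needed.
$lists = @(
    '<GroceryList xmlns="urn:foo" xmlns:r="urn:retail" ID="A-24"><Stock><r:Price>1.50</r:Price></Stock><Stock><r:Price>2.25</r:Price></Stock></GroceryList>',
    '<GroceryList xmlns="urn:foo" xmlns:r="urn:retail" ID="31"><Stock><r:Price>4.00</r:Price></Stock></GroceryList>'
)

# The whole report is defined by this one XPath expression.
$rule = "concat(string(f:GroceryList/@ID), ' : ', sum(//r:Price))"

$report = $lists | ForEach-Object {
    $nav = ([xml]$_).CreateNavigator()

    # Map the prefixes used in $rule to the namespaces used in the documents.
    $ns = New-Object System.Xml.XmlNamespaceManager -ArgumentList $nav.NameTable
    $ns.AddNamespace('f', 'urn:foo')
    $ns.AddNamespace('r', 'urn:retail')

    # concat() makes Evaluate return a string: one report line per document.
    $nav.Evaluate($rule, $ns)
}

$report
```

Changing the report means changing only the string in $rule; the surrounding PowerShell stays the same.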
Now let’s look at the implementation. We will start with the xeval function.
filter get-XSLT_XPathEvaluate
{
param($nav, [array]$computations, [hashtable]$namespaces)
if($nav -is [string])
{
$nav = get-XSLT_XPathNavigator $nav
}
if($nav -isnot [System.Xml.XPath.XPathNavigator])
{ throw "String file path or XPathNavigator required"}
$nm = get-XSLT_NamespaceManager $nav.NameTable $namespaces
foreach($n in $nav)
{
foreach($compute in $computations)
{
$n.Clone().Evaluate($compute, $nm)
}
}
}
set-alias xeval get-XSLT_XPathEvaluate
The xeval function uses three parameters. The first is a string or an XPathNavigator, the second is an array of XPath expressions, and the last is a dictionary of namespace mappings. It tests the first parameter to see if it is a string. If it is, it uses the get-XSLT_XPathNavigator function to make an XPathNavigator from the file path. We will look at the get-XSLT_XPathNavigator function shortly.
Next it checks to make sure that the $nav variable is in fact an XPathNavigator and throws an error if it isn’t.
In order to use namespaces with an XPathNavigator you need a construct called an XmlNamespaceManager. This construct holds the mappings of prefixes to namespaces. Both XPathNavigators and XmlDocuments store their associated XML in a non-textual, binary form for efficiency. Internally another construct, a NameTable, maintains a mapping between the names of elements and attributes, and their internal representation. The XmlNamespaceManager uses this NameTable in its constructor so that it can have the same mapping of names to internal representation that the XPathNavigator does.
The XmlNamespaceManager is constructed and filled by the get-XSLT_NamespaceManager function, which we will look at shortly.
To do the computations the xeval function iterates through the array of XPath expressions that are passed in. It uses a clone of the XPathNavigator to execute each expression. The reason it uses a clone of the XPathNavigator is that the XPathNavigator is really a cursor on the XML file and we want to leave that XPathNavigator in its original state for each execution of the XPath expressions being processed.
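The “navigator is a cursor” point is easy to see in isolation. The following sketch, which uses the literal XML from the earlier xnav example, moves a clone and shows that the original navigator stays where it was; it illustrates the cursor behaviour only and is not part of XSLT.ps1.

```powershell
[xml]$doc = '<Stock><sku>ee-44</sku></Stock>'

# A new navigator starts out positioned on the document root node.
$nav = $doc.CreateNavigator()

# Clone() gives an independent cursor at the same position.
$copy = $nav.Clone()

# Move only the copy, down to the <Stock> element.
[void]$copy.MoveToFirstChild()

$nav.NodeType                 # Root  (the original did not move)
$copy.Name                    # Stock (the clone did)
$nav.IsSamePosition($copy)    # False
```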
filter get-XSLT_XPathNavigator
{
param ($xml)
if($xml -is [string])
{
$xml = get-XSLT_XMLReader $xml;
$xml = get-XSLT_XPathDocument $xml
}
$nav = $xml.CreateNavigator();
$nav
}
The get-XSLT_XPathNavigator function uses the string passed into it as a file path. It starts by converting the file path into an XmlReader, then uses that XmlReader to make an XPathDocument, which in turn is used to make an XPathNavigator.
filter get-XSLT_XMLReader
{
param ([string]$xmlFile)
[System.IO.FileStream]$fileStream = new-object System.IO.FileStream $xmlFile,
([System.IO.FileMode]::Open),
([System.IO.FileAccess]::Read)
[System.Xml.XmlTextReader]$rdr = new-object System.Xml.XmlTextReader $fileStream
$rdr
}
The get-XSLT_XMLReader function opens a FileStream using the string passed in as the path to the file. Note that it is not using a StreamReader, which would convert the file to text; it is instead reading the raw bytes in the file. The FileStream is used to make an XmlTextReader. Again, despite its name, an XmlTextReader does not read text; it reads bytes from the FileStream, and because it is a compliant XML processor it is completely capable of determining the encoding of the XML that is in the byte stream.
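As a rough check of that claim, the sketch below hands an XmlTextReader a raw FileStream over a UTF-16BE file and then asks the reader which encoding it found. The temporary file name is an assumption for the illustration, and the exact encoding name reported may vary between .NET versions, so the script only prints it.

```powershell
# Create a small UTF-16BE file (with byte order mark) to read back.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'xmlreader-demo.xml'
[System.IO.File]::WriteAllText($path, '<Stock><sku>ee-44</sku></Stock>', [System.Text.Encoding]::BigEndianUnicode)

# Open the file as bytes, exactly as get-XSLT_XMLReader does.
$stream = New-Object System.IO.FileStream -ArgumentList $path, ([System.IO.FileMode]::Open), ([System.IO.FileAccess]::Read)
try {
    $rdr = New-Object System.Xml.XmlTextReader -ArgumentList $stream

    # Start parsing; the reader has to work out the encoding to get this far.
    [void]$rdr.MoveToContent()

    $rootName     = $rdr.Name               # name of the first element read
    $encodingName = $rdr.Encoding.WebName   # encoding the reader detected
    $rdr.Close()
}
finally {
    $stream.Dispose()
    Remove-Item $path
}

"root element: $rootName, detected encoding: $encodingName"
```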
filter get-XSLT_XPathDocument
{
param ([System.Xml.XmlReader]$xml)
$doc = new-object System.Xml.XPath.XPathDocument $xml;
$doc
}
The get-XSLT_XPathDocument function uses an XmlReader to make an XPathDocument. An XPathDocument is, in effect, a read-only XmlDocument, except that the only thing you can do with it is make an XPathNavigator out of it. If all you are going to do is read the content of an XML file and not modify it, an XPathDocument may be a better choice because it may be more efficient at processing XPath than the XmlDocument.SelectNodes method is.
function get-XSLT_NamespaceManager
([System.Xml.NameTable] $nameTable, [hashtable] $namespaces)
{
$nm = new-object System.Xml.XmlNamespaceManager $NameTable
foreach($key in $namespaces.keys)
{
$nm.AddNamespace($key, $namespaces.$key);
}
,$nm
}
The get-XSLT_NamespaceManager function has two inputs, a NameTable and a dictionary of namespace mappings. It starts by making an XmlNamespaceManager. It then iterates through the keys in the dictionary and uses each key and its associated value to add namespace mappings to the XmlNamespaceManager. Note that it uses the “,” operator when it returns the XmlNamespaceManager. The XmlNamespaceManager implements IEnumerable, and returning it inside of an array prevents the XmlNamespaceManager itself from being enumerated by PowerShell when it is returned, which is what we want.
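The effect of the “,” operator on a returned collection can be seen with any enumerable object. This sketch uses an ArrayList instead of an XmlNamespaceManager purely to keep it short; the function names Get-Unrolled and Get-Wrapped are made up for the illustration.

```powershell
function Get-Unrolled {
    $list = New-Object System.Collections.ArrayList
    [void]$list.Add('a')
    [void]$list.Add('b')
    $list     # PowerShell enumerates the list; the caller receives its items
}

function Get-Wrapped {
    $list = New-Object System.Collections.ArrayList
    [void]$list.Add('a')
    [void]$list.Add('b')
    ,$list    # one-element array; only that wrapper is unrolled on return
}

$unrolled = Get-Unrolled
$wrapped  = Get-Wrapped

$unrolled.GetType().FullName   # System.Object[]
$wrapped.GetType().FullName    # System.Collections.ArrayList
```

Without the comma the caller gets the contents of the collection rather than the collection object itself, which for an XmlNamespaceManager would mean losing the object that Evaluate needs.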
Lastly the get-XSLT_LiteralXPathNavigator function is used to make an XPathNavigator out of literal XML.
filter get-XSLT_LiteralXPathNavigator
{
param ([string]$literalXml)
[xml]$xml = $literalXml;
$xml.CreateNavigator();
}
This is a case where assuming the XML is in fact text is OK, because it is text, and we use the PowerShell implementation of XML to get a navigator from the string.
You have probably noticed that all of these functions have an “XSLT_” internal prefix in them but this blog article doesn’t show anything about the use of XSLT. There is more coming in blog articles that follow this one…
So where are we at? The xeval function can handle a lot of the kinds of processing that are typically done with XML and has none of the limitations that the PowerShell implementation of XML does. You do have to learn a bit about XPath; the references that were cited earlier are a good place to start and there are XPath tutorials all over the web. YMMV, but typically the best way to process XML is to process XML rather than turn it into an object model. It will be worth your effort to learn XPath… after all it is easier to learn XPath than to learn Perl:-).