
.NET, XML, SQL and Doing Things as Time Allows

PowerShell has built-in support for XML, but the System.Xml namespace offers many additional capabilities for processing XML. This article looks at using System.Xml in PowerShell. It assumes you know some of the basics of PowerShell programming and are familiar with the System.Xml namespace in .NET.

First of all, an [xml] variable in PowerShell is an instance of an XmlDocument. Typically an [xml] variable is initialized by assigning a string to it. For example:

PS C:\demos> [xml]$xmldata = "<order><line price='100' qty='3'>hammer</line></order>"

We are going to look at a different way to load XML into an [xml] variable.

PS C:\demos> $xmldata = new-object "System.Xml.XmlDocument"
PS C:\demos> $xmldata.LoadXml("<order><line price='100' qty='3'>hammer</line></order>")
PS C:\demos>

This way of initializing a variable produces the same result as assigning a string to an [xml] variable: either way $xmldata ends up holding an XmlDocument. We create the XmlDocument explicitly so that we can call its methods directly and leverage XPathNavigators and XPathExpressions, as will become evident later.
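One way to check that the two approaches really are equivalent is to build the document both ways and compare the results. This is only a sketch; the variable names $viaCast and $viaNew are made up for the comparison.

```powershell
$text = "<order><line price='100' qty='3'>hammer</line></order>"

# 1. Assign the string to an [xml] variable.
[xml]$viaCast = $text

# 2. Create the XmlDocument explicitly and call LoadXml.
$viaNew = New-Object System.Xml.XmlDocument
$viaNew.LoadXml($text)

# Both variables hold the same .NET type and the same markup.
$viaCast.GetType().FullName                          # System.Xml.XmlDocument
$viaNew.GetType().FullName                           # System.Xml.XmlDocument
$viaCast.get_OuterXml() -eq $viaNew.get_OuterXml()   # True
```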

You can get back the XML in text form by using the get_InnerXml() method.

PS C:\demos> $xmldata.get_InnerXml()
<order><line price="100" qty="3">hammer</line></order>
PS C:\demos>

If you were programming in C# you would use the InnerXml property of the $xmldata variable to retrieve the XML text, but in PowerShell you must call the underlying get_InnerXml method to do the same thing. This is the case for the other properties of XmlDocument too: the properties are not available by just using their names as they are in C#.

The root of an XML document is called the DocumentElement, and for our document the name of the root element is “order”. You can get a reference to it through the DocumentElement property and use its Name property to find its name.

PS C:\demos> $xmldata.get_DocumentElement().get_Name()
order
PS C:\demos>

Here we used the get_DocumentElement method to get the DocumentElement and the get_Name method to get its name.

You can modify an XmlDocument by adding or removing XML nodes. A node is a part of an XML document; for example, an element is a node, as is an attribute. Let’s add another line to our order.

PS C:\demos> $line = $xmldata.CreateElement("line")
PS C:\demos> $line.SetAttribute("price", 23)
PS C:\demos> $line.SetAttribute("qty", 4)
PS C:\demos> $line.set_InnerText("nail")
PS C:\demos> $d =$xmldata.get_DocumentElement().AppendChild($line)
PS C:\demos> $xmldata.get_InnerXml()
<order><line price="100" qty="3">hammer</line><line price="23" qty="4">nail</line></order>
PS C:\demos>

Elements for an XmlDocument are not created using a constructor. Instead a technique called a factory method is used to create them. This is typical of almost all XML processors on any platform. An XmlDocument contains factory methods to create the various kinds of nodes you find in an XmlDocument. To create a new line element we use the XmlDocument.CreateElement factory method.

An XmlElement has a SetAttribute method that is used to add attributes to that element. We use SetAttribute to add a “price” and a “qty” attribute to the line element we created. SetAttribute is really a shortcut method. We could use CreateAttribute and SetAttributeNode instead, but SetAttribute is more straightforward.
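For comparison, here is a small sketch of the longer route through CreateAttribute and SetAttributeNode next to the SetAttribute shortcut. The one-element document is invented for the illustration.

```powershell
$doc = New-Object System.Xml.XmlDocument
$doc.LoadXml("<order/>")
$line = $doc.CreateElement("line")

# Long form: create the attribute node with the document's factory method,
# give it a value, then attach it to the element.
$price = $doc.CreateAttribute("price")
$price.set_Value("23")
[void]$line.SetAttributeNode($price)

# Short form: SetAttribute creates and attaches the attribute in one call.
$line.SetAttribute("qty", "4")

$line.get_OuterXml()   # <line price="23" qty="4" />
```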

The content of the new line element is set using the set_InnerText method. Again, if you were programming in C# you would assign the InnerText property a value, but for an XmlDocument in PowerShell you must use the set_InnerText method. This will be true for the other assignable properties in XmlDocument too.

Creating an element using a factory method does not add that element to the document. We use the AppendChild method of the DocumentElement of the document to add the line. AppendChild always makes the added element the last child. PrependChild will also add an element but will make it the first child. AppendChild always returns a reference to the element that was appended. To prevent that returned value from “leaking” out of the script we capture it in the dummy $d variable.
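The difference between adding at the end and adding at the front can be seen in a short sketch. Here the return values are discarded with a [void] cast instead of a dummy variable, which is just an alternative style; the element contents are made up.

```powershell
$doc = New-Object System.Xml.XmlDocument
$doc.LoadXml("<order><line>hammer</line></order>")
$root = $doc.get_DocumentElement()

# AppendChild puts the new element after the existing children.
$last = $doc.CreateElement("line")
$last.set_InnerText("nail")
[void]$root.AppendChild($last)

# PrependChild puts the new element before the existing children.
$first = $doc.CreateElement("line")
$first.set_InnerText("saw")
[void]$root.PrependChild($first)

$doc.get_InnerXml()
# <order><line>saw</line><line>hammer</line><line>nail</line></order>
```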

Last, we use the get_InnerXml method of the XmlDocument to see that we have in fact added a new line element to the document.

Now that we have a document let’s do some processing of it. Each line has a price and qty attribute and the product of these two attributes is called the extended price. The value of an order is the sum of all of the extended prices in it. So let’s calculate the value of an order. In this example we are going to use the native [xml] support built into PowerShell.

To start with let’s just calculate the extended prices.

PS C:\demos> $xmldata.order.line | %{$_.price * $_.qty}
100100100
23232323
PS C:\demos>

Somehow the results don’t really look correct. Data in an XML document may be untyped or strongly typed. Untyped doesn’t really mean the data doesn’t have a type; it just means that each piece of data is considered to be a string even if it looks like a number. Strongly typed XML is produced by validating an XML document against an XML Schema. In this case the types of the pieces of data are known because they are defined in the XML Schema. Sometimes a validated XML document is called the Post-Schema-Validation Infoset, or PSVI.

Our $xmldata XML document is untyped, so the price and quantity are considered to be strings. When the ‘*’ operator is used with a string on its left, the value on its right is converted to an integer, or an error is produced if it cannot be converted. The value of this integer is used to replicate and concatenate the string on the left of the operator. That is why we see 100 repeated three times: the value of qty for the first line is “3”.
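The behavior of ‘*’ can be reproduced without any XML at all. A minimal sketch, with variable names chosen only for this illustration:

```powershell
# String on the left: the right operand becomes a repeat count.
$repeated = "100" * "3"
$repeated            # 100100100

# Number on the left: the right operand is converted to a number too.
$product = [double]"100" * "3"
$product             # 300
```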

We have to cast the price to a double to get what we want.

PS C:\demos> $xmldata.order.line | %{[double]$_.price * $_.qty}
300
92
PS C:\demos>

Now we can use the sum function example from my previous blog article, PowerShell and XML and SQL Server, to find the value of the order.

PS C:\demos> function sumOrder {
>> begin {$value = 0}
>> process { $value += [double]$_.price * $_.qty}
>> end {$value}
>> }
>>
PS C:\demos> $xmldata.order.line | sumOrder
392
PS C:\demos>
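As an alternative to writing a summing function by hand, the extended prices can also be piped to the Measure-Object cmdlet. This is a sketch of that variation rather than the approach used in the rest of the article; the $order and $total names are made up.

```powershell
[xml]$order = "<order><line price='100' qty='3'>hammer</line><line price='23' qty='4'>nail</line></order>"

# Compute each extended price, then let Measure-Object add them up.
$total = ($order.order.line |
    ForEach-Object { [double]$_.price * $_.qty } |
    Measure-Object -Sum).Sum

$total   # 392
```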

We can use the SelectNodes method of XmlDocument to get the same result. It makes use of an XPath expression which is a bit more flexible, though more complicated, than the dotted syntax that PowerShell provides for [xml] variables.

PS C:\demos> $xmldata.SelectNodes("//line") | sumOrder
392
PS C:\demos>

The SelectNodes method returns a set of XML nodes from a document that meet some criterion specified by the XPath expression. This particular XPath expression returns all the line elements in the document. When a set of XML nodes is put into a pipeline PowerShell passes each one of the nodes one at a time into the pipe.

One of the nice things about using XPath is that you can bury a lot of selection logic right into the XPath expression. What if we want to know the value of only the expensive items in our order? Our definition of expensive is when the price is more than 99.

PS C:\demos> $xmldata.SelectNodes("//line[@price>99]") | sumOrder
300
PS C:\demos>

Here our XPath expression has a predicate, “[@price>99]”, that filters out any lines whose price is 99 or less. The following is the equivalent using the XML capabilities built into PowerShell.

PS C:\demos> $xmldata.order.line | ?{[double]$_.price -gt 99} | sumOrder
300
PS C:\demos>

Notice that in this case it was important to cast the price to a [double]; otherwise PowerShell would have checked whether the price sorted lexically after the string “99”.
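The difference between the two comparisons is easy to see with plain values. A minimal sketch:

```powershell
# Compared as strings: "1" sorts before "9", so "100" is not greater than "99".
$asStrings = "100" -gt "99"
$asStrings           # False

# Compared as numbers once the left operand is a double.
$asNumbers = [double]"100" -gt 99
$asNumbers           # True
```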

An important thing about XPath is its universality. The XPath expression we have used in this example, technically called a location path, is a criterion for selection. Virtually every language and every platform supports XPath. You can pass the XPath expression we used in this example, “//line[@price>99]”, to almost any other program and it will select the same lines for processing as we did in this example.
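The same predicate also works when it is embedded in a larger XPath expression and evaluated through an XPathNavigator. The sketch below assumes the two-line order built earlier and uses the standard XPath count() and sum() functions; it is one possible way to do this, not the only one.

```powershell
$doc = New-Object System.Xml.XmlDocument
$doc.LoadXml("<order><line price='100' qty='3'>hammer</line><line price='23' qty='4'>nail</line></order>")

# CreateNavigator returns an XPathNavigator positioned on the document.
$nav = $doc.CreateNavigator()

# Evaluate returns a number for expressions such as count() and sum().
$expensiveLines = $nav.Evaluate("count(//line[@price>99])")
$totalQty       = $nav.Evaluate("sum(//line/@qty)")

$expensiveLines   # 1
$totalQty         # 7
```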

You have to be careful reading XML. For example, below is an XML file that is encoded as big-endian UTF-16. You can’t see the actual encoding on this page, but you can download this test file from http://www.pluralsight.com/dan/samples/PSXml.zip if you want to try it out.

<?xml version="1.0" encoding="UTF-16BE"?>
<Test/>

The get-content command is a way to read the content of a file. For example, you might try to read the sample file into a built-in [xml] variable in PowerShell like this:

PS C:\demos> [xml]$x = get-content "c:\demos\testdocs\test.xml"
Cannot convert value "System.Object[]" to type "System.Xml.XmlDocument". Error:
 "Root element is missing."
At line:1 char:8
+ [xml]$x  <<<< = get-content "c:\demos\testdocs\test.xml"
PS C:\demos>

What PowerShell is doing here is reading in test.xml as a string, then assigning that string to the $x variable. Unfortunately, when it does this it has to make a guess about the encoding of the file because I didn’t tell it what it was, and it guessed wrong. In fact, if I just ask it to read the file and tell it I don’t know the encoding, it will generate a lot of unknown characters because of an incorrect guess about the encoding of the file.

PS C:\demos> get-content "c:\demos\testdocs\test.xml" -encoding Unknown
??????????????????????????????????????????????????????
PS C:\demos>

However, I happen to know the encoding of the file (as I said, it is big-endian UTF-16), so I can do this:

PS C:\demos> [xml]$x = get-content "c:\demos\testdocs\test.xml" -encoding BigEndianUnicode
PS C:\demos> $x.get_InnerXml()
<?xml version="1.0" encoding="UTF-16BE"?><Test />
PS C:\demos>

Now we are able to read our big-endian UTF-16 file. But this defeats one of the most important features of XML: you can read an XML file without knowing its encoding.

Fortunately because PowerShell supports all of the .NET framework we can get around this problem and read any XML file that the underlying .NET Framework can handle without knowing its encoding.

PS C:\demos> $doc = new-object "System.Xml.XmlDocument"
PS C:\demos> $doc.Load("c:\demos\testdocs\test.xml")
PS C:\demos> $doc.get_InnerXml()
<?xml version="1.0" encoding="UTF-16BE"?><Test />
PS C:\demos>

Here we initialize $doc as an XmlDocument, then use the Load method of XmlDocument to load the file. The argument for Load is a string that can be either a file path or a URL. This is the recommended way to load an XML document into a variable because you shouldn’t depend on knowing what the encoding of an XML document is.
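If you don’t have the sample file handy, the following sketch writes a small big-endian UTF-16 file and loads it with XmlDocument.Load without naming the encoding. The scratch file name is hypothetical, and the XML declaration here says "UTF-16" rather than the "UTF-16BE" of the sample file; that choice is an assumption made for this sketch.

```powershell
# Hypothetical scratch file in the temp directory; any writable path would do.
$path = Join-Path ([System.IO.Path]::GetTempPath()) "psxml-be-test.xml"
$text = '<?xml version="1.0" encoding="UTF-16"?><Test/>'

# Write the text as big-endian UTF-16 (the encoding object emits a byte order mark).
[System.IO.File]::WriteAllText($path, $text, [System.Text.Encoding]::BigEndianUnicode)

# Load works out the encoding from the file itself.
$doc = New-Object System.Xml.XmlDocument
$doc.Load($path)
$rootName = $doc.get_DocumentElement().get_Name()
$rootName   # Test

Remove-Item $path
```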

Now that we can read XML, let’s process some XML files. Microsoft Word 2003 documents can be saved as XML. We have a few files that have been saved this way.

PS C:\demos> dir c:\demos\testdocs\*.xml
    Directory: Microsoft.PowerShell.Core\FileSystem::C:\demos\testdocs
Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---        10/30/2006  10:29 AM        108 betest.xml
-a---        10/30/2006   9:29 AM      24982 Test Document 1.xml
-a---        10/30/2006   9:33 AM      29195 Test Document 2.xml
-a---        10/30/2006   9:30 AM      25105 Test Document 3.xml
-a---        10/30/2006  10:05 AM        108 test.xml
PS C:\demos>

Actually, some of the files in this directory are not Office documents, so we need a way to distinguish them. All Office XML files have one thing in common: they start with something called a processing instruction that looks like:

<?mso-application progid="Word.Document"?>

Let’s build a filter that will skip over the files that are not Word documents.

PS C:\demos> dir c:\demos\testdocs\*.xml | ?{$x = new-object "System.Xml.XmlDocument";
>> $x.Load($_.FullName);
>> $x.SelectSingleNode("processing-instruction('mso-application')")}
>>
    Directory: Microsoft.PowerShell.Core\FileSystem::C:\demos\testdocs
Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---        10/30/2006   4:29 PM      32448 Test Document 1.xml
-a---        10/30/2006   4:11 PM      29935 Test Document 2.xml
-a---        10/30/2006   9:30 AM      25105 Test Document 3.xml
PS C:\demos>

Here we use the full path name for each file to load an XmlDocument. Then we use the SelectSingleNode method of XmlDocument to see if we can find the processing instruction we are looking for. If the processing instruction isn’t found then the file is not passed out of the pipe, so it does not get listed.
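The test at the heart of that filter can be tried on in-memory strings instead of files. In this sketch the function name Test-WordXml and the sample strings are invented for the illustration.

```powershell
# Returns True when the XML text carries an mso-application processing instruction.
function Test-WordXml([string]$xmlText) {
    $doc = New-Object System.Xml.XmlDocument
    $doc.LoadXml($xmlText)
    # SelectSingleNode returns $null when nothing matches.
    $null -ne $doc.SelectSingleNode("processing-instruction('mso-application')")
}

Test-WordXml '<?mso-application progid="Word.Document"?><doc/>'   # True
Test-WordXml '<Test/>'                                            # False
```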

The XPath expression we used was a bit more complicated than the first one we tried. If you are interested in an interactive tool for working with XPath you can download Aaron Skonnard’s XPath expression builder from http://www.pluralsight.com/toolcontent/xpath-expression-builder-4.zip. Also these test documents can be found at http://www.pluralsight.com/dan/samples/PSXml.zip.

Lastly, we would like to show that these Word documents have been processed by PowerShell. After you open a document in Word, if you go to File->Properties->Custom you will see that you can add custom properties of your own design to a Word document. We would like to add a PowerShell custom property that indicates when the document was processed by PowerShell. These properties are embedded in the XML for the Word document.

Another thing about Word documents that we haven’t yet looked at is that they make heavy use of XML namespaces. So before we try anything with a complete Word document, let’s look at a simple document that has namespaces in it.

PS C:\demos> [xml]$x = '<w:wordDocument
>> xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
>> xmlns:o="urn:schemas-microsoft-com:office:office">
>> <o:CustomDocumentProperties>
>> </o:CustomDocumentProperties>
>>  </w:wordDocument>'
>>
PS C:\demos> $x.get_InnerXml()
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
xmlns:o="urn:schemas-microsoft-com:office:office"><o:CustomDocumentProperties><
/o:CustomDocumentProperties></w:wordDocument>
PS C:\demos>

This document is a mini-Word document with all the things we don’t care about stripped out of it.

Custom properties for a Word document are contained in a CustomDocumentProperties element from the “urn:schemas-microsoft-com:office:office” namespace. If the Word document doesn’t have any custom properties it will not have this element. So we will need a way to check whether that element is in the document. Let’s test our mini-Word document to verify we can find it.

PS C:\demos> $custDoc = "/*/*[local-name()='CustomDocumentProperties' and namespace-uri()='urn:schemas-microsoft-com:office:office']"
PS C:\demos> $x | ?{$_.SelectSingleNode($custDoc)}
wordDocument
------------
wordDocument

Here we have made an XPath expression and saved it into a variable so we can easily reuse it. It looks for an element whose name is CustomDocumentProperties and is in the office namespace. We can use it in a simple filter test and see that our test document can get through the filter.
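Another common way to write a namespace-aware XPath expression in System.Xml is to register a prefix with an XmlNamespaceManager and pass it to SelectSingleNode. The sketch below is an alternative to the local-name()/namespace-uri() test, not what the rest of this article uses; the prefix "off" is arbitrary and deliberately different from the "o" prefix in the document, since only the namespace URI has to match.

```powershell
[xml]$x = '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml" xmlns:o="urn:schemas-microsoft-com:office:office"><o:CustomDocumentProperties/></w:wordDocument>'

# Map a prefix of our own choosing to the office namespace.
$ns = New-Object System.Xml.XmlNamespaceManager($x.get_NameTable())
$ns.AddNamespace("off", "urn:schemas-microsoft-com:office:office")

# The prefix in the XPath expression is resolved through the namespace manager.
$node = $x.SelectSingleNode("/*/off:CustomDocumentProperties", $ns)
$node.get_LocalName()   # CustomDocumentProperties
```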

Next let’s look at adding the CustomDocumentProperties element if it is not there. First of all we will need an element to add.

PS C:\demos> $props = $x.CreateElement("CustomDocumentProperties", "urn:schemas-microsoft-com:office:office")
PS C:\demos> $props.get_OuterXml()
<CustomDocumentProperties xmlns="urn:schemas-microsoft-com:office:office" />
PS C:\demos>

To create an element in a particular namespace we use the second parameter of the CreateElement method to specify the desired namespace. To check to see if we got what we wanted we use the OuterXml property… there is no InnerXml for an element with no content. Let’s make a test document without a CustomDocumentProperties and try adding this element.

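PS C:\demos> [xml]$x2 = '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"/>'
PS C:\demos> $x2.get_InnerXml()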
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"/>
PS C:\demos> if ($x2.SelectSingleNode($custDoc)){}else
>> {
>> $props = $x2.CreateElement("CustomDocumentProperties", "urn:schemas-microsoft-com:office:office")
>> $x2.get_DocumentElement().AppendChild($props)
>> }
>>
PS C:\demos> $x2.get_InnerXml()
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
<CustomDocumentProperties xmlns="urn:schemas-microsoft-com:office:office" /><
/w:wordDocument>
PS C:\demos>

Here we start off with a new XML document, $x2, that contains just a wordDocument element. We use an if construct to test $x2 to see if it contains a CustomDocumentProperties element. If it does not, we create one and add it. Then we check to make sure the element was added.

This will be useful for what we do next, so let’s save it as a function.

function addCustomProps
{
    # Relies on $_ being the XmlDocument currently in the caller's pipeline.
    $cust = $_.SelectSingleNode("/*/*[local-name()='CustomDocumentProperties' and namespace-uri()='urn:schemas-microsoft-com:office:office']")
    if ($cust) { $cust } else {
        # No custom properties yet: create the element and return the appended node.
        $props = $_.CreateElement("CustomDocumentProperties", "urn:schemas-microsoft-com:office:office")
        $_.get_DocumentElement().AppendChild($props)
    }
}

Note that the addCustomProps function always returns a CustomDocumentProperties element, whether it found an existing one or had to create it.
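The same find-or-create logic can also be written with the document passed as an explicit parameter instead of through $_. This is a sketch of that variation; the names Get-CustomProps, $officeNs and $custXPath are made up here. Calling it twice on the same document shows that the element is only created once.

```powershell
$officeNs  = "urn:schemas-microsoft-com:office:office"
$custXPath = "/*/*[local-name()='CustomDocumentProperties' and namespace-uri()='$officeNs']"

# Returns the CustomDocumentProperties element, creating it if it is missing.
function Get-CustomProps([System.Xml.XmlDocument]$doc) {
    $cust = $doc.SelectSingleNode($custXPath)
    if ($null -ne $cust) { return $cust }
    $props = $doc.CreateElement("CustomDocumentProperties", $officeNs)
    return $doc.get_DocumentElement().AppendChild($props)
}

[xml]$word = '<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"/>'
$first  = Get-CustomProps $word   # creates the element
$second = Get-CustomProps $word   # finds the one created above

$count = $word.SelectNodes($custXPath).get_Count()
$count   # 1
```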

Now we have everything we need to modify a Word document by adding a custom property to it.

PS C:\demos> $filePath = "C:\demos\testdocs\test document 1.xml"
PS C:\demos> $doc = new-object "System.Xml.XmlDocument"
PS C:\demos> $doc.Load($filePath)
PS C:\demos> $prop = $doc.CreateElement("PowerShell", "urn:schemas-microsoft-com:office:office")
PS C:\demos> $prop.SetAttribute("dt", "uuid:C2F41010-65B3-11d1-A29F-00AA00C14882", "string")
string
PS C:\demos> $prop.set_InnerText([System.DateTime]::Now)
PS C:\demos> $doc | %{addCustomProps} | %{$_.AppendChild($prop)}
dt                                      #text
--                                      -----
string                                  10/30/2006 16:07:57

PS C:\demos>
PS C:\demos> $doc.Save($filePath)

We start off by setting the $filePath variable to the path of a Word document. Next we load that Word document into the $doc variable.

We use the $doc variable to create a PowerShell element, and fill it in with the current time. We also add a dt attribute to specify that this is a string property and put the PowerShell element in the “urn:schemas-microsoft-com:office:office” namespace. Both of these are required for a custom property added to a Word document.

Finally, we pass the $doc variable through a pipeline to our addCustomProps function. This function always returns the CustomDocumentProperties element, so we can use the next segment of the pipeline to append our PowerShell property to it.

If you now open the “test document 1.xml” file in Word and navigate to its custom properties you will see that it now has a PowerShell property.

So we can use the full set of features available from the System.Xml namespace in .NET. The key to really making use of them is to become familiar with XPath. We really have just scratched the surface of its capabilities.

Dan

dan@pluralsight.com

posted on Monday, October 30, 2006 3:09 PM

  • # re: PowerShell and XmlDocument
    ScriptRunner
    Posted @ 12/18/2007 1:13 PM
    Nice article, very helpful...