XmlSerializer sans XSD

This is for Chris...

You have some XML you want to process using XmlSerializer, but you don't have a schema to feed to xsd.exe in order to generate CLR types. So what do you do? Write your serializable class by hand. It isn't very hard once you understand the basic mapping. Here's how it works.

1) An element maps to a class.

Consider the root of an RSS document, an element for which no XSD schema exists.
<rss version="2.0">...</rss>

Here's a corresponding class:

[XmlRoot("rss")]
public class Rss { ... }

The [XmlRoot] attribute says this class matches the root element of a document (or can be the beginning of a serialization/deserialization) with the qualified name {}rss (that is, the local name rss and no namespace URI).

An <rss> element contains a <channel> element:

<rss version="2.0">
  <channel>...</channel>
</rss>

This element becomes another class:

public class Channel { ... }

[XmlRoot("rss")]
public class Rss
{
  public Channel channel;
}

The name of the Rss.channel field implicitly matches the <channel> element. If the element were in a namespace or were spelled differently (even in casing), you would use the [XmlElement] attribute to define the right mapping (there's an example of this coming shortly).

2) Simple elements with text only children can map to a class or a property/field.

The <channel> element contains several children:

<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>At Your Service</title>
    <link>http://www.pluralsite.net/tewald</link>
    <description/>
    <dc:language>en-US</dc:language>
    ...
  </channel>
</rss>

All of these children are elements with simple text content. Like all elements, you can map them to classes. Here's an example for the <title> element.

public class Title
{
  [XmlText]
  public string Value;
}

The [XmlText] attribute tells the serialization plumbing to map the text within the <title> element to the Value field. The Title class is used by the Channel class, as shown below.

public class Channel
{
  public Title title;
}

Following the "every element is a class" pattern at this level is a little unwieldy. To simplify things, you can represent an element with only text content as a field instead.

public class Channel
{
  public string title;
  public string link;
  public string description;
  [XmlElement(Namespace="http://purl.org/dc/elements/1.1/")]
  public string language;
}

(Note the use of [XmlElement] to specify a specific namespace for the language element, which comes from Dublin Core.)
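
With the classes built so far, loading and saving a document might look like the sketch below. The RssRoundTrip class, its Load and Save helpers, and the inline sample document are names and data assumed here for illustration; they are not part of the original sample. The dc prefix is registered through XmlSerializerNamespaces only so that the output reads like the input.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

public class Channel
{
  public string title;
  public string link;
  public string description;
  [XmlElement(Namespace="http://purl.org/dc/elements/1.1/")]
  public string language;
}

[XmlRoot("rss")]
public class Rss
{
  public Channel channel;
}

// Hypothetical helper class, named here only for illustration.
public static class RssRoundTrip
{
  // Deserialize: XML text in, Rss object out.
  public static Rss Load(string xml)
  {
    XmlSerializer serializer = new XmlSerializer(typeof(Rss));
    using (StringReader reader = new StringReader(xml))
    {
      return (Rss)serializer.Deserialize(reader);
    }
  }

  // Serialize: Rss object in, XML text out.
  public static string Save(Rss rss)
  {
    XmlSerializer serializer = new XmlSerializer(typeof(Rss));
    // Bind the dc prefix so Dublin Core elements are written as <dc:...>.
    XmlSerializerNamespaces namespaces = new XmlSerializerNamespaces();
    namespaces.Add("dc", "http://purl.org/dc/elements/1.1/");
    using (StringWriter writer = new StringWriter())
    {
      serializer.Serialize(writer, rss, namespaces);
      return writer.ToString();
    }
  }

  public static void Main()
  {
    // Assumed sample input, trimmed from the document shown above.
    string xml =
      "<rss version=\"2.0\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\">" +
      "<channel><title>At Your Service</title>" +
      "<dc:language>en-US</dc:language></channel></rss>";

    Rss rss = Load(xml);
    Console.WriteLine(rss.channel.title);

    rss.channel.description = "Edited in memory";
    Console.WriteLine(Save(rss));
  }
}
```

Because Rss does not yet map the version attribute, the reserialized document omits it. Attributes are covered in section 4.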

3) Elements that appear more than once map to arrays/ArrayLists.

What about elements that appear more than once? For instance, the <channel> element contains multiple <item> elements, each of which describes an entry in the RSS feed.

<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>At Your Service</title>
    <link>http://www.pluralsite.net/tewald</link>
    <description/>
    <dc:language>en-US</dc:language>
    <item>...</item>
    <item>...</item>
    <item>...</item>
  </channel>
</rss>

Each {}item maps to an instance of a class. Here's an example:

public class Item
{
  public string title;
  public string link;
  public string description;
  [XmlElement(Namespace="http://purl.org/dc/elements/1.1/")]
  public DateTime date; // note the use of DateTime instead of string
}

(Note the use of DateTime instead of string to represent the contents of the <dc:date> element. You can use any simple type you like to represent the text in an element or attribute; XmlSerializer will do the right thing.)

The Channel class uses the Item class this way:

public class Channel
{
  public string title;
  public string link;
  public string description;
  [XmlElement(Namespace="http://purl.org/dc/elements/1.1/")]
  public string language;
  [XmlElement("item")]
  public Item[] items;
}

Normally XmlSerializer maps an array or an ArrayList to an element representing the array and containing elements representing each item in the array. In RSS, however, the <item> elements simply appear "inline" within the <channel>, without any extra wrapper. Adorning the items field with [XmlElement("item")] tells the XmlSerializer plumbing to map the array to any <item> elements within <channel>, without looking for an additional wrapper.

(If you were processing an XML document that contained multiple instances of the same element within a wrapper, you could introduce a class for the wrapper element with an array for the contents. Alternatively, you could use the [XmlArray] and [XmlArrayItem] attributes, which control the mapping to the wrapper element and item elements, respectively.)
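
To illustrate the wrapper case, suppose (purely hypothetically; RSS does not do this) that the <item> elements were nested inside an <items> element. A sketch of the [XmlArray]/[XmlArrayItem] mapping follows. The <items> wrapper, the WrapperDemo class, and the [XmlRoot("channel")] attribute that lets the fragment load on its own are all assumptions made for the example.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

public class Item
{
  public string title;
}

// [XmlRoot("channel")] is an assumption so the fragment can be loaded by itself.
[XmlRoot("channel")]
public class Channel
{
  public string title;

  // <items> is the hypothetical wrapper element;
  // each <item> inside it becomes one entry in the array.
  [XmlArray("items")]
  [XmlArrayItem("item")]
  public Item[] items;
}

// Hypothetical helper class, named here only for illustration.
public static class WrapperDemo
{
  public static Channel Load(string xml)
  {
    XmlSerializer serializer = new XmlSerializer(typeof(Channel));
    using (StringReader reader = new StringReader(xml))
    {
      return (Channel)serializer.Deserialize(reader);
    }
  }

  public static void Main()
  {
    Channel channel = Load(
      "<channel><title>At Your Service</title>" +
      "<items><item><title>First</title></item>" +
      "<item><title>Second</title></item></items></channel>");

    foreach (Item item in channel.items)
      Console.WriteLine(item.title);
  }
}
```

Without [XmlArrayItem("item")], the serializer would look for child elements named after the type (Item, with a capital I), so the explicit name matters here.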

4) Attributes map to properties/fields of the class that represents their owner element.

Most XML dialects use attributes in some way. In RSS, for instance, the <category> element within an <item> has an optional domain attribute. Here's an example:

<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>At Your Service</title>
    <link>http://www.pluralsite.net/tewald</link>
    <description/>
    <dc:language>en-US</dc:language>
    <item>
      ...
      <category domain="abc">xyz</category>
      ...
    </item>
    <item>...</item>
    <item>...</item>
  </channel>
</rss>

In this case, the <category> element must be mapped to a class instead of a property/field, because you need a place to store the value of the domain attribute. Here's the corresponding class:

public class Category
{
  [XmlAttribute]
  public string domain;

  [XmlText]
  public string Value;
}

The Item class would use the Category class this way:

public class Item
{
  public string title;
  public string link;
  public string description;
  [XmlElement(Namespace="http://purl.org/dc/elements/1.1/")]
  public DateTime date; // note the use of DateTime instead of string
  [XmlElement("category")]
  public Category[] categories;
}

An <item> can contain any number of <category> elements, so an array is used.
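
A small sketch of the attribute mapping in action, reading one <item> with two categories, only one of which carries a domain. The CategoryDemo class, the [XmlRoot("item")] attribute (added so a lone <item> can be deserialized), and the sample data are assumptions made for this example.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

public class Category
{
  [XmlAttribute]
  public string domain;

  [XmlText]
  public string Value;
}

// [XmlRoot("item")] is an assumption so a single <item> can be loaded by itself.
[XmlRoot("item")]
public class Item
{
  public string title;

  [XmlElement("category")]
  public Category[] categories;
}

// Hypothetical helper class, named here only for illustration.
public static class CategoryDemo
{
  public static Item Load(string xml)
  {
    XmlSerializer serializer = new XmlSerializer(typeof(Item));
    using (StringReader reader = new StringReader(xml))
    {
      return (Item)serializer.Deserialize(reader);
    }
  }

  public static void Main()
  {
    Item item = Load(
      "<item><title>Post</title>" +
      "<category domain=\"abc\">xyz</category>" +
      "<category>plain</category></item>");

    foreach (Category category in item.categories)
    {
      // domain stays null when the attribute is absent from the element.
      string domain = category.domain == null ? "(none)" : category.domain;
      Console.WriteLine(category.Value + " domain=" + domain);
    }
  }
}
```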

5) Remember, you don't have to deserialize everything.

The XmlSerializer plumbing does not do any schema validation at run time. (You can add this if you want to by feeding it a validating reader.) It happily ignores anything you aren't looking for. It's because of this that I think of XmlSerializer more as a query tool than a serialization engine. I don't know why, but the term "serialization", to me, implies completeness. XmlSerializer doesn't have to produce a complete object model for your data. It will happily give you just what you tell it you want. The query is expressed as a set of stylized CLR types with attributes that indicate how classes/fields/properties correspond to XML. (It's also interesting to think of XmlSerializer as a filter, because after you deserialize to pull out what you want, you can reserialize the result.)

If you want to capture elements that you weren't expecting, you can do it this way:

public class Item
{
  public string title;
  public string link;
  public string description;
  [XmlElement(Namespace="http://purl.org/dc/elements/1.1/")]
  public DateTime date; // note the use of DateTime instead of string
  [XmlElement("category")]
  public Category[] categories;
  [XmlAnyElement]
  public XmlElement[] elems;
}

In this case, the [XmlAnyElement] attribute on the elems field indicates that any elements within an <item> that aren't explicitly mapped to another field should be stored here.
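
As a rough illustration, the sketch below loads an <item> whose <comments> and <guid> children have no matching field and lists what lands in elems. The AnyElementDemo class, the [XmlRoot("item")] attribute, and the sample data are assumptions made for the example.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Serialization;

// [XmlRoot("item")] is an assumption so a single <item> can be loaded by itself.
[XmlRoot("item")]
public class Item
{
  public string title;

  // Receives every child element that no other field claims.
  [XmlAnyElement]
  public XmlElement[] elems;
}

// Hypothetical helper class, named here only for illustration.
public static class AnyElementDemo
{
  public static Item Load(string xml)
  {
    XmlSerializer serializer = new XmlSerializer(typeof(Item));
    using (StringReader reader = new StringReader(xml))
    {
      return (Item)serializer.Deserialize(reader);
    }
  }

  public static void Main()
  {
    Item item = Load(
      "<item><title>Post</title>" +
      "<comments>http://example.org/comments</comments>" +
      "<guid>abc-123</guid></item>");

    Console.WriteLine(item.title);
    foreach (XmlElement element in item.elems)
      Console.WriteLine(element.LocalName + ": " + element.InnerText);
  }
}
```

Since the captured elements are ordinary XmlElement nodes, they are written back out when the object is reserialized.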

Finally, there is also a trick to determine whether an element or attribute actually appears in an XML document. This is important because when you map data to simple value types, they exist in your deserialized object model whether they appear in the source document or not. For instance, in an Item object, the date field will exist whether or not the corresponding <item> element contains a {http://purl.org/dc/elements/1.1/}date element.

You can detect the presence of a particular piece of data in the source document by introducing an additional boolean property/field called xyzSpecified, where xyz is the data you are after. Mark the flag with [XmlIgnore] so that it is not itself mapped to XML. For instance, the Item class can be modified this way:

public class Item
{
  public string title;
  public string link;
  public string description;
  [XmlElement(Namespace="http://purl.org/dc/elements/1.1/")]
  public DateTime date;
  [XmlIgnore]
  public bool dateSpecified; // true if date appears in source doc, otherwise false
  [XmlElement("category")]
  public Category[] categories;
  [XmlAnyElement]
  public XmlElement[] elems;
}
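
The flag works in both directions: XmlSerializer sets it while deserializing and consults it while serializing, skipping the element when the flag is false. Here is a sketch under a few assumptions: the SpecifiedDemo class, its helpers, the [XmlRoot("item")] attribute, and the sample data are all illustrative, and the Item class is trimmed to the fields that matter.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// [XmlRoot("item")] is an assumption so a single <item> can be loaded by itself.
[XmlRoot("item")]
public class Item
{
  public string title;
  [XmlElement(Namespace="http://purl.org/dc/elements/1.1/")]
  public DateTime date;
  [XmlIgnore]
  public bool dateSpecified; // set by the serializer on read, consulted on write
}

// Hypothetical helper class, named here only for illustration.
public static class SpecifiedDemo
{
  const string Dc = "http://purl.org/dc/elements/1.1/";

  public static Item Load(string xml)
  {
    XmlSerializer serializer = new XmlSerializer(typeof(Item));
    using (StringReader reader = new StringReader(xml))
    {
      return (Item)serializer.Deserialize(reader);
    }
  }

  public static string Save(Item item)
  {
    XmlSerializer serializer = new XmlSerializer(typeof(Item));
    XmlSerializerNamespaces namespaces = new XmlSerializerNamespaces();
    namespaces.Add("dc", Dc);
    using (StringWriter writer = new StringWriter())
    {
      serializer.Serialize(writer, item, namespaces);
      return writer.ToString();
    }
  }

  public static void Main()
  {
    Item dated = Load(
      "<item xmlns:dc=\"" + Dc + "\"><title>a</title>" +
      "<dc:date>2004-06-18T09:40:00</dc:date></item>");
    Item undated = Load("<item><title>b</title></item>");

    Console.WriteLine(dated.dateSpecified);
    Console.WriteLine(undated.dateSpecified);

    // Setting the value alone is not enough; the flag decides what is written.
    undated.date = new DateTime(2004, 6, 18);
    Console.WriteLine(Save(undated));

    undated.dateSpecified = true;
    Console.WriteLine(Save(undated));
  }
}
```

If you expose the data through properties instead of fields, the setter for the value is a convenient place to flip the matching Specified flag.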

So, to summarize: here's what you need to know to write your own XmlSerializable classes for arbitrary XML data, even when you don't have a schema:

1) Elements map to classes.
2) Simple elements with text only children can map to a class or a property/field.
3) Elements that appear more than once map to arrays/ArrayLists.
4) Attributes map to properties/fields of the class that represents their owner element.
5) Remember, you don't have to deserialize everything.


 


Posted Jun 18 2004, 09:40 AM by tim-ewald

Comments

Chris Sells wrote re: XmlSerializer sans XSD
on 06-28-2004 1:50 PM
Since this was written *just* for me, I have a number of detailed requests so that I can cache a link and always know how to do this thing I've been trying to figure out how to do in .NET for a coupla years now (and have been asking Tim to explain for most of that time : ):

1. At the end of section 2, I'd like to see the minimal .NET code sample that shows me how to deserialize an instance of the Rss type created thus far from an RSS document instance. I'd also like to see how to create an Rss instance and serialize it into legal RSS.

2. After section 3, I'd really love you to pretend that RSS had put item elements into an item element and show me how my class would be written to handle that case (as it is such a common one).

3. What's the XmlIgnore property do?

4. I'd really love to be able to download the sample app you were building in this piece as code that I can run and with which I can play.

5. Question: should I write my XML serialization code using public properties or public fields? I guess if I'm using the same data structure to create XML, a public property that updates <PropertyName>Specified would be useful. Is that taken into account when serializing, e.g. if FooSpecified is false, will Foo be skipped during the write?

6. Of course, I'd love an automated tool that read one or more instances and created this serialization code for me, complete with type inference and intelligence to notice when some data appears in date form, but not always, so the type should be a string, etc.

Thanks!
O(geek) wrote Cool XmlSerializer Features
on 06-28-2004 6:39 PM
Markyologist wrote Tim Ewald's
on 07-27-2004 10:25 AM
Markyologist wrote Tim Ewald's XmlSerializer sans XSD
on 07-27-2004 3:37 PM
Roger Searjeant wrote re: XmlSerializer sans XSD
on 08-05-2004 8:07 AM
Great article: I decided to have a crack at writing something after reading it (and seeing Chris Sells' follow-up comments).
Go here: http://www.searjeant.net/weblog/archives/000065.html for a short piece and a link to the zip.

Is this the kind of thing you/Chris had in mind? Any/all constructive comments would be really welcome.

Cheers,
Roger Searjeant.
Trevor Scurr wrote re: XmlSerializer sans XSD
on 09-03-2004 12:37 AM
Couldn't agree more! I've looked at several .NET examples where metadata or the like is stored in xml files and the programmer has written code to use XML parsing with all its nested case statements etc. And each time I think why don't you just write a serializable class and let the framework do the grunt for you. Particularly as the performance seems very good in usage.

On the issue of performance I was originally concerned that deserializing would come at a cost so I looked at cloning my deserialized object. Guess what - cloning uses serialization/deserialization anyway.

Anyway the prime reason for my post was to say there is a great treatise and reference work on the subject by Christian Schittko here : http://www.topxml.com/xmlserializer/serializer.PDF. Or you can find it as HTML here : http://www.topxml.com/xmlserializer/default.asp

Regards
Trev
mostly harmless wrote SansXSD - updated
on 10-18-2004 5:18 AM
I've updated the code I wrote in response to Tim Ewald's piece on inferring structure from XML documents (read it here). I posted a solution back in August, but it was very rough and ready. This is nicer, but is...
A. Skrobov wrote re: XmlSerializer sans XSD
on 08-29-2005 11:42 AM
Thanks for an excellent article!
Could you please enlighten me, is it possible to load object from an arbitrary XML node (rather than just the root node)? Or do I have to do some XSLT before passing my XML file to XmlSerializer?
Tim wrote re: XmlSerializer sans XSD
on 08-30-2005 5:03 AM
You can deserialize from an arbitrary node, the XmlSerializer doesn't care. This takes some doing if you're using Web services, but it is possible.
Stafford wrote re: XmlSerializer sans XSD
on 05-19-2006 8:54 AM
Thanks for an outstanding article. I am trying to be a code monkey like you. I was able to use it to create a class to deserialize RSS feeds, but for some reason I can't get the "pubDate".
Catalin Manoliu wrote re: XmlSerializer sans XSD
on 02-02-2007 4:37 AM
Why not create the XmlSchema document manually and then use xsd.exe to generate the classes?
