Daniel Cazzulino : XPathNavigatorReader: reading, validating and serializing! (XmlReader/XmlTextReader over XPathNavigator)

There are many reasons why developers don’t use the XPathDocument and XPathNavigator APIs and resort to XmlDocument instead. I outlined some of them, with regard to querying functionality, in my posts about how to take advantage of XPath expression precompilation, and How to get an XmlNodeList from an XPathNodeIterator (reloaded).

XPathNavigator is a far superior way of accessing and querying data because it offers built-in support for XPath querying independently of the store (so every store automatically gains the feature) and, more importantly, because it abstracts the underlying store mechanism, which allows multiple data formats to be accessed consistently. The XML WebData team has seriously optimized the internal storage of XPathDocument, which results in important improvements both in loading time and memory footprint, as well as in general performance. This was possible because the underlying store is completely hidden from the developer behind the XPathNavigator class. Therefore, even the most drastic change in internal representation does not affect current applications.

However, some useful features of the XmlDocument and XmlReader classes are not available. Basically, I’ve created an XmlReader facade over the XPathNavigator class, which allows you to work against either a streaming or a cursor API. I’ll discuss how the missing features are enabled by the use of the new XPathNavigatorReader class, part of the open source Mvp.Xml project.

Examples use an XML document with the structure of the Pubs database.

Serialization as XML

Both the XmlDocument (more properly, the XmlNode) and the XmlReader offer built-in support to get a raw string representing the entire content of any node. XmlNode exposes InnerXml and OuterXml properties, whereas the XmlReader offers ReadInnerXml and ReadOuterXml methods.

Once you go the XPathDocument route, however, you completely lose this feature. The new XPathNavigatorReader is an XmlReader implementation over an XPathNavigator, thus providing the aforementioned ReadInnerXml and ReadOuterXml methods. Basically, you work with the XPathNavigator object, and at the point where you need to serialize it as XML, you simply construct this new reader over it and use it as you would any XmlReader:

XPathDocument doc = new XPathDocument(input);
XPathNavigator nav = doc.CreateNavigator();
// Move navigator, select with XPath, whatever.

XmlReader reader = new XPathNavigatorReader(nav);
// Initialize it.
if (reader.Read())
{
    Console.WriteLine(reader.ReadOuterXml());
    // We can also use reader.ReadInnerXml();
}

Another useful scenario is directly writing a fragment of the document by means of the XmlWriter.WriteNode method:

// Will select the title id.
XPathExpression idexpr = navigator.Compile("string(title_id/text())");

XPathNodeIterator it = navigator.Select("//titles[price > 10]");
while (it.MoveNext())
{
    XmlReader reader = new XPathNavigatorReader(it.Current);

    // Save to a file with the title ID as the name.
    XmlTextWriter tw = new XmlTextWriter(
        (string) it.Current.Evaluate(idexpr) + ".xml",
        System.Text.Encoding.UTF8);

    // Dump it!
    tw.WriteNode(reader, false);
    tw.Close();
}

This code saves each book with a price greater than 10 to a file named after its title id. Note that the reader works within the scope defined by the navigator passed to its constructor, effectively providing a view over a fragment of the entire document. It’s also important to observe that even if an evaluation moves the cursor of the navigator in it.Current, the reader we’re using is not affected, because the constructor clones the navigator up front. Also, note that it’s always a good idea to precompile an expression that is going to be executed repeatedly (ideally, application-wide).
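The two points above (precompiling the expression and cloning before moving) can be sketched with nothing but the standard System.Xml.XPath classes. This is only an illustration: the inline XML and the PrecompileDemo class are made up for the example, and no XPathNavigatorReader is involved.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class PrecompileDemo
{
    // Hypothetical sample data shaped like the Pubs "titles" table.
    public const string Xml =
        "<pubs>" +
        "<titles><title_id>BU1032</title_id><price>19.99</price></titles>" +
        "<titles><title_id>BU2075</title_id><price>2.99</price></titles>" +
        "<titles><title_id>PC1035</title_id><price>22.95</price></titles>" +
        "</pubs>";

    public static List<string> ExpensiveTitleIds(string xml)
    {
        XPathNavigator navigator =
            new XPathDocument(new StringReader(xml)).CreateNavigator();

        // Compile once, evaluate once per selected node.
        XPathExpression idexpr = navigator.Compile("string(title_id/text())");

        List<string> ids = new List<string>();
        XPathNodeIterator it = navigator.Select("//titles[price > 10]");
        while (it.MoveNext())
        {
            // Move a clone around; the iterator's own cursor stays put.
            XPathNavigator clone = it.Current.Clone();
            clone.MoveToFirstChild();

            // it.Current is still positioned on <titles>, so the
            // relative expression resolves as expected.
            ids.Add((string) it.Current.Evaluate(idexpr));
        }
        return ids;
    }

    public static void Main()
    {
        foreach (string id in ExpensiveTitleIds(Xml))
            Console.WriteLine(id);
    }
}
```

Cloning is cheap compared to re-querying, which is presumably why doing it up front in a constructor is a reasonable default.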

XmlSerializer-ready

The reader implements IXmlSerializable, so you can directly return it from web services, for example. You could have a web service returning the result of an XPath query without resorting to hacks like loading XmlDocuments or returning an XML string that will be escaped. XPathDocument is not XML-serializable either. Now you can simply use code like the following:

[WebMethod]
public XPathNavigatorReader GetData()
{
    XPathDocument doc = GetDocument();
    XPathNodeIterator it = doc.CreateNavigator().Select("//titles[title_id='BU2075']");
    if (it.MoveNext())
        return new XPathNavigatorReader(it.Current);

    return null;
}

This web service response will be:

<XPathNavigatorReader>  
 <titles>  
 <title_id>BU2075</title_id>  
 <title>You Can Combat Computer Stress!</title>  
 <type>business </type>  
 <pub_id>0736</pub_id>  
 <price>2.99</price>  
 <advance>10125</advance>  
 <royalty>24</royalty>  
 <ytd_sales>18722</ytd_sales>  
 <notes>The latest medical and psychological techniques for living with the electronic office. Easy-to-understand explanations.</notes>  
 <pubdate>1991-06-30T00:00:00.0000000-03:00</pubdate>  
 </titles>  
</XPathNavigatorReader>  
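To show the general idea behind an IXmlSerializable type that emits a navigator's current node, here is a minimal sketch. It is not the Mvp.Xml class: NavigatorFragment is a hypothetical wrapper, it only covers the writing side, and it relies on XmlWriter.WriteNode(XPathNavigator, bool) and XPathNavigator.SelectSingleNode, which are .NET 2.0+ APIs that were not available in v1.x.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;
using System.Xml.Serialization;
using System.Xml.XPath;

// Hypothetical wrapper used only for illustration.
public class NavigatorFragment : IXmlSerializable
{
    private readonly XPathNavigator _navigator;

    // XmlSerializer requires a public parameterless constructor.
    public NavigatorFragment() { }

    public NavigatorFragment(XPathNavigator navigator)
    {
        // Clone so callers can keep moving their own cursor.
        _navigator = navigator.Clone();
    }

    public XmlSchema GetSchema() { return null; }

    public void ReadXml(XmlReader reader)
    {
        throw new NotSupportedException("This sketch only covers serialization.");
    }

    public void WriteXml(XmlWriter writer)
    {
        // The serializer has already opened the wrapper element named
        // after this type; we only write the node under the cursor.
        if (_navigator != null)
            writer.WriteNode(_navigator, false);
    }

    public static string Serialize(string xml, string xpath)
    {
        XPathNavigator nav =
            new XPathDocument(new StringReader(xml)).CreateNavigator();
        XPathNavigator node = nav.SelectSingleNode(xpath);
        if (node == null) return String.Empty;

        StringWriter sw = new StringWriter();
        new XmlSerializer(typeof(NavigatorFragment))
            .Serialize(sw, new NavigatorFragment(node));
        return sw.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Serialize(
            "<pubs><titles><title_id>BU2075</title_id></titles></pubs>",
            "//titles[title_id='BU2075']"));
    }
}
```

The wrapper element around the fragment comes from the serializer itself, which matches the shape of the response shown above.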

XML Schema Validation

Imagine the following scenario: you are processing a document, where only certain elements and their content need to be validated against an XML Schema, such as the contents of an element inside a soap:Body. If you’re working with an XmlDocument, a known bug in XmlValidatingReader prevents you from doing the following:

XmlDocument doc = GetDocument(); // Get the doc somehow.
XmlNode node = doc.SelectSingleNode("//titles[title_id='BU2075']");
// Create a validating reader for XSD validation.
XmlValidatingReader vr = new XmlValidatingReader(new XmlNodeReader(node));

The validating reader will throw an exception because it expects an instance of an XmlTextReader object. This will be fixed in Whidbey, but no luck for v1.x. You’re forced to do this:

XmlDocument doc = GetDocument(); // Get the doc somehow.
XmlNode node = doc.SelectSingleNode("//titles[title_id='BU2075']");

// Build the reader directly from the XML string taken through OuterXml.
XmlValidatingReader vr = new XmlValidatingReader(
    new XmlTextReader(new StringReader(node.OuterXml)));

Of course, you’re paying the parsing cost twice here. The XPathNavigatorReader, unlike the XmlNodeReader, derives directly from XmlTextReader, so it fully supports fragment validation. You can validate against XML Schemas that only define the node where you’re standing. The following code validates all expensive books with a narrow schema, instead of a full-blown Pubs schema:

// XmlSchema.Read takes a reader (or stream), not a location string.
XmlSchema sch = XmlSchema.Read(new XmlTextReader(expensiveBooksSchemaLocation), null);
// Select expensive books.
XPathNodeIterator it = navigator.Select("//titles[price > 10]");
while (it.MoveNext())
{
    // Create a validating reader over an XPathNavigatorReader for the current node.
    XmlValidatingReader vr = new XmlValidatingReader(new XPathNavigatorReader(it.Current));

    // Add the schema for the current node.
    vr.Schemas.Add(sch);

    // Validate it!
    while (vr.Read()) {}
}

This opens the possibility of modular validation of documents, which is especially useful when you have generic XML processing layers that validate selectively depending on namespaces, for example. What’s more, this feature really starts making the XPathDocument/XPathNavigator combination a more feature-complete alternative to XmlDocument when you only need read-only access to the document.
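For comparison, the same fragment validation idea can be sketched with APIs that shipped after v1.x: XPathNavigator.ReadSubtree() plus a validating wrapper created through XmlReader.Create. This is an assumption-laden illustration, not part of Mvp.Xml: the FragmentValidation class, the inline narrow schema and the sample documents are all made up. ConformanceLevel.Auto is set because the settings wrap an existing reader and should accept whatever conformance level that reader already uses.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Schema;
using System.Xml.XPath;

public static class FragmentValidation
{
    // Hypothetical narrow schema: it only describes a <titles> element.
    public const string Schema =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
        "<xs:element name='titles'><xs:complexType><xs:sequence>" +
        "<xs:element name='title_id' type='xs:string'/>" +
        "<xs:element name='price' type='xs:decimal'/>" +
        "</xs:sequence></xs:complexType></xs:element>" +
        "</xs:schema>";

    public static List<string> Validate(string xml, string xpath)
    {
        XmlSchemaSet schemas = new XmlSchemaSet();
        schemas.Add(null, XmlReader.Create(new StringReader(Schema)));

        List<string> errors = new List<string>();
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ConformanceLevel = ConformanceLevel.Auto;
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas = schemas;
        settings.ValidationEventHandler += (sender, e) => errors.Add(e.Message);

        XPathNavigator nav =
            new XPathDocument(new StringReader(xml)).CreateNavigator();
        XPathNodeIterator it = nav.Select(xpath);
        while (it.MoveNext())
        {
            // Validate only the subtree under the current node.
            using (XmlReader vr = XmlReader.Create(it.Current.ReadSubtree(), settings))
            {
                while (vr.Read()) { }
            }
        }
        return errors;
    }

    public static void Main()
    {
        string xml =
            "<pubs><titles><title_id>BU1032</title_id><price>19.99</price></titles></pubs>";
        Console.WriteLine(Validate(xml, "//titles[price > 10]").Count);
    }
}
```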

Implementation Goodies

If you wonder how I implemented it by deriving from XmlTextReader instead of XmlReader, read on. If you just want to go straight to downloading and using it, you can safely skip this section.

Even in the face of the XmlValidatingReader bug, I found a workaround that works great. Luckily, the XmlTextReader is not a sealed class, so instead of inheriting from XmlReader, I inherited from it. I basically cheat at construction time, passing an empty string to it:

public class XPathNavigatorReader : XmlTextReader
{
    public XPathNavigatorReader(XPathNavigator navigator)
        : base(new StringReader(String.Empty))
    …

Next, I override all the methods which are abstract on the base XmlReader, basically replacing all the functionality from the XmlTextReader. I also replaced the functionality of the ReadInnerXml and ReadOuterXml methods, which are new from the XmlTextReader:

public override string ReadInnerXml()
{
    if (this.Read()) return Serialize();
    return String.Empty;
}

public override string ReadOuterXml()
{
    if (_state != ReadState.Interactive) return String.Empty;
    return Serialize();
}

Both are pass-through methods to Serialize, which performs the actual writing. I think you will be surprised by the following snippet. There’s no interesting or complex code here, and I basically use the same node writing feature I explained above:

private string Serialize()
{
    StringWriter sw = new StringWriter();
    XmlTextWriter tw = new XmlTextWriter(sw);
    tw.WriteNode(this, false);
    sw.Flush();
    return sw.ToString();
}

This is a benefit of having a 100% reader implementation.
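
To make that benefit concrete, here is a small sketch showing that the same WriteNode-based helper works with any XmlReader, not just this one. The ReaderSerializer class and the sample XML are hypothetical; a plain XmlTextReader over a string stands in for the reader.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ReaderSerializer
{
    // Copies the node the reader is positioned on (and its content)
    // into a string, the same way the Serialize method above does.
    public static string Serialize(XmlReader reader)
    {
        StringWriter sw = new StringWriter();
        XmlTextWriter tw = new XmlTextWriter(sw);
        tw.WriteNode(reader, false);
        tw.Flush();
        return sw.ToString();
    }

    // Advances a stock XmlTextReader to the first element with the
    // given name and serializes it.
    public static string OuterXmlOfFirst(string xml, string elementName)
    {
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                return Serialize(reader);
        }
        return String.Empty;
    }

    public static void Main()
    {
        Console.WriteLine(OuterXmlOfFirst("<a><b x=\"1\">hi</b><c/></a>", "b"));
    }
}
```
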
Another interesting thing in the implementation is that the XPathNavigator class provides separate handling of namespace attributes and regular ones (GetAttribute and GetNamespace), unlike the XmlReader, which exposes both simply as attributes. The reader’s MoveToFirstAttribute method checks for both cases, moving either to the first regular attribute or to the first namespace one:

public override bool MoveToFirstAttribute()
{
    if (_isendelement) return false;

    bool moved = _navigator.MoveToFirstAttribute();
    if (!moved)
        moved = _navigator.MoveToFirstNamespace(XPathNamespaceScope.Local);

    if (moved)
    {
        // Escape faked text node for attribute value.
        if (_attributevalueread) _depth--;

        // Reset attribute value read flag.
        _attributevalueread = false;
    }

    return moved;
}

The same work is done in the MoveToNextAttribute:

public override bool MoveToNextAttribute()
{
    bool moved = false;

    if (_navigator.NodeType == XPathNodeType.Attribute)
    {
        moved = _navigator.MoveToNextAttribute();
        if (!moved)
        {
            // We ended regular attributes. Start with namespaces if appropriate.
            _navigator.MoveToParent();
            moved = _navigator.MoveToFirstNamespace(XPathNamespaceScope.Local);
        }
    }
    else if (_navigator.NodeType == XPathNodeType.Namespace)
    {
        moved = _navigator.MoveToNextNamespace(XPathNamespaceScope.Local);
    }

    if (moved)
    {
        // Escape faked text node for attribute value.
        if (_attributevalueread) _depth--;

        // Reset attribute value read flag.
        _attributevalueread = false;
    }

    return moved;
}
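If the attribute/namespace split in XPathNavigator is unfamiliar, the following standalone sketch walks both axes on a plain navigator and flattens them into one list, roughly the view an XmlReader gives you. The AttributeWalk class and the sample XML are made up for illustration, and the sketch assumes the document starts directly with its root element.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class AttributeWalk
{
    public static List<string> ListAttributes(string xml)
    {
        XPathNavigator nav =
            new XPathDocument(new StringReader(xml)).CreateNavigator();
        nav.MoveToFirstChild(); // assumed to be the document element

        List<string> result = new List<string>();

        // Regular attributes first.
        if (nav.MoveToFirstAttribute())
        {
            do { result.Add(nav.Name + "=" + nav.Value); }
            while (nav.MoveToNextAttribute());
            nav.MoveToParent();
        }

        // Then the namespaces declared locally on this element.
        // For a namespace node, LocalName is the prefix and Value the URI.
        if (nav.MoveToFirstNamespace(XPathNamespaceScope.Local))
        {
            do
            {
                string name = nav.LocalName.Length == 0
                    ? "xmlns"
                    : "xmlns:" + nav.LocalName;
                result.Add(name + "=" + nav.Value);
            }
            while (nav.MoveToNextNamespace(XPathNamespaceScope.Local));
            nav.MoveToParent();
        }

        return result;
    }

    public static void Main()
    {
        foreach (string s in ListAttributes("<root xmlns:a='urn:a' id='1' a:x='2'/>"))
            Console.WriteLine(s);
    }
}
```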

I also take into account that the ReadAttributeValue method call causes a reader to be moved into the attribute value, where the current node type usually becomes Text (there’s also the Entity resolution and references stuff). The documentation for the XmlReader.ReadAttributeValue method states that the depth is incremented, so I take that into account too. This is basically a matter of setting a flag:

public override bool ReadAttributeValue()
{
    // If this method hasn't been called yet for the attribute.
    if (!_attributevalueread &&
        (_navigator.NodeType == XPathNodeType.Attribute …

BaseURI
Depth
GetAttribute
IsDefault
Item
LookupNamespace
MoveToElement
MoveToNextAttribute
NamespaceURI
NodeType
QuoteChar
ReadAttributeValue
ResolveEntity
XmlLang

There’s some interesting information here! For example, neither class uses the XmlReader.HasValue property, or the GetAttribute method or indexer (Item) to access attributes. Mostly, access to attributes is done either by calling MoveToFirstAttribute/MoveToNextAttribute or by getting the AttributeCount and later using MoveToAttribute(int index) for each, something like:

for (int i = 0; i < reader.AttributeCount; i++)
{
    reader.MoveToAttribute(i);
    // Do something with it.
}

I’ve seen other attempts at this issue (both for XPathNavigatorReader and NavigatorReader classes) that basically iterate the attributes each time AttributeCount is retrieved, and do the same in MoveToAttribute(i), calling MoveToNextAttribute() repeatedly until the desired index is reached. From the table above, I could see that was a pretty bad idea. Therefore, I store the name and namespace of each attribute of the current node in an ArrayList (so they are accessible by index), cache it, and return its length. When MoveToAttribute(i) is executed, I retrieve the name/namespace combination from the list for the specified index, and simply call the navigator’s native MoveToAttribute method with these parameters. I think this is better, although I haven’t measured the difference.
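The caching idea can be shown in isolation with a small helper over a plain navigator. This is a simplified sketch, not the actual reader code: AttributeIndex is a hypothetical class, it only caches regular attributes (namespace declarations are left out), and the sample XML is invented.

```csharp
using System;
using System.Collections;
using System.IO;
using System.Xml.XPath;

public class AttributeIndex
{
    private readonly XPathNavigator _element;
    // Each entry is a string[] { localName, namespaceURI }.
    private ArrayList _ordered;

    public AttributeIndex(XPathNavigator element)
    {
        _element = element.Clone();
    }

    // Walks the attributes once and caches their names.
    private ArrayList Ordered
    {
        get
        {
            if (_ordered != null) return _ordered;

            _ordered = new ArrayList();
            XPathNavigator attrnav = _element.Clone();
            if (attrnav.MoveToFirstAttribute())
            {
                do
                {
                    _ordered.Add(new string[] { attrnav.LocalName, attrnav.NamespaceURI });
                }
                while (attrnav.MoveToNextAttribute());
            }
            return _ordered;
        }
    }

    public int Count { get { return Ordered.Count; } }

    // Indexed access becomes a list lookup plus one direct move,
    // instead of stepping through attributes one by one.
    public string GetValue(int index)
    {
        string[] entry = (string[]) Ordered[index];
        XPathNavigator nav = _element.Clone();
        return nav.MoveToAttribute(entry[0], entry[1]) ? nav.Value : null;
    }

    public static AttributeIndex ForRoot(string xml)
    {
        XPathNavigator nav =
            new XPathDocument(new StringReader(xml)).CreateNavigator();
        nav.MoveToFirstChild(); // assumed to be the document element
        return new AttributeIndex(nav);
    }

    public static void Main()
    {
        AttributeIndex index = ForRoot("<book id='7' lang='en'/>");
        for (int i = 0; i < index.Count; i++)
            Console.WriteLine(index.GetValue(i));
    }
}
```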

As a final word on the implementation: I reviewed Aaron Skonnard’s attempt at this feature, but I discarded it because it’s XmlReader-based, didn’t handle attribute/namespace attribute manipulation the way I expected, etc. So I decided to just start from scratch. If you look at his code and mine, you’ll see they’re quite different. I recall Don Box did something too, but it was XmlReader-based as well.

As usual, if you just want the full class code to copy-paste into your project, here it is. I strongly encourage you to take a look at the Mvp.Xml project, as there are other cool goodies there!

using System;
using System.Collections;
using System.Collections.Specialized;
using System.IO;
using System.Xml;
using System.Xml.Serialization;
using System.Xml.XPath;

namespace Mvp.Xml.XPath
{
    /// <summary>
    /// Provides an over an
    /// .
    /// </summary>
    ///
    /// Reader is positioned at the current navigator position. Reading
    /// it completely is similar to querying for the
    /// property.
    /// The navigator is cloned at construction time to avoid side-effects
    /// in calling code.
    /// Author: Daniel Cazzulino, kzu@aspnet2.com
    /// See: http://weblogs.asp.net/cazzu/archive/2004/04/19/115966.aspx
    ///
    public class XPathNavigatorReader : XmlTextReader, IXmlSerializable
    {
        #region Fields

        // Cursor that will be moved by the reader methods.
        XPathNavigator _navigator;
        // Cursor remaining in the original position, to determine EOF.
        XPathNavigator _original;
        // Will track whether we're at a faked end element
        bool _isendelement = false;

        #endregion Fields

        #region Ctor

        /// <summary>
        /// Parameterless constructor for XML serialization.
        /// </summary>
        /// Supports the .NET serialization infrastructure. Don't use this
        /// constructor in your regular application.
        [System.ComponentModel.EditorBrowsable(System.ComponentModel.EditorBrowsableState.Never)]
        public XPathNavigatorReader()
        {
        }

        /// <summary>
        /// Initializes the reader.
        /// </summary>
        /// <param name="navigator">The navigator to expose as a reader.</param>
        public XPathNavigatorReader(XPathNavigator navigator)
            : base(new StringReader(String.Empty))
        {
            _navigator = navigator.Clone();
            _original = navigator.Clone();
        }

        #endregion Ctor

        #region Private members

        /// <summary>
        /// Retrieves and caches node positions and their name/ns
        /// </summary>
        private ArrayList OrderedAttributes
        {
            get
            {
                // List contains the following values: string[] { name, namespaceURI }
                if (_orderedattributes != null) return _orderedattributes;

                // Cache attributes position and names.
                // We do this because when an attribute is accessed by index, it's
                // because of a usage pattern using a for loop as follows:
                // for (int i = 0; i < reader.AttributeCount; i++)
                //     Console.WriteLine(reader[i]);

                // Init list.
                _orderedattributes = new ArrayList();

                // Return empty list for end elements.
                if (_isendelement) return _orderedattributes;

                // Add all regular attributes.
                if (_navigator.HasAttributes)
                {
                    XPathNavigator attrnav = _navigator.Clone();
                    _orderedattributes = new ArrayList();
                    if (attrnav.MoveToFirstAttribute())
                    {
                        _orderedattributes.Add(new string[] { attrnav.LocalName, attrnav.NamespaceURI });
                        while (attrnav.MoveToNextAttribute())
                        {
                            _orderedattributes.Add(new string[] { attrnav.LocalName, attrnav.NamespaceURI });
                        }
                    }
                }

                // Add all namespace attributes declared at the current node.
                XPathNavigator nsnav = _navigator.Clone();
                if (nsnav.MoveToFirstNamespace(XPathNamespaceScope.Local))
                {
                    _orderedattributes.Add(new string[] { nsnav.LocalName, XmlNamespaces.XmlNs });
                    while (nsnav.MoveToNextNamespace(XPathNamespaceScope.Local))
                    {
                        _orderedattributes.Add(new string[] { nsnav.LocalName, XmlNamespaces.XmlNs });
                    }
                }

                return _orderedattributes;
            }
        }
        ArrayList _orderedattributes;

        /// <summary>
        /// Returns the XML representation of the current node and all its children.
        /// </summary>
        private string Serialize()
        {
            StringWriter sw = new StringWriter();
            XmlTextWriter tw = new XmlTextWriter(sw);
            tw.WriteNode(this, false);
            sw.Flush();
            return sw.ToString();
        }

        #endregion Private members

        #region Properties

        /// <summary>See </summary>
        public override int AttributeCount
        {
            get
            {
                // When the user requests the attribute count, it's usually to
                // use a for iteration pattern for accessing attributes. Therefore,
                // we force loading the attributes positions to prepare for
                // indexed access to them. This is done in the OrderedAttributes getter.
                return OrderedAttributes.Count;
            }
        }

        /// <summary>See </summary>
        public override string BaseURI
        {
            get { return _navigator.BaseURI; }
        }

        /// <summary>See </summary>
        public override int Depth
        {
            get { return _depth; }
        }
        int _depth = 0;

        /// <summary>See </summary>
        public override bool EOF
        {
            get { return _eof; }
        }
        bool _eof = false;

        /// <summary>See </summary>
        public override bool HasValue
        {
            get
            {
                return (
                    _navigator.NodeType == XPathNodeType.Namespace …

Finally, I imagine you could even think about loading an XmlDocument from an XPathNavigator using the XPathNavigatorReader… although I can’t think of any good reason why you would want to do such a thing :S…

The full project source code can be downloaded from SourceForge.

Enjoy and please give us feedback on the project!

Special credits: the idea of a reader over a navigator isn’t new. Aaron Skonnard did an implementation quite some time ago, as did Don Box (you’ll need to search the page for “XPathNavigatorReader”). Mine is not based on theirs, and has features theirs lack, but they came first, that’s for sure ;).

Check out the Roadmap to high performance XML.

/kzu
