XML to tagtriples (and mapping heuristics)

In the last post I mentioned importing XML into the JAM*VAT tagtriples store. One of JAM*VAT's main features is that it can translate any XML into a tagtriples representation using some simple heuristics. I thought I'd better elaborate on this, especially as I haven't documented the heuristics anywhere.

Before starting, I ought to point out that XML doesn't have any explicit semantic rules associated with it - there's no formula for converting a tree into 'facts', and you certainly can't guarantee to know what an XML tree was intended to denote without some out-of-band information.

However, there are a few common styles for representing data (e.g. RSS 2.0 style, Atom, striped etc.), and with some simple heuristics I've found it's usually possible to make a good [automated] stab at breaking it into statements that the author would agree with.

For example, when creating a simple xml tree:

 <person>
    <name>Phil Dawes</name>
    <email>pdawes@users.sf.net</email>
 </person>

..the semantics I'm attempting to articulate are: There is a person with the name 'Phil Dawes' and email address 'pdawes@users.sf.net'.

The tagtriples statements generated by the heuristics are:

"Phil Dawes"[1] name "Phil Dawes"
"Phil Dawes"[1] email pdawes@users.sf.net
"Phil Dawes"[1] tag person

and I'd certainly agree with those statements. (The symbols are suffixed to show that they mean different things - e.g. separating "Phil Dawes" the name from "Phil Dawes" the person.)

At this stage I feel obliged to give tagtriples a small plug and point out it has a number of properties that give it the ability to transform and hold a useful queryable, browsable representation of any xml (even xhtml).

- Any symbol can be an identifier (not just URIs).
- Statement order in a graph is maintained (as in XML).
- The 'tag' property has loose semantics (compared to rdf:type), making it feasible to generate tag statements automatically (e.g. a container could reasonably be tagged 'people', but you wouldn't want to assert that it was a thing of rdf:type 'people').

(BTW, tag statements are used by JAM*VAT to loosely classify resources when browsing, and are similar to tagging in del.icio.us and flickr. The semantics of the statement can be interpreted as: 'the symbol x is related to the resource, and can be used to help classify its meaning').

OK, so the basic approach is:

1) Break the XML into a graph of anonymous nodes and properties (each XML element and attribute becomes a property).
2) Pick relevant symbols for the anonymous nodes.
3) Pick relevant tags for the nodes.
4) Clean up any redundant statements.

The node symbol picking logic is:

2a) If the XML node has a single child text value, use that as the symbol.
2b) If the XML node contains child nodes, use the first text value occurrence in the immediate child subtree as the symbol.

The node tag picking logic is:

3a) If the XML node has a parent node that is the only child node of its parent, then it is removed as a property and becomes a tag. (This heuristic handles 'striped XML'.)
3b) If the node doesn't have a tag via the above, tag it with the symbol of the parent XML element (i.e. the element becomes both a property and a tag).

A Simple Example

Using the simple XML tree fragment from above:

 <person>
    <name>Phil Dawes</name>
    <email>pdawes@users.sf.net</email>
 </person>

Applying rule (1) you get the following graph: (* = anonymous node)

 * ---person---> * ---name---> * ---_xmlchars--> "Phil Dawes"
                 * ---email---> * ---_xmlchars--> pdawes@users.sf.net

Which in triples is:

_:1 person _:2
_:2 name _:3
_:3 _xmlchars "Phil Dawes"
_:2 email _:4
_:4 _xmlchars pdawes@users.sf.net

(_xmlchars is an artificial property which connects child characters to nodes)
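
To make step (1) a bit more concrete, here's a rough Python sketch of how the triples above could be produced. It isn't the JAM*VAT code: the function name `xml_to_raw_triples` is made up for illustration, attributes are attached directly as property/value pairs (an assumption, since the post only says they become properties), and mixed content (text that follows a child element) is ignored.

```python
import itertools
import xml.etree.ElementTree as ET

def xml_to_raw_triples(xml_text):
    """Step 1 sketch: every element becomes a property between anonymous nodes."""
    counter = itertools.count(1)
    triples = []

    def new_node():
        return f"_:{next(counter)}"

    def visit(element, subject):
        node = new_node()
        triples.append((subject, element.tag, node))
        # Assumption: an attribute becomes a property pointing straight at its value.
        for attr_name, attr_value in element.attrib.items():
            triples.append((node, attr_name, attr_value))
        # Leading character data is linked via the artificial _xmlchars property.
        text = (element.text or "").strip()
        if text:
            triples.append((node, "_xmlchars", text))
        for child in element:
            visit(child, node)

    root = new_node()  # the anonymous node that the document element hangs off
    visit(ET.fromstring(xml_text), root)
    return triples

example = """
<person>
   <name>Phil Dawes</name>
   <email>pdawes@users.sf.net</email>
</person>
"""

for triple in xml_to_raw_triples(example):
    print(*triple)
```

Run against the person fragment, this should yield the same five statements, in the same order, as the listing above.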

Applying rules (2) to generate symbols you get:

_:1 person "Phil Dawes"[1]
"Phil Dawes"[1] name "Phil Dawes"[2]
"Phil Dawes"[2] _xmlchars "Phil Dawes"
"Phil Dawes"[1] email pdawes@users.sf.net[1]
pdawes@users.sf.net[1] _xmlchars pdawes@users.sf.net
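
Again purely as an illustration, rules (2a) and (2b) might be sketched like this in Python. Several things here are my assumptions rather than documented behaviour: the name `pick_symbols`, representing a suffixed symbol such as "Phil Dawes"[1] as the tuple `("Phil Dawes", 1)`, searching the child subtree depth-first in document order, and leaving the root node anonymous because nothing points at it.

```python
def pick_symbols(triples):
    """Step 2 sketch: replace anonymous nodes with symbols taken from their text."""
    children = {}  # node -> list of (property, object) in document order
    for s, p, o in triples:
        children.setdefault(s, []).append((p, o))

    def first_text(node):
        statements = children.get(node, [])
        # 2a: a direct text value is used as the symbol.
        for p, o in statements:
            if p == "_xmlchars":
                return o
        # 2b: otherwise take the first text value found in the child subtree.
        for p, o in statements:
            found = first_text(o)
            if found is not None:
                return found
        return None

    seen = {}    # text -> number of nodes already named after it
    symbol = {}  # anonymous node -> (text, suffix)
    for s, p, o in triples:
        if p == "_xmlchars" or o in symbol:
            continue
        text = first_text(o)
        if text is None:
            continue
        seen[text] = seen.get(text, 0) + 1
        symbol[o] = (text, seen[text])

    # Literal text values (objects of _xmlchars) are left untouched.
    return [
        (symbol.get(s, s), p, o if p == "_xmlchars" else symbol.get(o, o))
        for s, p, o in triples
    ]

raw = [
    ("_:1", "person", "_:2"),
    ("_:2", "name", "_:3"),
    ("_:3", "_xmlchars", "Phil Dawes"),
    ("_:2", "email", "_:4"),
    ("_:4", "_xmlchars", "pdawes@users.sf.net"),
]
symbolised = pick_symbols(raw)
```

The numeric suffix is what keeps "Phil Dawes" the person apart from "Phil Dawes" the name node, as in the listing above.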

And by applying rules (3) you get the 'person' tag:

_:1 person "Phil Dawes"[1]
"Phil Dawes"[1] name "Phil Dawes"[2]
"Phil Dawes"[2] _xmlchars "Phil Dawes"
"Phil Dawes"[1] email pdawes@users.sf.net[1]
pdawes@users.sf.net[1] _xmlchars pdawes@users.sf.net
"Phil Dawes"[1] tag person
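
A small Python sketch of rule (3b) follows, with a loud caveat: the example only shows a tag on the person node, so the sketch assumes that nodes holding nothing but text (the name and email nodes here) are not tagged. That restriction is my guess, not something stated in the heuristics. The function name `add_tags` and the tuple form of suffixed symbols are also just illustrative, and rule (3a) for striped XML isn't covered.

```python
def add_tags(triples):
    """Rule 3b sketch: tag a node with the property (element name) pointing at it."""
    # Assumption: only nodes that have properties other than _xmlchars get tagged.
    non_leaf = {s for s, p, o in triples if p != "_xmlchars"}
    tags = []
    for s, p, o in triples:
        if p != "_xmlchars" and o in non_leaf:
            tags.append((o, "tag", p))
    # Tag statements are appended after the existing statements.
    return triples + tags

person = ("Phil Dawes", 1)
name = ("Phil Dawes", 2)
email = ("pdawes@users.sf.net", 1)
symbolised = [
    ("_:1", "person", person),
    (person, "name", name),
    (name, "_xmlchars", "Phil Dawes"),
    (person, "email", email),
    (email, "_xmlchars", "pdawes@users.sf.net"),
]
tagged = add_tags(symbolised)
```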

Finally, it applies heuristics to trim the redundant statements. In this case it's the _xmlchars ones, because the <name> and <email> XML elements in the original document don't have any attributes or child elements, and only have a single text value child.

_:1 person "Phil Dawes"[1]
"Phil Dawes"[1] name "Phil Dawes"
"Phil Dawes"[1] email pdawes@users.sf.net
"Phil Dawes"[1] tag person
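
The clean-up for this particular case could be sketched in Python roughly as below. It only covers the one situation described here (a node whose sole statement is a single _xmlchars value); whatever other trimming JAM*VAT does isn't shown, and the function name `collapse_text_only_nodes` and the tuple form of suffixed symbols are assumptions made for the example.

```python
def collapse_text_only_nodes(triples):
    """Step 4 sketch: fold nodes that only carry text into plain literal values."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, []).append((p, o))

    # A node qualifies if its only statement is a single _xmlchars value.
    literal = {
        s: statements[0][1]
        for s, statements in by_subject.items()
        if len(statements) == 1 and statements[0][0] == "_xmlchars"
    }

    cleaned = []
    for s, p, o in triples:
        if s in literal:
            continue  # drop the now-redundant _xmlchars statement
        cleaned.append((s, p, literal.get(o, o)))
    return cleaned

person = ("Phil Dawes", 1)
name = ("Phil Dawes", 2)
email = ("pdawes@users.sf.net", 1)
tagged = [
    ("_:1", "person", person),
    (person, "name", name),
    (name, "_xmlchars", "Phil Dawes"),
    (person, "email", email),
    (email, "_xmlchars", "pdawes@users.sf.net"),
    (person, "tag", "person"),
]
final = collapse_text_only_nodes(tagged)
```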

And that's it! BTW, you can experiment with importing XML into JAM*VAT. Unfortunately there's no way to cut-n-paste XML trees into it yet, so you have to find XML trees on the web. RSS and Atom work well.