
In the last post I mentioned importing XML into the JAM*VAT tagtriples store. One of JAM*VAT’s main features is that it can translate *any* XML into a tagtriples representation using some simple heuristics. I thought I’d better elaborate on this, especially as I haven’t documented the heuristics anywhere.

Before starting, I ought to point out that XML doesn’t have any explicit semantic rules associated with it - there’s no formula for converting a tree into ‘facts’, and you certainly can’t guarantee to know what an XML tree was intended to denote without some out-of-band information.

However, there are a few common styles for representing data (e.g. rss2.0 style, atom, striped etc.) and with some simple heuristics I’ve found it’s usually possible to make a good [automated] stab at breaking it into statements that the author would agree with.

For example, when creating a simple xml tree:

 <person>
    <name>Phil Dawes</name>
    <email>pdawes@users.sf.net</email>
 </person>

..the semantics I’m attempting to articulate are: There is a person with the name ‘Phil Dawes’ and email address ‘pdawes@users.sf.net’.

The tagtriples statements generated by the heuristics are:

"Phil Dawes"[1] name "Phil Dawes"
"Phil Dawes"[1] email pdawes@users.sf.net
"Phil Dawes"[1] tag person

and I’d certainly agree with those statements.
(the symbols are suffixed to show that they mean different things - e.g. separating “Phil Dawes” the name from “Phil Dawes” the person)

At this stage I feel obliged to give tagtriples a small plug and point out it has a number of properties that give it the ability to transform and hold a useful queryable, browsable representation of any xml (even xhtml).

- any symbol can be an identifier (not just URIs)
- statement order in a graph is maintained (as in XML)
- the ‘tag’ property has loose semantics (compared to rdf:type), making it feasible to generate tag statements automatically.
(e.g. a container could be reasonably tagged ‘people’, but you wouldn’t want to assert that it was a thing of rdf:type ‘people’).

(BTW, tag statements are used by JAM*VAT to loosely classify resources when browsing, and are similar to tagging in del.icio.us and flickr. The semantics of the statement can be interpreted as: ‘the symbol x is related to the resource, and can be used to help classify its meaning’).
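
To make the browsing use of tag statements a little more concrete, here is a minimal Python sketch that groups resources by their tags. It is only an illustration: the function name `resources_by_tag` and the representation of statements as `(subject, property, object)` tuples are assumptions, not JAM*VAT's actual data model.

```python
from collections import defaultdict

def resources_by_tag(statements):
    """Group subjects by the symbols they are tagged with.

    `statements` is an ordered list of (subject, property, object) tuples.
    Statement order is preserved within each group.
    """
    index = defaultdict(list)
    for subject, prop, obj in statements:
        # Only 'tag' statements contribute to the loose classification.
        if prop == "tag" and subject not in index[obj]:
            index[obj].append(subject)
    return dict(index)

statements = [
    ('"Phil Dawes"[1]', "name", '"Phil Dawes"'),
    ('"Phil Dawes"[1]', "email", "pdawes@users.sf.net"),
    ('"Phil Dawes"[1]', "tag", "person"),
]
people = resources_by_tag(statements)["person"]
```

Because a tag only says ‘this symbol helps classify the resource’, an index like this can be built from automatically generated statements without claiming anything as strong as a type.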

OK, so the basic approach is:

1) Break the XML into a graph of anonymous nodes and properties.
(each XML element and attribute becomes a property)
2) Pick relevant symbols for the anonymous nodes
3) Pick relevant tags for the nodes
4) Clean up any redundant statements

The node symbol picking logic is:
2a) If the xml node has a single child text value, use that as the symbol.
2b) If the xml node contains child nodes, use the first text value occurrence in the immediate child subtree as the symbol.
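
The symbol picking rules can be sketched in Python roughly as follows. This is one reading of rules 2a and 2b, not JAM*VAT's code: the names `pick_symbol` and `assign_symbols` are hypothetical, ‘first text value occurrence’ is interpreted as a depth-first search through the children, and the numeric suffix is assumed to be a per-text counter assigned in document order.

```python
import xml.etree.ElementTree as ET

def pick_symbol(element):
    """Return the text used as the symbol for an element's node, or None."""
    text = (element.text or "").strip()
    if text:
        # Rule 2a: the element's own text value is the symbol.
        return text
    # Rule 2b (as interpreted here): first text value found in the children,
    # searching depth-first in document order.
    for child in element:
        symbol = pick_symbol(child)
        if symbol is not None:
            return symbol
    return None

def assign_symbols(root):
    """Return (tag, text, suffix) for each element, in document order.

    The suffix keeps apart nodes that share the same text, e.g.
    "Phil Dawes"[1] the person and "Phil Dawes"[2] the name.
    """
    counts = {}
    symbols = []
    for element in root.iter():
        text = pick_symbol(element)
        if text is None:
            continue
        counts[text] = counts.get(text, 0) + 1
        symbols.append((element.tag, text, counts[text]))
    return symbols

root = ET.fromstring(
    "<person><name>Phil Dawes</name>"
    "<email>pdawes@users.sf.net</email></person>"
)
symbols = assign_symbols(root)
```

For the person example this yields the suffix 1 for the `person` node, 2 for the `name` node and 1 for the `email` node.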

The node tag picking logic is:
3a) If the xml node has a parent node that is the only child node of its parent, then it is removed as a property and becomes a tag. (This heuristic handles ‘striped xml’)
3b) If the node doesn’t have a tag via the above, tag it with the symbol of the parent xml element. (i.e. the element becomes a property and a tag).

A Simple Example
----------------

Using the simple XML tree fragment from above:

 <person>
    <name>Phil Dawes</name>
    <email>pdawes@users.sf.net</email>
 </person>

Applying rule (1) you get the following graph: (* = anonymous node)

 * ---person---> * ---name---> * ---_xmlchars--> "Phil Dawes"
                 * ---email---> * ---_xmlchars--> pdawes@users.sf.net

Which in triples is:

_:1 person _:2
_:2 name _:3
_:3 _xmlchars "Phil Dawes"
_:2 email _:4
_:4 _xmlchars pdawes@users.sf.net

(_xmlchars is an artificial property which connects child characters to nodes)
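
As a rough illustration of rule (1), here is a minimal Python sketch that walks an XML tree and emits one statement per element, using fresh anonymous nodes. It is a sketch under assumptions rather than JAM*VAT's implementation: the function name `xml_to_triples` is hypothetical, attribute values are assumed to attach directly to the element's node, and text that follows a child element (tail text) is ignored.

```python
import itertools
import xml.etree.ElementTree as ET

def xml_to_triples(xml_text):
    """Rule (1): each element becomes a property pointing at a new anonymous node."""
    counter = itertools.count(1)
    triples = []

    def new_node():
        return "_:%d" % next(counter)

    def walk(element, parent):
        node = new_node()
        triples.append((parent, element.tag, node))
        # Assumption: an attribute becomes a property whose value is the literal.
        for name, value in element.attrib.items():
            triples.append((node, name, value))
        # Child characters hang off the node via the artificial _xmlchars property.
        text = (element.text or "").strip()
        if text:
            triples.append((node, "_xmlchars", text))
        for child in element:
            walk(child, node)

    root = ET.fromstring(xml_text)
    walk(root, new_node())
    return triples

triples = xml_to_triples(
    "<person><name>Phil Dawes</name>"
    "<email>pdawes@users.sf.net</email></person>"
)
```

Statements are appended in document order, so the resulting list lines up with the five triples listed above.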

Applying rules (2) to generate symbols you get:

_:1 person "Phil Dawes"[1]
"Phil Dawes"[1] name "Phil Dawes"[2]
"Phil Dawes"[2] _xmlchars "Phil Dawes"
"Phil Dawes"[1] email pdawes@users.sf.net[1]
pdawes@users.sf.net[1] _xmlchars pdawes@users.sf.net

And by applying rules (3) you get the ‘person’ tag:

_:1 person "Phil Dawes"[1]
"Phil Dawes"[1] name "Phil Dawes"[2]
"Phil Dawes"[2] _xmlchars "Phil Dawes"
"Phil Dawes"[1] email pdawes@users.sf.net[1]
pdawes@users.sf.net[1] _xmlchars pdawes@users.sf.net
"Phil Dawes"[1] tag person

Finally, it applies heuristics to trim the redundant statements.
In this case it’s the _xmlchars ones, because the <name> and <email> xml elements in the original document don’t have any attributes or child elements, and only have a single text value child.

_:1 person "Phil Dawes"[1]
"Phil Dawes"[1] name "Phil Dawes"
"Phil Dawes"[1] email pdawes@users.sf.net
"Phil Dawes"[1] tag person
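
The trimming step for this example can be sketched in Python as below. It is an assumption-laden illustration, not the actual clean-up code: `trim_xmlchars` is a hypothetical name, statements are plain `(subject, property, object)` string tuples, and a node counts as redundant only when its single outgoing statement is an `_xmlchars` one.

```python
def trim_xmlchars(triples):
    """Collapse nodes whose only statement is a single _xmlchars value."""
    # Group outgoing (property, object) pairs by subject.
    outgoing = {}
    for subject, prop, obj in triples:
        outgoing.setdefault(subject, []).append((prop, obj))

    # A node with nothing but one _xmlchars statement stands for that literal.
    literal_of = {
        subject: stmts[0][1]
        for subject, stmts in outgoing.items()
        if len(stmts) == 1 and stmts[0][0] == "_xmlchars"
    }

    trimmed = []
    for subject, prop, obj in triples:
        if subject in literal_of:
            continue  # drop the redundant _xmlchars statement itself
        # Point at the literal instead of the collapsed node.
        trimmed.append((subject, prop, literal_of.get(obj, obj)))
    return trimmed

before = [
    ("_:1", "person", '"Phil Dawes"[1]'),
    ('"Phil Dawes"[1]', "name", '"Phil Dawes"[2]'),
    ('"Phil Dawes"[2]', "_xmlchars", '"Phil Dawes"'),
    ('"Phil Dawes"[1]', "email", "pdawes@users.sf.net[1]"),
    ("pdawes@users.sf.net[1]", "_xmlchars", "pdawes@users.sf.net"),
    ('"Phil Dawes"[1]', "tag", "person"),
]
after = trim_xmlchars(before)
```

Run on the six statements from the previous step, this leaves the four statements shown above, in the same order.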

And that’s it!
BTW, you can experiment with importing XML into JAM*VAT. Unfortunately there’s no way to cut-n-paste xml trees into it yet, so you have to find XML trees on the web. RSS and atom work well.

Viewing 1 Comment

    Simplification of RDF/XML without URI constraint for subject, predicate or objects is a good idea. But I have to disagree on this:

    "statement order in a graph is maintained"

    Because there is little semantic to be found in a sequence of XML elements and it's a big restriction on its practical usage.

    An XML element tree can represent sets of hierarchical relationships, display ordered sequence, but making an assumption about the meaning of a sequence restricts the liberty of the author to change this order of elements. That's exactly why XML is so popular: it is less complex than SGML and provides a more flexible text format than *fixed* position records or flat database interchange formats.

    The first example XML string with your name and e-mail address could be rewritten as:


    pdawes@users.sf.net


    to explicitly assign the name "Phil Dawes" as the context of the e-mail address.

    Or vice-versa, using a well-supported URL,

    Phil Dawes

    A metabase (a semantic server, a metadata management system) should not re-articulate the information submitted. It is up to the application or its users to produce well articulated statements.

    By the way, I tried to send you a private mail, but that did not work out so well. The mailer at sf.net bounced it back.

    Kind Regards,