Dark side of the semantic web

I haven't said much about semantic web stuff for a while, as I've been occupied with other things. However, Jim Hendler's 'Tales from the Dark Side' piece in IEEE Intelligent Systems reawoke an old interest. In short: I still think the RDF people have got it wrong with URIs, and so far nobody's convinced me otherwise.

My (same old) argument: URIs are bad for large-scale interoperability. The alternative: just use words and symbols occurring in real life, and use the context inherent in the communication to disambiguate meaning.

The interesting thing about the Hendler piece is that it pretty much walks through the arguments I make for dropping URIs, but then avoids the conclusion:

"If you and I decide that we will use the term "http://www.cs.rpi.edu/~hendler/elephant" to designate some particular entity, then it really doesn't matter what the other blind men think it is, they won't be confused when they use the natural language term "Elephant" which is not even close, lexigraphically, to the longer term you and I are using. And if they choose to use their own URI, "http://www.other.blind.guys.org/elephant" it won't get confused with ours. The trick comes, of course, as we try to make these things more interoperable. It would be nice if someone from outside could figure out, and even better assert in some machine-readable way, that these two URIs were really designating the same thing - or different things, or different parts of the same thing, or ... ooops, notice how quickly we're on that slippery slope."

And this neatly sums up the situation with URIs. The low chance of collision represents a tradeoff: You get a high level of semantic precision - it's extremely unlikely that two parties will use the same URI to mean two totally unconnected things. You also get a very low level of semantic interoperability: it's equally unlikely that two unconnected parties will use the same URI to denote (even parts of!) the same thing.

Now I think the precision part is overrated - disambiguation of natural language terms can be tractably (and often trivially) achieved using contextual cues. However, interoperability of data from unconnected sources is really hard, and that's why I think this is a bad tradeoff.

Anyway, the crux of the Hendler piece is that for all the high level work going on in Semantic Web land (ontology languages, description logic), it's currently simple interoperability mechanisms that gain the most traction and add the most value: 'a little semantics goes a long way'.

The piece implies (afaics) that this is where effort should be directed, and cites the example of matching FOAF data using email addresses as an illustration of the potential success of this approach. The matching heuristic is: if two FOAF resources are describing people with the same email address, they're very likely to be about the same person.
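To make the heuristic concrete, here is a rough sketch in Python. It is not taken from the Hendler piece or from any FOAF tool: the record layout, the function names and the example addresses are all made up for illustration, and treating addresses as case-insensitive is a simplifying assumption of mine.

```python
from collections import defaultdict


def normalise_mbox(mbox):
    """Lower-case an email address and drop any 'mailto:' prefix.

    Assumption: addresses are compared case-insensitively, which is a
    simplification rather than anything FOAF itself specifies.
    """
    mbox = mbox.strip().lower()
    if mbox.startswith("mailto:"):
        mbox = mbox[len("mailto:"):]
    return mbox


def merge_by_mbox(descriptions):
    """Group person descriptions that share an email address.

    Each description is a plain dict with an 'mbox' key (a hypothetical
    layout). The result maps each address to the set of values seen for
    every other property.
    """
    merged = defaultdict(dict)
    for desc in descriptions:
        key = normalise_mbox(desc["mbox"])
        for prop, value in desc.items():
            if prop != "mbox":
                merged[key].setdefault(prop, set()).add(value)
    return dict(merged)


# Three descriptions from unconnected (invented) sources.
descriptions = [
    {"source": "site-a", "mbox": "mailto:alice@example.org", "name": "Alice"},
    {"source": "site-b", "mbox": "Alice@Example.org",
     "homepage": "http://example.org/~alice"},
    {"source": "site-c", "mbox": "mailto:bob@example.org", "name": "Bob"},
]

people = merge_by_mbox(descriptions)
for mbox, props in sorted(people.items()):
    print(mbox, sorted(props["source"]))
```

The two descriptions of Alice end up under one key even though their sources never agreed on an identifier for her. The email address does all the work.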

My experience concurs with the 'a little semantics goes a long way' sentiment, but personally I think FOAF has succeeded (for some measure of success) not because of RDF but in spite of it. I'd argue that the only reason the email matching works on a large scale is because email addresses are already concrete symbols grounded in the real world. FOAF didn't create them, it just provided a context for their use. FOAF's formal semantics certainly didn't create this interoperability - the largest example of FOAF data is scraped from LiveJournal's databases, where the users creating the data have little concept of the ramifications of the 'http://xmlns.com/foaf/0.1/mbox' property.

If FOAF had to rely on artificial URIs as the sole means for identifying people it would struggle to gain any traction in the messy real world of the web.

On the flip side, however, I think FOAF would work just as well (and gain a lot more traction) if its underlying model didn't employ URIs at all and instead just used triples of words/symbols. Semantic web software would still be able to identify and index FOAF data: the symbol 'FOAF' is pretty unambiguous on its own, but even if it weren't, the juxtaposition of the symbol FOAF with properties like 'mbox', 'surname' etc. would suffice for pretty accurate disambiguation.
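As a toy illustration of what I mean by disambiguating on juxtaposition, here is a sketch in Python. Everything in it is hypothetical: the two contexts, their cue words (including a made-up postal sense of 'mbox') and the scoring rule are invented for the example, and a real system would want something far less crude than counting overlaps.

```python
# Hypothetical contexts, each described by the plain symbols that tend
# to turn up in it. 'mbox' is deliberately ambiguous between the two.
CONTEXTS = {
    "person description": {"foaf", "mbox", "surname", "knows"},
    "postal service": {"mbox", "street", "postcode", "collection"},
}


def guess_context(triples):
    """Guess which context a bundle of word triples belongs to.

    Every part of every (subject, property, value) triple is treated as
    a plain symbol; the context sharing the most symbols wins. Ties go
    to whichever context is listed first.
    """
    symbols = {str(part).lower() for triple in triples for part in triple}
    scores = {name: len(symbols & cues) for name, cues in CONTEXTS.items()}
    best = max(scores, key=scores.get)
    return best, scores


person_triples = [
    ("doc", "format", "FOAF"),
    ("alice", "mbox", "alice@example.org"),
    ("alice", "surname", "Smith"),
]

postal_triples = [
    ("box17", "mbox", "red"),
    ("box17", "street", "High Street"),
    ("box17", "postcode", "AB1 2CD"),
]

print(guess_context(person_triples)[0])
print(guess_context(postal_triples)[0])
```

No URIs anywhere, yet the same bare symbol 'mbox' is read differently in each bundle because of the company it keeps.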