Global identifier schemes don’t scale II

Thanks to everybody who commented on my 'global identifiers don't scale' post (and especially to Seth and John for their responses). It's obvious that I didn't make myself very clear - sorry about that.

I think I made two main points:

The first was the scalability point (one I failed to make very well at all):

By 'global identifier schemes don't scale', I was referring to the scalability of the global identity approach itself. As the system gets bigger and less consistent, global identifiers can no longer be guaranteed to identify things unambiguously across the system, and they end up being augmented with a secondary mechanism - context.

I.e. global ids don't scale, but global ids + context do.

I think this is something we agree on. In the case of large RDF installations, the universal resource identifier (URI) system fails to unambiguously identify resources universally, and so in RDF store software it is augmented with a context/provenance system (e.g. quads). This isn't a theoretical issue with the model, it's a practical one to do with the problems of maintaining identifier consistency in a decentralised universe.
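As a rough illustration of the 'global ids + context' idea, here is a minimal Python sketch of a quad-style lookup. It isn't taken from any particular RDF store: the `Quad` tuple, the `lookup` function and the example.org URIs are all made up for the example.

```python
from collections import namedtuple

# A quad is a triple plus the context (source graph) it was asserted in.
# The field names here are illustrative, not any real store's API.
Quad = namedtuple("Quad", ["subject", "predicate", "obj", "context"])

# Hypothetical data: two uncoordinated sources use the same URI and disagree.
store = [
    Quad("http://example.org/id/987", "price", "12.50",
         "http://example.org/graphs/catalogue"),
    Quad("http://example.org/id/987", "price", "99.00",
         "http://example.org/graphs/partner-feed"),
]


def lookup(store, subject, predicate, context=None):
    """Return matching object values, optionally restricted to one context."""
    return [
        q.obj
        for q in store
        if q.subject == subject
        and q.predicate == predicate
        and (context is None or q.context == context)
    ]


# The URI alone gives two conflicting answers...
without_context = lookup(store, "http://example.org/id/987", "price")

# ...while the URI plus a context gives one.
with_context = lookup(
    store,
    "http://example.org/id/987",
    "price",
    context="http://example.org/graphs/catalogue",
)
```

The point is only that the fourth field is doing real disambiguation work that the identifier was supposed to do by itself.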

The second point was a more contentious one: that a better approach is to relax the 'universally unambiguous' constraint for identifiers and embrace context for disambiguation (which we are forced to do to some degree anyway).

This is contentious mainly because it questions the utility of URIs, which are the root of the W3C semantic web. (In the rest of this post I'm going to address URIs specifically, because it's easier to talk about specifics than generalities. However, I do think these arguments also apply to any global identifier scheme employed for a large decentralized information system.)

OK, let's start with what we've got: RDF URIs yield identities that we can pretty much guarantee to be consistent across an individual document or communication, but that are only 'probably' consistent with uses in other communications (documents/graphs) due to the decentralized nature of the semantic web. That probability can be evaluated based on the contents, provenance and context of the communication.

Now the scheme I'm advocating is one with relaxed constraints on universality, scoping identifiers at the document or communication level instead. This approach really differs only in the probability of collision across graphs: i.e. we can pretty much guarantee the identifiers to be consistent across an individual document or communication (we can make this a requirement), but the likelihood that they are consistent with terms in another document relies much more heavily on the evaluation of the contents, provenance and context of the communication.

So my argument hinges on whether the benefit of a much higher (but still not guaranteed) probability of identifier consistency is worth the high cost of using URIs. By way of 'costs', I listed some downsides in the last post which I won't repeat here. Suffice it to say that I think these costs are crippling the uptake of the W3C semweb technologies.

On to the million-dollar question: How do you deal with disambiguation in a world where the identifiers aren't even remotely guaranteed to be consistent across documents? The answer turns out to be simple - you use a bunch of identifiers in combination to disambiguate the term. John Barstow's question provides a convenient example:

"As for Phil's belief that you can use database primary keys without worrying about global namespace collision - how do you do that again when a record has different ids in different databases? Oh, you prepend a namespace so I know which database I’m talking about?"

Actually no, you can disambiguate with extra terms. For example, the query:

'What is the price of product '987' from the foobah products database?'

can be pretty unambiguous to software agents of the foobah company. Here the terms '987', 'product', 'price' and 'foobah products database' can all be used to narrow the scope and identify a resource.

This is essentially 'identity by description', and it is a system which scales pretty well in both directions: communication localized to a narrow context requires a minimal number of identifying terms, which keeps it simple (e.g. '987'), while wider-context communication can use more terms to increase accuracy over the possible matches/meanings.
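To make 'identity by description' a bit more concrete, here is a small Python sketch under some loud assumptions: the records, field names (`id`, `type`, `source`) and the `resolve` helper are all hypothetical, and real matching would weigh provenance and context rather than demand exact equality.

```python
# Hypothetical records gathered from several uncoordinated sources.
RECORDS = [
    {"id": "987", "type": "product", "source": "foobah products database",
     "price": "12.50"},
    {"id": "987", "type": "customer", "source": "foobah crm",
     "name": "A. Customer"},
    {"id": "987", "type": "product", "source": "other company catalogue",
     "price": "3.00"},
]


def resolve(records, **description):
    """Return every record whose fields match all of the given terms."""
    return [
        r for r in records
        if all(r.get(key) == value for key, value in description.items())
    ]


# Narrow context: inside the foobah products database, '987' alone is enough.
local_records = [r for r in RECORDS
                 if r["source"] == "foobah products database"]
narrow = resolve(local_records, id="987")

# Wide context: '987' by itself is ambiguous across all sources...
ambiguous = resolve(RECORDS, id="987")

# ...so more terms are added until a single candidate remains.
precise = resolve(RECORDS, id="987", type="product",
                  source="foobah products database")
```

Here `narrow` and `precise` each pick out the same single record, while `ambiguous` returns all three candidates. The number of terms needed grows with the width of the context, not with the size of some global namespace.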

An advantage of a model based on such a scheme is that it can support existing data automatically - it removes the need to create new 'global' identifiers for each thing being imported. Also, by implicitly allowing existing established identity schemes, you get a good chance of serendipitous term sharing between documents - a high chance of being able to usefully merge data that was created without coordination.
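A sketch of what that kind of uncoordinated merge might look like in Python follows. The two documents, the ISBN-like values and the `merge_on` helper are invented for illustration, and it assumes the reader has already judged (from content, provenance and context) that both documents mean the same thing by the shared term.

```python
# Two hypothetical documents created without coordination. Both happen to
# reuse an existing, established identifier (an ISBN-like string) rather
# than minting new global ones. The values are placeholders, not real ISBNs.
doc_a = [
    {"isbn": "0-00-000000-0", "title": "Example Book"},
]
doc_b = [
    {"isbn": "0-00-000000-0", "price": "9.99"},
    {"isbn": "1-11-111111-1", "price": "4.50"},
]


def merge_on(key, *documents):
    """Combine records from several documents that share a value for `key`."""
    merged = {}
    for document in documents:
        for record in document:
            # Later documents add to (or overwrite) fields from earlier ones.
            merged.setdefault(record[key], {}).update(record)
    return merged


merged = merge_on("isbn", doc_a, doc_b)
```

After the merge, the record for the shared identifier carries both the title from the first document and the price from the second, with no import step to assign new identifiers.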

So I would urge Semantic Web practitioners to consider the possibility of relaxing the 'universal identifier' constraints on identity: you still get to keep precision, but you also get vastly increased simplicity, lower barriers to entry and trivial access to all the existing data on the web! That's a pretty good trade in my opinion.