Global identifier schemes don’t scale II
Mar 9th, 2006 by Phil Dawes
Thanks to everybody who commented on my ‘global identifiers don’t scale’ post (and especially to Seth and John for their responses). It’s obvious that I didn’t make myself very clear - sorry about that.
I think I made two main points:
The first was the scalability point (one I failed to make very well at all):
By ‘global identifier schemes don’t scale’, I was referring to the scalability of the global identity approach itself. As the system gets bigger and less consistent, the global identifiers can no longer be guaranteed to identify things unambiguously across the system, and they end up being augmented with a secondary mechanism - context.
I.e. global ids don’t scale, but global ids + context do.
I think this is something we agree on. In the case of large RDF installations, the universal resource identifier (URI) system fails to unambiguously identify resources universally, and so in RDF store software it is augmented with a context/provenance system (e.g. quads).
This isn’t a theoretical issue with the model, it’s a practical one to do with the problems of maintaining identifier consistency in a decentralised universe.
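To make the ‘global ids + context’ point concrete, here is a minimal sketch of what quads buy you over plain triples. It is not taken from any particular RDF store: the URIs, graph names and helper functions are all invented for illustration, and a statement is modelled as a plain tuple.

```python
from collections import defaultdict

# Hypothetical identifiers, invented for this example only.
JAGUAR = "http://example.org/id/jaguar"
ZOO = "http://example.org/graphs/zoo"
DEALER = "http://example.org/graphs/dealer"

# Each statement is (subject, predicate, object, context). Two publishers
# have used the same URI for different things without coordinating.
QUADS = [
    (JAGUAR, "type", "Animal", ZOO),
    (JAGUAR, "type", "Car", DEALER),
]


def triples(quads):
    """Drop the context, as a plain triple store would."""
    return {(s, p, o) for s, p, o, _ in quads}


def by_context(quads):
    """Group statements by the context (graph) they came from."""
    grouped = defaultdict(set)
    for s, p, o, c in quads:
        grouped[c].add((s, p, o))
    return dict(grouped)


def objects(triple_set, subject, predicate):
    """All objects asserted for a subject/predicate pair, sorted."""
    return sorted(o for s, p, o in triple_set if s == subject and p == predicate)


# Without context the URI is ambiguous; within one context it is not.
merged = objects(triples(QUADS), JAGUAR, "type")
scoped = objects(by_context(QUADS)[ZOO], JAGUAR, "type")
```

Here `merged` contains both `Animal` and `Car`, while `scoped` contains only `Animal`: the identifier alone does not settle what is meant, but the identifier plus its context does.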
The second point was a more contentious one: that a better approach is to relax the ‘universally unambiguous’ constraint for identifiers and embrace context for disambiguation (which we are forced to do to some degree anyway).
This is contentious mainly because it questions the utility of URIs, which are the root of the W3C semantic web. (In the rest of this post I’m going to address URIs specifically, because it’s easier to talk about specifics than generalities. However, I do think these arguments also apply to any global identifier scheme employed for a large decentralized information system.)
OK, let’s start with what we’ve got:
RDF URIs yield identities that we can pretty much guarantee to be consistent across an individual document or communication, but that are only ‘probably’ consistent with uses in other communications (documents/graphs) due to the decentralized nature of the semantic web. That probability can be evaluated based on the contents, provenance and context of the communication.
Now the scheme I’m advocating is one with relaxed constraints on universality, scoping identifiers at the document or communication level instead.
This approach really differs only in the probability of collision across graphs. That is, we can pretty much guarantee the identifiers to be consistent across an individual document or communication (we can make this a requirement), but the likelihood that they are consistent with terms in another document relies much more heavily on the evaluation of the contents, provenance and context of the communication.
So my argument hinges on whether the benefit of a much higher (but not guaranteed) probability of identifier consistency is worth the high cost of using URIs. By way of ‘costs’, I listed some downsides in the last post which I won’t repeat here. Suffice it to say that I think these costs are crippling the uptake of the W3C semweb technologies.
On to the million dollar question: How do you deal with disambiguation in a world where the identifiers aren’t even remotely guaranteed to be consistent across documents?
The answer turns out to be simple - you use a bunch of identifiers in union to disambiguate the term. John Barstow’s question provides a convenient example:
“As for Phil’s belief that you can use database primary keys without worrying about global namespace collision - how do you do that again when a record has different ids in different databases? Oh, you prepend a namespace so I know which database I’m talking about?”
Actually no, you can disambiguate with extra terms:
E.g. the query:
‘What is the price of product ‘987’ from the foobah products database?’
can be pretty unambiguous to software agents of the foobah company. Here the terms ‘987’, ‘product’, ‘price’ and ‘foobah products database’ can all be used to narrow the scope and identify a resource.
This is essentially ‘identity by description’, and it is a system which scales pretty well in both directions: communication localized to a narrow context requires a minimal number of identifying terms and stays simple (e.g. ‘987’), while wider-context communication can use more terms to increase accuracy over the possible matches/meanings.
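The product query above can be sketched in a few lines of Python. This is only an illustration of the idea, not a proposed implementation: the records, field names and the `matches` helper are all made up, and matching is plain equality rather than the probabilistic evaluation of provenance and context discussed earlier.

```python
from typing import Dict, List

# Hypothetical records gathered from uncoordinated sources. Every field
# name and value is invented for illustration.
RECORDS: List[Dict[str, str]] = [
    {"source": "foobah products database", "type": "product",
     "id": "987", "price": "12.50"},
    {"source": "foobah staff database", "type": "employee",
     "id": "987", "name": "A. Smith"},
    {"source": "acme products database", "type": "product",
     "id": "987", "price": "3.99"},
]


def matches(description: Dict[str, str]) -> List[Dict[str, str]]:
    """Return every record consistent with all terms in the description."""
    return [
        record for record in RECORDS
        if all(record.get(key) == value for key, value in description.items())
    ]


# The more terms in the description, the fewer candidate resources remain.
narrow = matches({"id": "987"})
wider = matches({"id": "987", "type": "product"})
full = matches({"id": "987", "type": "product",
                "source": "foobah products database"})
```

On this toy data the bare id ‘987’ matches three records, adding ‘product’ leaves two, and adding the database name leaves exactly one. An agent that already knows it is talking to the foobah products database can get away with the bare id; an agent working across sources supplies more terms.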
An advantage of a model based on such a scheme is that it can support existing data automatically - it removes the need to create new ‘global’ identifiers for each thing being imported.
Also, by implicitly allowing existing established identity schemes you get a good chance of serendipitous term sharing between documents - a high chance of being able to usefully merge data that was created without coordination.
So I would urge Semantic Web practitioners to consider the possibility of relaxing the ‘universal identifier’ constraints on identity: you still get to keep precision, but you also get vastly increased simplicity, lower barriers to entry and trivial access to all the existing data on the web! That’s a pretty good trade in my opinion.

I can see your point and think you’re right about the problems of unique identifiers, as identity is a matter of meaning, and meaning a matter of context (category theory, anyone?). For a global semantic web this raises the question of how to describe context. Using unique ids to describe the shared context obviously leads to a vicious circle here, so some other means are necessary.
I think you’re absolutely right to consider URIs as fair game for re-evaluation, though I personally think they’re the best thing since sliced bread. I don’t know to what extent using quads globally might have helped, but I’ve a feeling it would be closer to shunting the problem along than to solving it. The fact that URIs have scaled on the current web suggests that there is something right here.
But when it comes to the SemWeb, you can *almost* already make the kind of no-URI descriptions you describe:
id::PhilDawes (weight 10st, date 24/12/2005)
id::PhilDawes (weight 10st 3lbs, date 26/12/2005)
Consider id as a bnode, with associated properties, maybe localId “PhilDawes”, weight “10st”, date “24/12/2005” etc.
Danbri’s Identifying Things in FOAF post puts the case well for the indirected id idea.
However, there is quite a big *almost*. How do you identify the relationships between things? The value of the “weight” property above is expressed as a string containing the units - that isn’t likely to be a very common use of the term “weight”. Although relationships could be disambiguated with the same multi-reference approach used for identifying things, if I remember correctly this led to one of the issues with the early Semantic Networks. You can build lovely big interconnected graphs, but then if you try to do anything with them (query, inference, transformation) you’ve got a problem. The use of global IDs for the relationships helps make these problems tractable.