Namespacing & Context - ramifications for the semantic web
Apr 18th, 2008 by Phil Dawes
In determining the meaning of tokens used in communication there are two widely used approaches to disambiguation that I’ll characterise as ‘namespacing’ and ‘context’.
When humans communicate amongst themselves they use the context of the communication to narrow down the range of possible meanings of terms used in the exchange, and human language doesn’t employ namespaces at all. On the other hand computer identifier schemes typically use namespaces to prevent term clash, and don’t use context at all.
The mechanisms operate differently:
Namespaces:
- Every use of the namespaced term refers to the same concept (or at least if it doesn’t this is considered an error)
- Deterministic
Context:
- The concept denoted by the term depends on the context in which it is used
- Statistical ( is that the right word? )
My thinking is that namespaces work well in a closed environment because coordination overhead is low and deterministic programs are easy to write. Namespaced schemes do however require a management mechanism to ensure that each use of the same term denotes exactly the same thing. This works well if the terms are grounded in the system - e.g. on the www a URL is used to fetch a document, and thus its use as identifier for that document is grounded.
However the semantic web is an open environment with little grounding, which means that holistic term coordination and management isn’t practical. Thus web-scale semantic web systems need to employ some degree of context based disambiguation anyway - i.e. the system can’t globally merge statements together without considering issues of provenance and consistency. I wrote about this issue here. At present this consideration is usually handled manually by the person operating the RDF store or software, but as these systems grow and scale, more of these issues will need to be addressed by software.
Note that it is important to distinguish between this and the issue of trust in the content of the communication - here I am purely talking about interpreting the meaning of the communication, specifically measuring term consistency between documents from disconnected sources.
Now if you take this inevitable use of context at web scale as given, my question is: Could the semantic web bootstrap and scale better with a system that disambiguated entirely based on context and didn’t employ namespaces at all? (i.e. like human language communication).
So I’ve been thinking along the lines of a scheme where literals and bnodes are used in place of URIs in RDF documents. Vocabularies use literal terms in place of URIs, and combinations of terms are used to infer meaning in aggregate.
Non-determinism issues aside, this approach does have a central advantage: it removes the coordination and bootstrap overhead associated with use of namespaced identifiers, and particularly with issues peculiar to URIs:
- artificial namespaces mean there’s little term match serendipity between disconnected uncoordinated clients
- pre-existing identifier schemes are commonly not valid URIs, making reuse difficult
- URIs introduce unnecessary term ownership, authority and lifecycle issues
- Other issues peculiar to URIs add to cognitive overhead: hash vs slash, and whether a URI denotes a document or the thing it describes
One particular advantage of the literals-in-combination approach is that data can be lifted from existing sources without the requirement to invent and translate identifiers into URI schemes. Currently translation of data into traditional RDF consists of two challenges:
- converting the structure of the data into a triple graph
- translating the identifiers into a URI scheme
Whereas the former is a one-shot deal for each data format, the latter frequently requires manual input for each document and is IMO the single biggest hurdle to putting data onto the semantic web.
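As a rough illustration of the first challenge done without the second, here is a minimal Python sketch that lifts CSV rows into triples using only document-local bnode labels and literal terms. The function name `lift_rows`, the plain-tuple triple representation, and the sample data are all assumptions made for this example, not part of any existing RDF library or of the software mentioned in this post.

```python
import csv
import io
from itertools import count


def lift_rows(csv_text):
    """Turn each CSV row into (bnode, column-name literal, cell literal) triples.

    No URIs are minted: subjects are document-local blank node labels and
    the column headers are reused as-is as literal property terms.
    """
    bnode_ids = count()
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"_:b{next(bnode_ids)}"  # label is only meaningful in this document
        for column, value in row.items():
            if value:  # skip empty cells
                triples.append((subject, column, value))
    return triples


# Hypothetical sample data, invented for illustration only.
sample = "title,author,year\nExample Book,A. Writer,2008\n"
triples = lift_rows(sample)
for triple in triples:
    print(triple)
```

The structural conversion is written once per data format, and because the headers and cell values are carried over unchanged, no per-document identifier mapping is needed.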
Of course the downside of the approach is that software consuming the data needs to take a non-deterministic approach to term meaning. There is no globally correct answer to ‘does this term in this document mean the same as this one?’ - instead it is a function of both the context under which the documents were written and of the requirements of the querying client.
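One way to picture that function is the Python sketch below. It treats the context of a literal term as the set of other property terms used on the same subjects, compares two documents with Jaccard overlap, and lets the querying client supply the threshold. Jaccard overlap is simply a convenient stand-in chosen for this example; the names `term_context`, `context_similarity` and `same_meaning`, and the three sample documents, are hypothetical.

```python
def term_context(triples, term):
    """Return the other property terms used on the same subjects as `term`."""
    subjects = {s for s, p, _ in triples if p == term}
    return {p for s, p, _ in triples if s in subjects and p != term}


def context_similarity(doc_a, doc_b, term):
    """Jaccard overlap of the contexts in which `term` appears in each document."""
    a = term_context(doc_a, term)
    b = term_context(doc_b, term)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_meaning(doc_a, doc_b, term, threshold):
    """The client decides how much contextual overlap counts as 'the same'."""
    return context_similarity(doc_a, doc_b, term) >= threshold


# Hypothetical documents: two use "title" for a book, one for a person.
books = [("_:b0", "title", "Example Book"),
         ("_:b0", "author", "A. Writer"),
         ("_:b0", "publisher", "Example Press")]
library = [("_:x", "title", "Another Book"),
           ("_:x", "author", "B. Scribe"),
           ("_:x", "shelf", "12")]
people = [("_:p", "title", "Dr"),
          ("_:p", "surname", "Writer"),
          ("_:p", "forename", "A.")]

print(context_similarity(books, library, "title"))
print(context_similarity(books, people, "title"))
print(same_meaning(books, library, "title", threshold=0.3))
print(same_meaning(books, library, "title", threshold=0.5))
```

The same pair of documents can be judged to agree on "title" by a lenient client and to disagree by a strict one, which is the sense in which there is no globally correct answer.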
Unfortunately I suspect that as people try to get traditional w3c semweb technologies to scale up in web scale environments they’re going to find themselves in the same non-deterministic boat.
I’m experimenting with a literals and bnodes approach in my own software and will post updates to my blog.

My recent thinking re “semantic” data is that maybe a search engine is a better model for such systems than a triple store, which resonates with the points you are making.
A triple store requires very exact inputs to deliver any data at all. In contrast, a search engine’s motto is to “deliver results no matter what”.
Basically, indexing and (page-)ranking all “title literals” (wikinames, blog titles, …) should do the job, you probably wouldn’t need a fulltext search engine.
And let’s not forget the R in REST - a triple store has to return triples, whereas a search engine can return any representation, even a movie clip, which is a much better model for user facing systems.
Phil, should be an interesting experiment, looking forward to hearing how it goes (it sounds similar to the old Semantic Network stuff, which worked well for conceptual map kind of things, but not so good for practical data kind of applications).
One question - a Semantic Web without URIs in the RDF, where’s the Web bit? That an agent can follow its nose (do a GET) to find more information is a feature, no?
Manuel, the triplestore/search engine is a bit apples/oranges - most search engines have some kind of database under the hood, the database itself will require fairly exact input.
Going the other way, a triplestore with SPARQL can make a very simple backend to a search engine, e.g. to find anything with the word “primer” in it:
SELECT DISTINCT ?o
WHERE {
?s ?p ?o .
FILTER regex( ?o, "primer", "i" )
}
LIMIT 10
try pasting it in here:
http://hyperdata.org/sparql/demo/sparql-editor.html
Don’t know about doing away with URIs… But some kind of context seems essential when looking at data from different sources. Two resources with different URIs may be the same thing, and two resources with the same URI may be different. Moreover, the meaning of “same” itself is context dependent!
As I see it, URIs in RDF are 10% about disambiguating between terms, and 90% about allowing lookups of terms.
I suspect that by getting rid of the 10%, you also lose the 90%.