One symbol, multiple meanings

The Tagtriples scheme is working pretty well at work, which is turning out to be a good testbed for the open-source tagtriples code I write in my spare time. We've got a deployment of the aggregator with 1.5M triples in it.

I think I've got the semantics pretty much sorted:

  • Symbols used multiple times in the same graph are assumed to mean the same thing each time.
  • However, the same symbol used in different graphs may or may not denote the same thing. (It's up to the reader to decide for themselves how likely this is when they interpret the graphs within a certain context.)

This sounds pretty woolly, but I've found it to be reasonably workable - it's easy to use descriptive statements to whittle down the set of meanings for a symbol within a certain context.

The remaining hurdle to this scheme occurs when you want to handle two meanings/senses of the same symbol in a single document.

E.g. maybe you want to say

"TradeBroker (the application) has project team TradeBroker (the team)."

This seems to arise most frequently when you are making statements to link information from two (unconnected) sources that use the same symbol to mean different things. E.g. in the above example the first 'TradeBroker' might come from an application database, and the second from a people/groups LDAP directory.

After much thought (mainly about whether this could be encoded using statement proximity), I've decided that the most workable way I can think of is to add an extra graph-specific field for each value in a statement.

In practice this means the addition of 3 new fields to the triples table: i.e. in addition to the s,p,o fields for each statement, there's also a graph specific (g)s, (g)p, (g)o that allows differentiation between different 'uses' of the same symbol within the same graph. This sounds heavyweight, but I'm hoping the new fields won't need indexing since they'll be used to filter already-small resultsets.
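A rough sketch of what the extra fields might look like, assuming an SQLite store. The table layout, the column names (s_sense, p_sense, o_sense) and the convention that sense 1 is the default are illustrative assumptions, not the real tagtriples schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE triples (
        s TEXT NOT NULL,
        p TEXT NOT NULL,
        o TEXT NOT NULL,
        g INTEGER NOT NULL,                 -- graph id
        s_sense INTEGER NOT NULL DEFAULT 1, -- which 'use' of s within graph g
        p_sense INTEGER NOT NULL DEFAULT 1,
        o_sense INTEGER NOT NULL DEFAULT 1
    )
""")
# Only s, p, o are indexed; the sense fields just filter small result sets.
conn.execute("CREATE INDEX triples_spo ON triples (s, p, o)")

# The TradeBroker example: sense 1 is the application, sense 2 the team.
rows = [
    ("TradeBroker", "type", "Application", 1, 1, 1, 1),
    ("TradeBroker", "hasProjectTeam", "TradeBroker", 1, 1, 1, 2),
    ("TradeBroker", "type", "ProjectTeam", 1, 2, 1, 1),
]
conn.executemany(
    "INSERT INTO triples (s, p, o, g, s_sense, p_sense, o_sense) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    rows,
)


def types_of(symbol, graph, sense):
    """Return the 'type' values for one sense of a symbol in one graph."""
    cur = conn.execute(
        "SELECT o FROM triples "
        "WHERE s = ? AND p = 'type' AND g = ? AND s_sense = ?",
        (symbol, graph, sense),
    )
    return [row[0] for row in cur]
```

Here types_of("TradeBroker", 1, 1) and types_of("TradeBroker", 1, 2) give different answers even though the symbol and the graph are the same.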

In serialized form, I'd imagine the above example would be expressed:

TradeBroker type Application
TradeBroker hasProjectTeam TradeBroker[2]
TradeBroker[2] type ProjectTeam
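The [n] suffix could be parsed along these lines. This is only a sketch: it assumes one statement per line, three whitespace-separated tokens per statement, and that a token with no suffix means sense 1. None of that is settled syntax, and the function names are made up for illustration.

```python
import re

# A symbol optionally followed by a sense number in square brackets.
TOKEN = re.compile(r"^(?P<symbol>[^\[\]\s]+)(?:\[(?P<sense>\d+)\])?$")


def parse_token(token):
    """Split 'TradeBroker[2]' into ('TradeBroker', 2); no suffix means sense 1."""
    match = TOKEN.match(token)
    if match is None:
        raise ValueError(f"bad token: {token!r}")
    return match.group("symbol"), int(match.group("sense") or 1)


def parse_statement(line):
    """Parse one 'subject property object' line into three (symbol, sense) pairs."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 tokens, got {len(parts)}: {line!r}")
    return tuple(parse_token(part) for part in parts)
```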

Hopefully I'll get a chance to implement this and try it out some time later this week.

Contextual disambiguation not a big problem?

I've been struggling with workable strategies to overcome the symbol-ambiguity problem, both in RDF (which can easily suffer from context skew), and also in my TagTriples scheme which just uses symbols/words instead of URIs to identify things and is thus more susceptible to ambiguity.

Had a bit of a revelation moment on the train yesterday:

Hypothesis: For KM applications, disambiguation of symbols (and URIs) into their contextual meanings does not necessarily need to be done up-front, either at the authoring or at the aggregation stage.

This is based on observations of the three principal information consumption use cases at work: Structured-Querying, Searching and Browsing.

I noticed that:

People authoring structured queries (think SQL, SPARQL) know a lot about the structure of the information they want to query, and are easily able to derive the meaning of symbols from their context even when there is ambiguity. This is because (1) they are human, and (2) they understand the domain they are querying within.

But more importantly, structured query languages then give them the expressive power to capture this symbol-disambiguation reasoning by including context and scope patterns in the query.

e.g. There may be two 'BondTraders' as the result of aggregating from multiple contexts (maybe one is a software project and the other the deployed production software itself). However only one of them is running on a cluster of machines spread across two datacentres, and so the query factors out the other sense of the term.
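As a toy illustration of a query factoring out the unwanted sense, here's a sketch over an in-memory list of (subject, property, object, graph) tuples. The property names (runsOn, spansDatacentres, projectLead), the graph names and the data are all invented for the example.

```python
# Aggregated statements from two sources that both use the symbol 'BondTrader'.
triples = [
    ("BondTrader", "type", "Project", "projects.txt"),
    ("BondTrader", "projectLead", "Frank", "projects.txt"),
    ("BondTrader", "type", "Application", "deployments.txt"),
    ("BondTrader", "runsOn", "cluster7", "deployments.txt"),
    ("cluster7", "spansDatacentres", "2", "deployments.txt"),
]


def bondtrader_on_two_datacentres():
    """Statements about the BondTrader that runs on a two-datacentre cluster.

    The context pattern (runsOn a cluster that spansDatacentres 2, stated in
    the same graph) picks out the graphs describing the deployed software.
    """
    matching_graphs = set()
    for s, p, o, g in triples:
        if s == "BondTrader" and p == "runsOn":
            if (o, "spansDatacentres", "2", g) in triples:
                matching_graphs.add(g)
    return [
        (s, p, o)
        for s, p, o, g in triples
        if s == "BondTrader" and g in matching_graphs
    ]
```

The project sense never shows up in the result, even though nothing was disambiguated when the data was aggregated.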

This has an important ramification - it means that, potentially, for structured query clients the ambiguity can be left in the aggregated data.

Ok - so no need for the aggregator to disambiguate up-front for structured-query clients. This leaves clients who are searching and browsing.

The searching and browsing activities are interesting because they often occur when the client (human or software) is hoping to discover new information. The client doesn't necessarily know the structure of the data, and maybe doesn't even fully understand the domain of the information in which it is searching and navigating. The client is therefore poorly equipped to differentiate the meanings of ambiguous symbols.

However, the other characteristic of this activity is that it usually involves a narrow-band, iterative interface.

E.g. 'Searching' clients (think Google) are characterized by using a simple interface, maybe iteratively, to narrow down the search space until they can latch onto a small number of results that are close enough to their goal.

'Browsing' clients (think web browser, URIQA) retrieve (at most) a handful of nodes at once, executing small iterative traversal queries as they navigate carefully through new data and structure.

In both of these modes, I'd speculate that the size of the knowledge chunks retrieved is small enough for the computer to apply disambiguation analysis strategies in real time.

E.g. the statement

BondTrader productionServer FooBahServer12

..could be statistically disambiguated in real time by looking at the scope of the document supplying this information. If most of the subjects having property 'productionServer' map to external symbols that have 'type Application', then it follows that this BondTrader is probably also an application. So the link from BondTrader can point to a chunk of aggregated BondTrader information from documents/graphs describing applicationy things (rather than projecty things).
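A minimal sketch of that kind of vote, assuming the aggregator can already list the types seen for each symbol. The sample data, the extra application names and the function name are made up, and a real version would want a smarter scoring scheme than a straight majority.

```python
from collections import Counter

# Statements from the document supplying the BondTrader statement.
document = [
    ("BondTrader", "productionServer", "FooBahServer12"),
    ("FxPricer", "productionServer", "FooBahServer07"),
    ("RiskEngine", "productionServer", "FooBahServer03"),
]

# Types seen for each symbol across the aggregated data (hypothetical).
known_types = {
    "BondTrader": {"Application", "Project"},  # the ambiguous one
    "FxPricer": {"Application"},
    "RiskEngine": {"Application"},
}


def guess_sense(subject, prop, document, known_types):
    """Guess the type of `subject` from other subjects sharing `prop`.

    Returns (type, share of votes), or None if there is no evidence.
    """
    votes = Counter()
    for s, p, _ in document:
        if p == prop and s != subject:
            types = known_types.get(s, set())
            if len(types) == 1:  # only unambiguous neighbours get a vote
                votes.update(types)
    candidates = known_types.get(subject, set())
    ranked = [(t, n) for t, n in votes.most_common() if t in candidates]
    if not ranked:
        return None
    best, count = ranked[0]
    return best, count / sum(votes.values())
```

The second element of the result is a crude confidence figure that a client could use as a threshold when tuning how aggressive the disambiguation should be.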

Moreover, the interface is hands-on and iterative (even for software clients), potentially allowing the client to tune the disambiguation on the fly - adjusting the error rate and trading off false-negatives in the process.

So to conclude: It looks to me, contrary to my earlier intuition, that symbol/context disambiguation can be delayed until the data is closer to the client. This is good news because it means the information can be aggregated without processing and transformation (and thus automatically without potential signal-loss/skewing), and also that the disambiguation process can be iterative, dynamic, and tailored to the client context.

Priorities

Having problems sleeping again - a familiar pattern: I wake up, I start thinking about stuff, I don't sleep again until I write something down.

OK, semantic web - What's useful/important now?:

  1. Getting data
  2. Aggregating data
  3. Querying the aggregated data

What's not so important?

  1. Getting people to agree on how to serialize the data
  2. Getting people to agree on how to model the data

The thing is, I've been assuming that the latter two are important for achieving the goals in the first set, so much so that I confused them with being the goals.

Shift in emphasis:

Instead of coming up with a data format that's easy to aggregate (e.g. rdf, tagtriples), maybe concentrate on making tools that help in aggregating the existing stuff.

Instead of requiring the data publisher to work to fit the model, shift the work onto the person aggregating the data, and give them better tools to do it. Or maybe even the person querying the data (since they have the most to gain).

And make it easier to publish, even if it means more difficult to aggregate and query.

(Actually, what's most important is me getting some sleep).

Is Identity in the eye of the beholder?

A commonly held principle of identity within RDF circles is that the owner of the URI gets to pick what it identifies. That sounds perfectly reasonable in theory, but unfortunately my experience with using RDF at work suggests that in practice the meaning of a URI tends to skew a bit with usage and context.

For example, at work we exported the LDAP directory of employees as RDF, generating a series of URIs in the process. These URIs were then used in other graphs to connect other data to employees. In this data, sometimes the URI is used to identify the user-entry in the directory, and sometimes to denote the person itself. (e.g. probably app:BondTrader isn't administered by the person:Frank LDAP directory entry, it's administered by the person it denotes).

This is of course a matter of precision, and in theory we could have one set of URIs to denote the people, and another set to denote the directory entries. But the problem is that, to a greater or lesser extent, this seems to happen all the time.

E.g. We've had similar happenings with dns-name vs server - i.e. sometimes the concept of dnsname blurs with the server it points to (e.g. if you're in an app team), and sometimes it doesn't (if you look after dns). The properties of dnsname make sense in either context - e.g. you can think of an alias in terms of a server quite easily.

Off the top of my head, others include:

  • application vs monitoring-configuration-entry
  • windowslogin vs person
  • application (software) vs project

So what's the solution? Should we be murderously precise about what we mint, or do these vague overlapping concepts have their place?

This is especially interesting to me because in tagtriples the concept of identity is a little blurry anyway. Identity tends to get built up by description rather than by minting an 'id', and so skews according to the context it's being used in.

This blurriness has led me to consider putting the final responsibility of identity on the shoulders of the 'aggregator' (i.e. the person doing the aggregating) rather than the author of the data. It's a compelling solution since the person doing the aggregating is collecting and merging data for a particular purpose under a particular context.

E.g. my tagtriples aggregator at work is used to collect data for the purpose of managing applications in DRKW, and thus all the data is specific to DRKW. If, say, DRKW was ever merged with another investment bank I'd need to reconcile this data, and so would probably add 'drkw' tags to all of the existing application data in order to manage collision with the new data.

Using this mechanism I could also get to choose (to some degree) whether the dnsname 'ab35622abc' is different to the server 'ab35622abc' for the context I'm aggregating in.

Implementing graphs as triple ranges

Had an implementation thought on the train this morning:

Instead of implementing a graph as a single id in the 4th column of the quads table, how about implementing it as a range of triple ids. i.e. Each triple is given a sequential id in the 4th column when asserted, and the graphs are stored as (startid, endid).

Advantages:

  • subgraphs can also be captured (useful for 'quoting')
  • the order of the graph is preserved (useful for retaining graph order for e.g. signing, and handling ordered collections without requiring an ordered collection type)
  • the number of statements in a graph is implicitly stored (endid - startid + 1 if both ends are inclusive)

Disadvantages:

  • Can't use a UNIQUE(s,p,o,g) index to assert that the same statement doesn't occur twice in a graph. (Is this a problem? Not really - can put up with duplicates)
  • Graphs can only be asserted one at a time. (or could lease a block of ids or segregate the id space)
  • Store size limited by length of id field. (actually I suspect other factors will limit scalability before this, and it could be made 64 bit)
  • Queries involving SOURCE become more tricky to implement internally. (WHERE triple.graph >= g.start AND triple.graph <= g.end)
  • Can't add triples to an existing graph (well, not without re-allocating ids or maybe using a cluster field in the graph table or something)

Hmmm... Subgraphs and preserved ordering are compelling advantages - this is sounding like a good tradeoff to me...
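To get a feel for it, here's a small sketch using SQLite. The table and column names are invented for the example, and it assumes that startid and endid are both inclusive and that graphs are asserted one at a time (so each graph gets a contiguous block of ids).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE triples (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- sequential id per statement
        s TEXT, p TEXT, o TEXT
    );
    CREATE TABLE graphs (
        name TEXT PRIMARY KEY,
        startid INTEGER,  -- first triple id in the graph (inclusive)
        endid INTEGER     -- last triple id in the graph (inclusive)
    );
""")


def assert_graph(name, statements):
    """Insert the statements in order and record the graph as an id range."""
    ids = [
        conn.execute(
            "INSERT INTO triples (s, p, o) VALUES (?, ?, ?)", st
        ).lastrowid
        for st in statements
    ]
    conn.execute("INSERT INTO graphs VALUES (?, ?, ?)", (name, ids[0], ids[-1]))
    return ids[0], ids[-1]


def triples_in(name):
    """Return the statements of a graph (or subgraph) in asserted order."""
    return conn.execute(
        "SELECT t.s, t.p, t.o FROM triples t "
        "JOIN graphs g ON t.id BETWEEN g.startid AND g.endid "
        "WHERE g.name = ? ORDER BY t.id",
        (name,),
    ).fetchall()


start, end = assert_graph(
    "doc1", [("a", "says", "b"), ("b", "likes", "c"), ("c", "type", "Thing")]
)
assert_graph("doc2", [("x", "y", "z")])

# A subgraph ('quote') is just another range inside doc1's range.
conn.execute("INSERT INTO graphs VALUES (?, ?, ?)", ("doc1/quote", start + 1, end))
```

Asking for triples_in("doc1/quote") returns only the last two statements of doc1, in their original order, without any of them being stored twice.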

more disambiguation strategies

Following up on the disambiguation thing, once the aggregator knows that it's dealing with multiple senses of the same tag, the strategies become much more obvious. Here's a couple of ideas:

Indicating other similar usages

Am thinking that a graph could 'import tag-senses' from another graph. I.e. effectively saying 'if you're trying to 2nd-guess what I'm talking about, I probably mean the same as things in this document'. Something like: http://my/document/ similarVocabTo http://some/other/document/
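One way this could work, sketched very loosely: when a graph doesn't pin down a tag's sense itself, follow its similarVocabTo links and borrow the sense from there. The data structures, the third document URL and the sense labels (Band, RailwayPart) are all hypothetical.

```python
# graph -> graphs it declares 'similarVocabTo'
similar_vocab_to = {
    "http://my/document/": ["http://some/other/document/"],
}

# graph -> {tag: sense the graph uses it in} (hypothetical data)
senses_by_graph = {
    "http://some/other/document/": {"Sleeper": "Band"},
    "http://another/document/": {"Sleeper": "RailwayPart"},
}


def likely_sense(tag, graph, seen=None):
    """Guess the sense of `tag` in `graph`, following similarVocabTo links."""
    if seen is None:
        seen = set()
    if graph in seen:  # guard against import cycles
        return None
    seen.add(graph)
    own = senses_by_graph.get(graph, {}).get(tag)
    if own is not None:
        return own
    for other in similar_vocab_to.get(graph, []):
        found = likely_sense(tag, other, seen)
        if found is not None:
            return found
    return None
```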

This strategy could be used to e.g. import well-defined tag based ontologies and vocabularies.

Statistical Analysis

Once you have enough data, another potential tool for disambiguation is statistical analysis (once you know there is ambiguity). E.g. you could use statistics to guess the cardinality of a property (to a certain level of error). Something like : 'eyecolour is something that practically all resources only appear to have one of. Therefore the PhilDawes with brown eyes is likely to be different to the PhilDawes with blue eyes.'
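A back-of-envelope sketch of the cardinality guess, assuming the data is available as plain (subject, property, object) tuples. The 0.8 threshold, the function names and the sample data are arbitrary choices for illustration.

```python
from collections import defaultdict


def single_valued_ratio(triples, prop):
    """Fraction of subjects that have exactly one distinct value for `prop`."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    if not values:
        return None
    single = sum(1 for v in values.values() if len(v) == 1)
    return single / len(values)


def probably_ambiguous(triples, prop, threshold=0.8):
    """Subjects with several values for a property that is nearly always single-valued."""
    ratio = single_valued_ratio(triples, prop)
    if ratio is None or ratio < threshold:
        return []  # property doesn't look single-valued, so no evidence
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    return sorted(s for s, v in values.items() if len(v) > 1)


# Nine people with one eye colour each, plus a 'PhilDawes' with two.
triples = [(f"Person{i}", "eyecolour", "brown") for i in range(9)]
triples += [
    ("PhilDawes", "eyecolour", "brown"),
    ("PhilDawes", "eyecolour", "blue"),
]
```

With this data nine out of ten subjects have a single eyecolour, so PhilDawes gets flagged as a symbol that probably covers two different people.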

Overdose

Have been feeling poorly today. On the train back to Birmingham I took an overdose of Sudafed purely by accident (I always take 2 tablets. You're only meant to take one with Sudafed). Don't think it did me any harm, and while I was sniffling I had some ideas on disambiguating tag senses.

How to disambiguate tag senses?

Have just read the tagwebs page (thanks Seth). For me, the problem with all the meta-tag structured data ideas (including tripletags in its current form) is in disambiguation of 'senses' of a tag. The tagweb idea plays out well until you have multiple meanings for the same tag. E.g. Phil and Phil (2 different people). Or Sleeper, Sleeper and Sleeper.

The big question is, how do you disambiguate a tag into its senses*? Can statistical analysis help?

(*without losing the simplicity and serendipity of the system)