Dark side of the semantic web

I haven't said anything much about semantic web stuff for a while as I've been occupied with other things. However Jim Hendler's 'Tales from the Dark Side' piece in IEEE Intelligent Systems reawoke an old interest. In short: I still think the RDF people have got it wrong with URIs, and so far nobody's convinced me otherwise.

My (same old) argument: URIs are bad for large-scale interoperability. The alternative: just use words and symbols occurring in real life, and use the context inherent in the communication to disambiguate meaning.

The interesting thing about the Hendler piece is that it pretty much walks through the arguments I make for dropping URIs, but then avoids the conclusion:

If you and I decide that we will use the term "http://www.cs.rpi.edu/~hendler/elephant" to designate some particular entity, then it really doesn't matter what the other blind men think it is, they won't be confused when they use the natural language term "Elephant" which is not even close, lexigraphically, to the longer term you and I are using. And if they choose to use their own URI, "http://www.other.blind.guys.org/elephant" it won't get confused with ours. The trick comes, of course, as we try to make these things more interoperable. It would be nice if someone from outside could figure out, and even better assert in some machine-readable way, that these two URIs were really designating the same thing - or different things, or different parts of the same thing, or ... ooops, notice how quickly we're on that slippery slope.

And this neatly sums up the situation with URIs. The low chance of collision represents a tradeoff: You get a high level of semantic precision - it's extremely unlikely that two parties will use the same URI to mean two totally unconnected things. You also get a very low level of semantic interoperability: it's equally unlikely that two unconnected parties will use the same URI to denote (even parts of!) the same thing.

Now I think the precision part is overrated - disambiguation of natural language terms can be tractably (and often trivially) achieved using contextual cues. However, interoperability of data from unconnected sources is really hard, and that's why I think this is a bad tradeoff.

Anyway, the crux of the Hendler piece is that for all the high level work going on in Semantic Web land (ontology languages, description logic), it's currently simple interoperability mechanisms that gain most traction and add the most value: 'a little semantics goes a long way'.

The piece implies (afaics) that this is where effort should be directed, and cites the example of matching FOAF data using email addresses as an illustration of the potential success of this approach. The matching heuristic is: if two FOAF resources are describing people with the same email address, they're very likely to be about the same person.
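A minimal sketch of that matching heuristic, assuming each description has already been reduced to a plain dict. The record shape, the names and the example.org addresses are all made up for illustration; real FOAF data would need parsing first, and lowercasing the whole address is a simplifying assumption.

```python
from collections import defaultdict


def group_by_mbox(records):
    """Group person records that share an email address.

    `records` is a list of dicts with a hypothetical 'mbox' key; this
    is an illustrative shape, not FOAF's actual data model.
    """
    groups = defaultdict(list)
    for record in records:
        # Normalise case and whitespace so trivially different spellings match.
        key = record["mbox"].strip().lower()
        groups[key].append(record)
    return dict(groups)


# Made-up records from three unconnected sources.
records = [
    {"source": "site-a", "name": "Alice Example", "mbox": "mailto:alice@example.org"},
    {"source": "site-b", "name": "A. Example", "mbox": "mailto:Alice@example.org"},
    {"source": "site-c", "name": "Bob Example", "mbox": "mailto:bob@example.org"},
]
merged = group_by_mbox(records)
```

The two 'Alice' records end up in the same group even though the sources never agreed on an identifier for her; the email address does all the work.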

My experience concurs with the 'a little semantics goes a long way' sentiment, but personally I think FOAF has succeeded (for some measure of success) not because of RDF but in spite of it. I'd argue that the only reason the email matching works on a large scale is because email addresses are already concrete symbols grounded in the real world. FOAF didn't create them, it just provided a context for their use. FOAF's formal semantics certainly didn't create this interoperability - the largest example of FOAF data is scraped from LiveJournal's databases, where the users creating the data have little concept of the ramifications of the 'http://xmlns.com/foaf/0.1/mbox' property.

If FOAF had to rely on artificial URIs as the sole means for identifying people it would struggle to gain any traction in the messy real world of the web.

However, on the flip side I think FOAF would work just as well (and gain a lot more traction) if its underlying model didn't employ URIs at all and instead just used triples of words/symbols. Semantic web software would still be able to identify and index FOAF data: i.e. the symbol 'FOAF' is pretty unambiguous on its own, but even if it wasn't, the juxtaposition of the symbol FOAF with properties like 'mbox', 'surname' etc. would suffice for pretty accurate disambiguation.
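To make the juxtaposition idea concrete, here is a rough sketch that guesses which vocabulary a bunch of plain-word triples belongs to by counting overlapping property words. The CONTEXTS table, the 'recipe' vocabulary and the sample triples are all invented for illustration; a real system would need far better statistics than a simple overlap count.

```python
# Hypothetical vocabularies: the property words each context is assumed to use.
CONTEXTS = {
    "FOAF": {"mbox", "surname", "knows", "nick"},
    "recipe": {"ingredient", "serves", "method"},
}


def guess_context(triples):
    """Pick the context whose property words overlap most with the triples.

    Each triple is a (subject, property, value) tuple of plain strings.
    Returns None when no known property word appears at all.
    """
    properties = {prop for _subject, prop, _value in triples}
    scores = {name: len(vocab & properties) for name, vocab in CONTEXTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Made-up triples with no URIs anywhere.
triples = [
    ("phil", "mbox", "phil@example.org"),
    ("phil", "surname", "Dawes"),
]
context = guess_context(triples)
```

Nothing here is globally unique, but 'mbox' sitting next to 'surname' is enough to point at the FOAF vocabulary.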

Gowers Review of UK intellectual property released

Tom Coates comments on the review, commissioned by Gordon Brown to look at intellectual property rights in the UK:

if I'm reading it correctly, it contains recommendations that individuals should have the right to make private copies of their music, that copyright terms should not be extended and that there should be a general provision that any subsequent term extensions should not be retroactive - ie. that people with copyright get what they were entitled to when the work was created. It looks like he's also recommended no changes to the EU's patent law with regards to software patents or genes or business practices, and that there should be provisions which require media that contains DRM to be clearly labelled.

Naturally there's a Wikipedia article with more info:

The British Phonographic Industry and prominent musicians, such as Cliff Richard and Ian Anderson, had lobbied for an extension to 95 years, matching the protection provided in the USA; other musicians, such as Dave Rowntree of Blur provided counteropinions. The Gowers Review found that the UK, compared with the USA, suffers no apparent impediment to creativity due to this disparity.

<lefty alert> I find it heartening that this is being approached from the standpoint of society rather than the individual (in contrast to much American debate on IP issues). Copyright and patent laws are primarily mechanisms designed to maximise the amount of invention/creativity in order to benefit society. Monopoly of the idea or work for the commercial gain of the inventor (or owner of the patent) is a secondary side effect included to amplify that primary benefit, and IMO an unfortunate one at that. </lefty alert>

The full review can be found here (pdf).

Music DRM on its way out?

Nick Carr's thesis is:

  • iPod music sales plateaued in Q1 this year
  • so Apple starts losing its leverage with music companies
  • but music companies will want to sell music that plays on iPods
  • so unprotected mp3s seem the way to go

But won't this lead to rampant piracy?

No. Because there's already rampant illegal copying. Most unauthorized copying is done either through online file-sharing networks or by burning CDs for friends. DRM schemes have little effect on either of those. All new songs are immediately available on file-sharing networks, DRM or not.

EMI is experimenting with selling unencumbered songs - is this the start of a new trend?

Hierarchical Temporal Memory (HTM) Resources

I've been thinking about AI again recently.
This time I got motivated by watching a couple of videos of Jeff Hawkins talking about his HTM (Hierarchical Temporal Memory) ideas. Actually this has been a common theme for me recently - motivated speakers on podcasts and videos are much more likely to get me interested in something than the written word.

Anyway, a look through the OnIntelligence forum brought up a couple of papers I hadn't seen before: Saulius Garalevicius has been experimenting with Jeff and Dileep's visual pattern recognition ideas and provides a much more lucid picture of his system than either Dileep or Jeff in their papers. If you're interested in neural networks or HTM then definitely check these out!

Despite all of the promising results, Saulius' data highlights a scale problem: memory usage of the model quickly undergoes a combinatorial explosion as it goes up the hierarchy, and each subnode of the network ends up dealing with a large conditional probability matrix. This is frustrating - if the research scientists can't get this to scale then what hope have spare-time chancers got?

Whenever I get disheartened about such problems I like to re-read Paul Graham's 'A Plan for Spam' for motivation. Statistical spam filtering techniques were written off by some full-time researchers as being ineffective until Graham demonstrated impressive results on his own data. The main reasons for these results were: (1) using a much larger corpus of training data and (2) massaging the input data to improve statistical recognition (e.g. being careful about tokenisation). These days the most advanced statistical filters (e.g. crm114) frequently exceed human accuracy at detecting spam (often by a factor of 10 or so!).
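For anyone who hasn't read the essay, the core of the statistical approach is combining per-token spam probabilities into a single score. The sketch below shows that combining step only, under loud assumptions: the TOKEN_PROBS table and the UNKNOWN default are invented numbers rather than trained values, the tokeniser is deliberately naive, and every token is used rather than just the most telling ones.

```python
from math import prod


def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities, naive-Bayes style.

    Returns prod(p) / (prod(p) + prod(1 - p)).
    """
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)


def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()


# Hypothetical per-token probabilities; a real filter learns these from a corpus.
TOKEN_PROBS = {"viagra": 0.99, "free": 0.9, "meeting": 0.05}
UNKNOWN = 0.4  # assumed default for tokens never seen in training


def score(text):
    """Score a message between 0 (looks legitimate) and 1 (looks like spam)."""
    probs = [TOKEN_PROBS.get(token, UNKNOWN) for token in tokenize(text)]
    return combined_spam_probability(probs)
```

The algorithm is tiny; the results come from the size of the corpus behind the probability table and from how carefully the tokeniser is tuned, which is exactly the point about (1) and (2) above.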

The real brain appears to employ both of these tactics - the input data is massive and continuous, and the eye, for example, appears to use tricks to assist the statistical analysis of the information coming in (e.g. saccades, fovea, motor driven feedback). Now I'm wondering if a much larger corpus of data could reduce the necessary sophistication in the algorithm.

For me however, the real challenge is in finding an appropriate problem to solve that can drive this stuff (and my enthusiasm) incrementally.

(as usual, all out-of-my-depth caveats apply)

Custom domains on wordpress.com!

At last! Now you can use your own domain name with the excellent wordpress.com service. Actually this is pretty old news now - I almost missed this because I was on holiday and only just noticed it because Scoble has started using 'scobleizer.com' in his permalinks.

Personally I think this is huge; recently a few people have asked me for blogging service recommendations and my only reservation about wordpress.com was that you got tied into their service (because your permalinks had 'wordpress.com' in them). This move eliminates that barrier and one of the big reasons preventing potential bloggers from starting.

Get blogging, people!

Refactoring and the Repl

I'm still persevering with Gambit scheme, and progressing pretty slowly it has to be said. The first thing I've been missing is refactoring tools for scheme.

I wrote the basic python refactoring functionality in bicyclerepairman a long while ago, and having it as part of my daily toolset has strongly influenced the way I program. For example, I tend to follow the 'bash out some code and then clean it up' style of development. In particular, I have a habit of naming variables and functions badly and then renaming them later as I code.

So my initial thought is: no problem - I'll just knock up a bicyclerepairman for scheme! The problem is that I'm not quite sure how to do automated refactoring with a repl. You see Python has no real repl culture (sure it has a repl, but nobody uses it except for trying out simple expressions). People tend to run their program/unittests from scratch each iteration, which means the entire environment gets re-evaluated on each run.

The challenge with running a repl while you develop is keeping it in sync with your refactored code: e.g. if I rename a function that's used in multiple places, that results in lots of code that needs re-evaluating. Can this be done automatically (e.g. could it be made to work by just re-eval'ing files)? Hmm.. I think I need to talk to somebody with a lot more scheme experience than I have. Unfortunately I don't actually know any experienced schemers, especially not in London or Birmingham; maybe somebody from LShift can help?
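One naive take on the 're-eval whole files' idea, sketched in Python since that's what bicyclerepairman is written in: after a rename, work out which files mention the old symbol and would therefore need sending to the repl again. The function name, the file names and the identifier pattern are assumptions for illustration; the regex only approximates scheme symbols, and this says nothing about stale definitions left behind in the running image.

```python
import re

# Characters assumed to be part of a scheme symbol (a rough approximation).
SYMBOL_CHARS = r"[\w\-!?*<>=/+]"


def files_needing_reeval(sources, old_name):
    """Return the file names whose text mentions `old_name` as a whole symbol.

    `sources` maps a file name to its source text. This is a plain text
    search, not a real scheme reader, so strings and comments also match.
    """
    pattern = re.compile(
        "(?<!" + SYMBOL_CHARS + ")" + re.escape(old_name) + "(?!" + SYMBOL_CHARS + ")"
    )
    return sorted(name for name, text in sources.items() if pattern.search(text))


# Made-up project contents.
sources = {
    "a.scm": "(define (parse-line s) (string-length s))",
    "b.scm": "(map parse-line lines)",
    "c.scm": "(define (parse-line-fast s) s)",
}
stale = files_needing_reeval(sources, "parse-line")
```

Here a.scm and b.scm would need re-evaluating, while c.scm is left alone because 'parse-line-fast' is a different symbol.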

Gold Mine

I recently discovered Steve Yegge's republished internal Amazon blog. This is an absolute gold-mine of insightful commentary about programming, blogging, organisation and some other stuff. Think 'Paul Graham' but with less VC and more blogging.

I slurped the whole lot onto my laptop for reading on the tube, bus etc.:


wget -rNk -np "http://steve.yegge.googlepages.com/blog-rants"

Uncovering treasure troves like this makes me wonder how many other essay collections I'm missing. Programming Reddit found me this one, and I can't help thinking there must be loads more out there.

OpenID gaining momentum

OpenID is a decentralized identity/single-sign-on system which uses URLs to identify people. Now I haven't particularly looked at this stuff for a while, so I was surprised to see how much had happened in the last few months.

In particular:

  • Verisign have joined the openid bandwagon. They've got somebody participating in openid2.0, and they've got an openid based identity service. I think this adds a little extra credibility to the whole thing, at least for enterprisey types.
  • Dick Hardt from SXIP is now involved in the openid2.0 spec. This is good news - I wasn't keen on the original sxip1.0 but I think Dick is one of those charismatic types who promotes things until they succeed. Hopefully he'll be a Dave Winer for open digital identity.
  • Some companies have got together and funded an openid promotion initiative: iwantmyopenid.com.

All this momentum has prompted me to add openid to my wordpress system - hopefully at some point in the future I'll be able to turn the anonymous commenting stuff off altogether. I used this plugin for the comment authentication functionality, which worked out of the box (although I did have to tweak it a bit to remove the livejournal cruft and get things looking right). I haven't tried the 'use your blog as an openid server' stuff as I prefer to delegate to a 3rd party provider.

Actually having said that, I don't see the delegation feature mentioned enough: The super-cool thing about openid is that it allows you to delegate authentication to a 3rd party security provider but still use identity URLs under your domain (and thus your control). E.g. my public openid is 'phildawes.net', which means that other web sites authenticate me by going to the http://phildawes.net/ url (which I control). However I currently have the following text in the head of the html page served at that address:


<html>
<head>
...
<link rel="openid.server" href="http://www.myopenid.com/server" />
<link rel="openid.delegate" href="http://phildawes.myopenid.com/" />
<meta http-equiv="X-XRDS-Location" content="http://phildawes.myopenid.com/xrds" />
...
</head>

This gubbins tells sites to do the actual authentication with myopenid.com (which is a free openid security provider). The upshot is that I don't have to run my own security authentication software to control things, but I'm also free to move to a different provider at any time without changing/losing my online identity. Sweet!
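For the curious, here's roughly what a consuming site has to do with that page: pull the openid link tags out of the head. This is only a sketch of the discovery step using Python's standard html.parser; the class and function names are mine, it skips fetching the page and the XRDS route entirely, and a real consumer library does a good deal more checking.

```python
from html.parser import HTMLParser


class OpenIDLinkParser(HTMLParser):
    """Collect <link rel="openid.*" href="..."> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are routed here too by default.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = attributes.get("rel") or ""
        if rel.startswith("openid.") and attributes.get("href"):
            self.links[rel] = attributes["href"]


def discover(html_text):
    """Return a dict mapping openid rel values to their href targets."""
    parser = OpenIDLinkParser()
    parser.feed(html_text)
    return parser.links


# The two link tags from the page head quoted above.
PAGE = """
<html>
<head>
<link rel="openid.server" href="http://www.myopenid.com/server" />
<link rel="openid.delegate" href="http://phildawes.myopenid.com/" />
</head>
</html>
"""
found = discover(PAGE)
```

The consumer then talks to the 'openid.server' address about the 'openid.delegate' identity, while the URL the user actually typed stays phildawes.net. Switching provider is just a matter of editing those two hrefs.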

Scheme is love

I've been battling again with Scheme recently. Having spent the last couple of months playing with various languages, I've come to the conclusion that scheme is the only one that has any real possibility of becoming my next 'general purpose language'. Python held that crown for many years, but its lack of blocks and concurrency caused me to start looking elsewhere and now I'm spoilt.

So, to Scheme. I've not found another language that can offer:

  • functional programming
  • message-passing concurrency (see termite)
  • macros
  • continuations
  • terse syntax
  • hardly any language kludges

...and as somebody who programs for fun in his spare time, these things really do matter to me. The biggest obstacle to full enlightenment is the s-expression aesthetic: To my algol-shaped brain that lisp syntax just looks so damn ugly!

Anyway, I'm finding that the most enjoyable and self-affirming way to develop some scheme skills is (ironically) to re-read Peter Seibel's 'Practical Common Lisp' book with scheme glasses on. Now if there's anyone going to convince me that lisp syntax isn't just a grotty heap of parentheses, it's going to be Peter. His book just radiates lisp-love, and you can't help but be hooked. It says 'Look! You fools! Just look what you're missing!'. I've been translating various examples into scheme, just to test the water.