When is ‘buy better than build’?

I don't have much experience outside software, but when I think back to the previous successful attempts at buying/bringing in an outside software component or service, they all have something in common: commodity.

On the other hand, I can't think of a single time (both in my experience at current and previous employers) where outsourcing or buying a tailored component or service has resulted in something more successful than if we'd built/done it in-house.

I was giving this a bit of thought on the way into work this morning, especially in light of this quote from Malcolm's blog:

to outsource something you must understand the APIs and contracts between the outsourced and the remaining before you initiate the outsource

When consuming a commodity product, the customer is generally prepared to bend its requirements and approaches to fit that product. The commodity service becomes the authority ('this is how it's done'), and the contract of service is rigid and easily understood.

However, a tailored service leaves the API/contract and the authority to the customer, which means that the customer must understand and articulate its own requirements. Unfortunately, providing a service or product on ad hoc requirements (especially poorly understood ones) requires a high degree of communication between supplier and customer, not to mention common understanding. This is where in-house building does better in my experience - the communication bandwidth is higher and the understanding and motivations are much more closely aligned.

So I would hazard a guess that it is better to buy when the supplier sets the interface (and sticks to it) and the customer is able to understand and adapt to that interface. Usually that means the supplier has already built the product or service and is merely selling it. Letting the client negotiate the interface/terms is much more error-prone and risky, and is rarely successful in my experience (both in terms of cost and return).

Decentralized Version Control

I've been experimenting with BazaarNG, a decentralized version control system. I find decentralized source control intriguing, mainly because I've been using CVS for so long that it's odd not to have a central coordinating server. Anyway, I've converted the bicyclerepairman CVS tree into a Bazaar branch using Tailor, so you could say I'm sort of committed now.

A benefit of decentralized version control is the low barrier to entry - you don't need to set up and manage a central server, so getting something version controlled is just a case of doing 'bzr init' in the root directory of the project you want versioned. This creates a local branch - note that the working copy is the branch - there's no checking out of a separate copy to develop on. Now I never got round to creating a subversion repo for my tagtriples stuff, mainly because the overhead didn't seem worth it for a one man project. I suspect I would have created a bzr one from the word go and that would have had a number of advantages.

From an opensource perspective the interesting thing is that everyone effectively has their own branch by default. They can publish this branch on the web by just sticking it somewhere (e.g. using rsync to keep it up to date) and then merge and cherry pick updates from other branches trivially - the powerful decentralized tracking algorithms carefully track the provenance and history of each changeset.

The appealing thing about this for me is the ease with which people can join in project development. With CVS or Subversion the project maintainer must approve somebody and give them write access to the source repository for versioned development to happen. As a developer this usually requires creating some un-versioned patches here and there to prove your worth. With Bazaar you just create your own branch from somebody else's public one and away you go - do some versioned changes and then mail the maintainer pointing them at your branch.

Refactoring the singleton state out of Bicyclerepairman

My recent work on protest has rekindled latent interest in bicyclerepairman - my (aging) python refactoring toolkit project. BRM has a number of useful facilities for parsing and querying python source trees, and I think this functionality is just the sort of thing Protest needs for inter-package dependency tracking.

The problem is that these facilities are heavily tangled into the bicyclerepairman codebase and are proving tricky to prise out. The main reason for this is state and singletons.

In the early days BRM worked by constructing a big up-front abstract syntax tree model of the code from which to do its code traversal and manipulations. This central tree was maintained by BRM throughout the refactoring/development session. The 'big-AST' design was informed at the time by the then state-of-the-art refactoring toolkits from the Smalltalk and Java worlds, but the approach turned out to be a dead end: Python is too dynamic to be able to do this sort of up-front code inspection accurately. And building the AST was slooooowww...

Later I migrated the design to a completely different approach which involved dynamically inspecting the code base. This design didn't physically require a big stateful tree to be maintained, but the legacy of long-term state left its hooks in the design well past its sell-by-date. Migrating the design over a period of time meant leaving the illusion of a central AST (so that the old code still worked), but now generated dynamically on the fly in a 'lazy' fashion. Unfortunately this notion of centralized structure still provided implicit hooks for hidden global state (like a common 'pythonpath').
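To make the 'hidden global state' problem concrete, here is a minimal sketch of the before and after shapes. It is not taken from the bicyclerepairman codebase: the function names and the module lookup logic are invented purely for illustration.

```python
import os

# Before: a module-level singleton that every query silently depends on.
# (Hypothetical names; this is not BRM's real code.)
_pythonpath = []


def set_pythonpath(paths):
    """Mutate the hidden global search path."""
    global _pythonpath
    _pythonpath = list(paths)


def find_module_global(name):
    """Find a module file using whatever the global path happens to be."""
    for directory in _pythonpath:
        candidate = os.path.join(directory, name + ".py")
        if os.path.exists(candidate):
            return candidate
    return None


# After: the search path is an ordinary argument, so the function can be
# lifted out and reused without dragging any central state along with it.
def find_module(name, pythonpath):
    """Find a module file in the directories the caller passes in."""
    for directory in pythonpath:
        candidate = os.path.join(directory, name + ".py")
        if os.path.exists(candidate):
            return candidate
    return None
```

The second form is the one that is easy to prise out: everything it needs arrives through its arguments.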

As an aside: I always find it interesting when I come back to code I've written a few years previously - if you'd asked me to list the things I'd learnt about programming between then and now I probably wouldn't have been able to come up with anything tangible, but looking at this I can easily spot old habits and subsequent hard-learnt lessons. These days I have a strong urge to break things into small chunks, and that generally means minimising state wherever possible. I think this is informed by my coding at work, where I sit in an infrastructure team in which everybody has their own programming skills and favourite languages. I write my stuff mostly in Python, whereas other team members prefer Ruby, PHP and Perl and occasionally Java. I think having a successful team with diverse language skills without stifling productivity means that one must minimise the amount of maintenance time spent on any component - essentially building applications to be replaced rather than maintained. Thus all our stuff comes in small packages; big web applications and workflows end up being lots of smaller components knitted together with HTTP, RSS and databases.

Anyway, back to the main picture: I spent a few hours yesterday refactoring the central state out of bicyclerepairman. I've been off work recently with food poisoning and was feeling pretty weak, so a hot cup of tea and a braindead refactoring session sounded like a palatable idea. Actually it turned out to be a game of wits lasting many hours: teasing out the global state, propping up the old algorithms with stubs and hacks, removing them one by one.

Going through this exercise reminded me of a couple of things: 1) Unittests are the only way you can do this sort of thing productively. To go fast you need to make big educated bets about how things work and then be cavalier about changing and simplifying them: a unittest framework tells you when you're wrong, and sooner rather than later.

2) It reminded me why emacs is the king of editors. I've been doubting this a bit recently, but I think this would have taken twice as long without emacs' powerful keyboard macro support. Somebody ought to do an emacs screencast.

Anyway, the code is now clean. Although the external API is the same, internally BRM is a loosely coupled lean machine ripe for further beautification and cleanup. It should be trivial now to use the BRM query and parsing functionality within the protest project. In return, I hope protest will repay bicyclerepairman in the form of testable documentation and a new lease of life.

The moral of the story: I'm now more convinced than ever that singletons and 'central' state are the root of all programming evil. The last 15 years in IT have demonstrated numerous general trends where moving from the central to the decentralized has created a network effect of value - I'm not sure code is any different.

Protest rocks! (generate documentation from tests)

I've been working a bit with Nat Pryce on his 'Protest' project recently. It's a python unit test framework which generates documentation from the tests. E.g. you write test cases like:


class myFunctionTests:
    ''' myfunction is a function belonging to me '''

    def does_something_cool(self):
        # ... code which asserts that something cool is done ...
        pass

    def does_something_else_as_well(self):
        assert something_else

and the test framework will generate web documentation along the lines of :

myFunction
myfunction is a function belonging to me
Features of function myFunction:
  • does something cool
  • does something else as well

It then goes on to show the tests that confirm these statements all nicely marked up and also does some nifty graphviz diagramming stuff - all pretty rinky dinky. I'm hoping to get round to using it to document bicyclerepairman before my motivation runs out.
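The basic trick of turning test names into feature bullets can be sketched in a few lines. This is only an illustration of the idea and not how Protest itself is implemented: the describe function and its output format are my own invention.

```python
import inspect


def describe(test_class):
    """Render a test class as plain-text documentation lines."""
    lines = [test_class.__name__]
    summary = (test_class.__doc__ or "").strip()
    if summary:
        lines.append(summary)
    lines.append("Features:")
    # Class attributes keep their definition order, so the bullets
    # appear in the order the tests were written.
    for name, member in vars(test_class).items():
        if inspect.isfunction(member) and not name.startswith("_"):
            lines.append("  • " + name.replace("_", " "))
    return "\n".join(lines)


class MyFunctionTests:
    """myfunction is a function belonging to me"""

    def does_something_cool(self):
        assert 1 + 1 == 2

    def does_something_else_as_well(self):
        assert "a".upper() == "A"


print(describe(MyFunctionTests))
```

Running it prints the class name, the docstring, and one bullet per test method, with the underscores replaced by spaces.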

Anyway, the really interesting thing is seeing how the documentation informs which tests I write. In general I'm testing stuff that I wouldn't have bothered with before just so that I get some doc for it. It also ensures that documentation doesn't rot since each piece of documentation is tested against the codebase. Sweet!

There's no proper 'release' as such yet, but if you're interested in partially working software then Nat's got the subversion repository in his xspecs sf project - just do a

svn checkout https://svn.sourceforge.net/svnroot/xspecs/protest-python/trunk

(hope that's ok Nat!)

Meaning and identity

Seth posted a response to my 'global identifiers don't scale' post that I didn't expect. His point is that it is the meaning that isn't consistent across the semantic web, not the identifiers. I agree with him about meaning being inconsistent, but it's the distinction that confuses me - in my posts I've conflated identity with meaning, but Seth asserts that they are different things. He says:

"Take, for example, the URI <http://example.org/xyz>. You can say that: <http://example.org/xyz> a :Animal and I can say: <http://example.org/xyz> a :Car The context here is Who Said What. What's not up for question is that there is some Resource identified by <http://example.org/xyz>, and that when we use that identifier we are talking about the same thing."

Now this seems the wrong way round to me. For communication to be successful it must be grounded on common reality and understanding. That reality isn't the labels we use, but the underlying meanings associated with them. I.e. we attach labels to things in the real world, not the other way round.

In the end I don't think the distinction matters much: Regardless of which way you draw it, the important thing is the shared meaning associated with each label - that's how understanding is derived and knowledge transferred. The fact that we're using the same identifier string doesn't amount to anything useful if we're not denoting consistent meanings with it.

Global identifier schemes don’t scale II

Thanks to everybody that commented on my 'global identifiers don't scale' post (and especially for Seth and John's responses). It's obvious that I didn't make myself very clear - sorry about that.

I think I made two main points:

The first was the scalability point (one I failed to make very well at all):

By 'global identifier schemes don't scale', I was referring to the scalability of the global identity approach itself. As the system gets big and less consistent, the global identifiers cannot guarantee to unambiguously identify things across the system, and end up being augmented with a secondary mechanism - context.

I.e. global ids don't scale, but global ids + context do.

I think this is something we agree on. In the case of large RDF installations, the Uniform Resource Identifier (URI) system fails to unambiguously identify resources universally, and so in RDF store software it is augmented with a context/provenance system (e.g. quads). This isn't a theoretical issue with the model, it's a practical one to do with the problems of maintaining identifier consistency in a decentralised universe.
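As a rough sketch of what 'ids + context' looks like in data terms, here are Seth's two statements from the previous post stored as quads, i.e. triples with a fourth context element. The context labels and the helper function are made up for illustration; this isn't the API of any particular RDF store.

```python
from collections import defaultdict

# Each statement is a quad: (subject, predicate, object, context).
# The context labels are hypothetical stand-ins for "who said it".
quads = [
    ("http://example.org/xyz", "a", "Animal", "graph-from-you"),
    ("http://example.org/xyz", "a", "Car", "graph-from-me"),
]


def types_by_context(statements, subject):
    """Group the asserted types of a subject by the context asserting them."""
    result = defaultdict(set)
    for s, p, o, context in statements:
        if s == subject and p == "a":
            result[context].add(o)
    return dict(result)


by_context = types_by_context(quads, "http://example.org/xyz")
```

The URI alone yields two contradictory types; it is the fourth element that tells a client which reading it is looking at.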

The second point was a more contentious one: that a better approach is to relax the 'universally unambiguous' constraint for identifiers and embrace context for disambiguation (which we are forced to do to some degree anyway).

This is contentious mainly because it questions the utility of URIs, which are the root of the W3C semantic web. In the rest of this post I'm going to address URIs specifically, because it's easier to talk about specifics than generalities. (However, I do think these arguments also apply to any global identifier scheme employed for a large decentralized information system.)

Ok, let's start with what we've got: RDF URIs yield identities that we can pretty much guarantee to be consistent across an individual document or communication, but that are only 'probably' consistent with uses in other communications (documents/graphs) due to the decentralized nature of the semantic web. That probability can be evaluated based on the contents, provenance and context of the communication.

Now the scheme I'm advocating is one with relaxed constraints on universalness, instead scoping at the document or communication level. This approach really differs only in the probability of collision across graphs: i.e. we can pretty much guarantee the identifiers to be consistent across an individual document or communication (we can make this a requirement), but the likelihood that they are consistent with terms in another document relies much more heavily on the evaluation of the contents, provenance and context of the communication.

So my argument hinges on whether the benefit of a much higher probability of identifier consistency (but not guaranteed) is worth the high cost of using URIs. By way of 'costs', I listed some downsides in the last post which I won't repeat here, suffice to say that I think these costs are crippling the uptake of the W3C semweb technologies.

On to the million dollar question: How do you deal with disambiguation in a world where the identifiers aren't even remotely guaranteed to be consistent across documents? The answer turns out to be simple - you use a bunch of identifiers in union to disambiguate the term. John Barstow's question provides a convenient example:

"As for Phil's belief that you can use database primary keys without worrying about global namespace collision - how do you do that again when a record has different ids in different databases? Oh, you prepend a namespace so I know which database I’m talking about?"

Actually no, you can disambiguate with extra terms: E.g. the query:

'What is the price of product '987' from the foobah products database?'

can be pretty unambiguous to software agents of the foobah company. Here the terms '987', 'product', 'price' and 'foobah products database' can all be used to narrow the scope and identify a resource.

This is essentially 'identity by description', and is a system which scales pretty well in both directions: communication localized to a narrow context requires a minimal number of identifying terms, which keeps communication simple (e.g. '987'), while wider-context communication can utilise more terms to increase the accuracy over the possible matches/meanings.
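Here is a minimal sketch of how that narrowing might work, assuming records are plain dictionaries. The records, field names and prices are all invented for the example; only '987' and the 'foobah products database' come from the query above.

```python
def resolve(records, **description):
    """Return the records that match every term in the description."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in description.items())
    ]


# Made-up data: three unrelated things that happen to share the id '987'.
records = [
    {"source": "foobah products database", "kind": "product", "id": "987", "price": "4.99"},
    {"source": "foobah staff database", "kind": "employee", "id": "987"},
    {"source": "another products database", "kind": "product", "id": "987", "price": "12.50"},
]

# Narrow context: the bare id is ambiguous across all three sources.
ambiguous = resolve(records, id="987")

# Wider context: adding descriptive terms narrows it to a single record.
exact = resolve(records, id="987", kind="product", source="foobah products database")
```

Inside the foobah products database the bare '987' would already be enough; the extra terms are only needed once the data is mixed with other sources.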

An advantage of a model based on such a scheme is that it can support existing data automatically - it removes the need to create new 'global' identifiers for each thing being imported. Also, by implicitly allowing existing established identity schemes you get a good chance of serendipitous term sharing between documents - a high chance of being able to usefully merge data that was created without coordination.

So I would urge Semantic Web practitioners to consider the possibility of relaxing the 'universal identifier' constraints on identity: you still get to keep precision, but you also get vastly increased simplicity, lower barriers to entry and trivial access to all the existing data on the web! That's a pretty good trade in my opinion.

Global identifier schemes don’t scale

When designing large information systems to hold data from a wide range of sources (e.g. a large company inventory or knowledge base), a common approach is to employ a global identifier scheme so that entities can be referenced unambiguously across the system. A really large scale example of this approach is the W3C Semantic Web effort, which identifies entities with URIs - Uniform Resource Identifiers.

However despite widespread attempts at deployment I actually don't think that the global identifier approach works very well in practice, and is especially sub-optimal when used in large decentralized (uncoordinated) systems. The reason for this is that without painstaking coordination activity, shared identifiers tend to take on meaning which is specific to the context in which they are used. This effectively results in the identifiers not being 'globally' unambiguous anymore.

To illustrate this, let's say I create the global identifier id:PhilDawes to mean me, and I then create the statement:

id:PhilDawes weight 10stone

On more careful analysis it becomes apparent that I'm actually not creating information about the general 'me', but rather a specific 'me' that existed on 7th March 2006 (i.e. when I asserted the information). If I use the same identifier to make similar statements about my weight at other points in my life then the merged information will be inconsistent (a person can't have more than one weight) - this is because the id:PhilDawes being described is actually different in each case.

In effect it means that my identifier isn't really global, or at least if it is I am not using it consistently. You can't merge the information about id:PhilDawes without considering the context under which the data was created to see if the thing being described is actually logically consistent in each case. The identifier is effectively local to the context of the communication.
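The merge problem can be sketched in a few lines. This is just an illustration under some assumptions: statements are (subject, property, value) tuples, 'weight' is treated as single-valued, and the second weight and the context labels are invented.

```python
SINGLE_VALUED = {"weight"}  # properties a thing can have only one value for


def find_conflicts(statements):
    """Return (subject, property, first_value, other_value) for each clash."""
    seen = {}
    conflicts = []
    for subject, prop, value in statements:
        key = (subject, prop)
        if prop in SINGLE_VALUED and key in seen and seen[key] != value:
            conflicts.append((subject, prop, seen[key], value))
        seen.setdefault(key, value)
    return conflicts


def scope(statements, context):
    """Qualify every subject with the context the statement was made in."""
    return [((context, subject), prop, value) for subject, prop, value in statements]


march_2006 = [("id:PhilDawes", "weight", "10stone")]
some_other_time = [("id:PhilDawes", "weight", "11stone")]  # invented figure

# Merged naively, the 'global' identifier now has two weights.
naive = find_conflicts(march_2006 + some_other_time)

# Merged with context attached, the two statements describe different things.
scoped = find_conflicts(
    scope(march_2006, "7th March 2006") + scope(some_other_time, "some other time")
)
```

The naive merge reports one conflict and the scoped merge reports none, which is the sense in which the identifier was local to its context all along.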

Of course this could be considered a matter of precision - instead of making a single unqualified statement, I should have qualified it with enough information to ensure that the id:PhilDawes being described is the more abstract id:PhilDawes that I intended rather than the specific one. e.g.

'On 7th March 2006, id:PhilDawes weight 10stone'

Of course I may need to be even more specific than this in order to ensure consistency - maybe:

'On 7th March 2006 at 09:22.35, id:PhilDawes weight 10stone (without clothes on)'

The problem is that people don't write data like this - shared assumptions are desirable between communicating parties: they reduce the required communication bandwidth. This means that in a large decentralized system you can expect plenty of ambiguity as people share identifiers; the upshot is that the only realistic course of action is to always consider the context when evaluating what the identifier is referring to.

So what does this mean for large decentralised systems that employ global identifier schemes? Well, providing that you always consider the context when evaluating data you should be able to minimise the consistency issues. Except that that kinda defeats the point of using global identifiers in the first place.

Earlier I said that I thought the use of global identifiers was especially sub-optimal for decentralized systems. Here are some downsides to employing a global scheme in such an environment:

1) Identifiers need to be sufficiently large and namespaced to avoid collision

2) Because no one unambiguous system of identification exists, there is a bootstrap problem: identifiers must be 'invented' and communicated for each thing described.

3) Having to find, choose and use identifiers with minimal risk of ambiguity requires effort and represents a high barrier to entry for people creating data. High enough that people often invent their own ids rather than reusing others.

And some effects of these downsides:

(1) makes the serialized data pretty cumbersome for localized data exchange - this fact plagues RDF protocols and results in simpler context-specific systems being used. Both (2) and (3) massively reduce the chance of serendipitous data crossover in an uncoordinated system (the sort of network-effect magic that makes tagging systems useful and popular). (2) also ensures that there's no reliable way of automating the import of external data into the system, short of creating a completely new (unambiguous) set of identifiers (which won't then merge with anything). This inevitably means that importing data into the system becomes a manual job.

Finally, the ground-up architecture of 'no ambiguity' usually means that there's no way of reconciling or disambiguating different uses of the same identifier when they do arise (without resorting to some out-of-band solution). For example in RDF there's no way to compare or reason about two differing 'usages' of the same URI inside the system. This means conflict resolution cannot be articulated within the system.

So what's the alternative? : Local identifiers scoped within the context of the communication (or document or database or whatever). Well actually that's not an alternative, it's what you really had in the first place with the 'global' scheme, but now you're not alluding to the idea of universal consistency.

So now, free from the illusory shackles of unambiguous global identity, you can reuse terms from existing identity schemes such as database primary-keys, zip codes and common language without worrying about global-scope collision. Automated generic importing of external data (e.g. databases, xml etc.) becomes possible, and information serendipitously combines without manual intervention.
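As a rough sketch of what 'automated generic importing' could look like, the function below turns every row of a database table into statements, reusing the existing primary key as a local identifier scoped by its source. It uses Python's sqlite3 module; the table, the data and the statement layout are invented, and it assumes the first column is the primary key and that the table name is trusted.

```python
import sqlite3


def import_table(conn, table, source):
    """Turn each row into (source, table, local_id, column, value) statements.

    Assumes the first column is the primary key and that `table` is a
    trusted name (it is interpolated straight into the SQL).
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    statements = []
    for row in cursor:
        local_id = row[0]  # the existing key is reused as-is, no new global id
        for column, value in zip(columns[1:], row[1:]):
            statements.append((source, table, local_id, column, value))
    return statements


# Made-up example data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id TEXT PRIMARY KEY, price TEXT)")
conn.execute("INSERT INTO product VALUES ('987', '4.99')")

statements = import_table(conn, "product", "foobah products database")
```

No identifiers had to be invented for the import: the source, table and key together do the disambiguation.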

Of course clients must evaluate context and provenance before using data, but in an uncoordinated system like the semantic web (or a company knowledgebase) they had to do that anyway. The added advantage of allowing reuse of existing schemes is that the broader deployment of common language yields opportunities for statistical analysis over the data, which can be used as a tool to assist context evaluation (e.g. see spam filtering, pagerank etc.).

So, to sum up: I think that the idea of universal consistency in a large decentralized system is an illusion, and alluding to it with a global identification scheme imposes unnecessary shackles on the growth and adoption of the system. In short: local scope will happen anyway - you're better off embracing it.

High fibre low fat fruitcake

I've been baking these for a couple of months, but last night's attempt came out really well. Basically I'm trying to make a healthy fruitcake that tastes good and can be eaten in place of biscuits and snacks during coffee breaks etc. I make it in the breadmaker I got for Christmas and it takes about 5 minutes to prepare.

1/2 cup buckwheat flour
1/2 cup soy flour
1/2 cup wholemeal flour
1/2 cup oat bran
4 teaspoons baking powder
1 tsp cinnamon
1/3 cup fruit sugar

1 cup skimmed milk
1/3 cup olive oil
1/2 cup mixed fruit

(The over-the-top selection of flours is because I've been experimenting quite a bit and happen to have a lot of them in my cupboard. The bran increases the fibre content and lowers the GI.)

Instructions - stick all the dry stuff in the breadmaker bowl and mix it up (mainly to make sure the baking powder is mixed in). Then stick the rest of the ingredients in and stir it a bit so that you don't have big clumps of flour (the breadmaker mixing action doesn't appear to be that good at mixing cakes from scratch). Stick it in the breadmaker and put it on the cake setting. Take it out when it looks done - in my breadmaker that's about half an hour before the cake setting finishes.

The ‘meaning’ of an identifier

Here's an attempt at a working definition:

The 'meaning' of an identifier = the complete set of assertions that can be made about it.

(for the purpose of discussion of identity in information systems).

E.g. if an assertion is true for one identifier but not another, then they don't mean (or denote) the same thing.
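A small sketch of that test, assuming assertions are (subject, predicate, object) tuples. The identifiers and assertions are invented, and the code can only compare the assertions it knows about, whereas the definition talks about the complete set.

```python
def meaning(identifier, assertions):
    """The set of (predicate, object) pairs asserted about an identifier."""
    return frozenset((p, o) for s, p, o in assertions if s == identifier)


def same_meaning(a, b, assertions):
    """Two identifiers mean the same thing if the same assertions hold for both."""
    return meaning(a, assertions) == meaning(b, assertions)


# Made-up assertions for illustration.
assertions = [
    ("id:A", "type", "Person"),
    ("id:A", "name", "Phil"),
    ("id:B", "type", "Person"),
    ("id:B", "name", "Phil"),
    ("id:C", "type", "Person"),
]

a_equals_b = same_meaning("id:A", "id:B", assertions)
a_equals_c = same_meaning("id:A", "id:C", assertions)
```

Here id:A and id:B come out as meaning the same thing, while id:C does not because the name assertion is missing for it.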

Does that sound reasonable?