whitebox unit tests slow you down

Is it just me or do whitebox unit tests really bog you down?

I do pretty much all my coding in a test-first stylee; it's the only way to code if you're snatching 20mins here and there for spare time projects. Much of the time these tests serve as scaffolding to keep me on the straight and narrow while I bootstrap up some functionality. Unfortunately after they've served this purpose they just sit there like a ball and chain round my leg slowing any future change in direction.

These days I've got into the habit of converting these tests into more stable blackbox functional tests once there's enough actual functionality to support it. Or I just delete them. Life's too short to be worrying about breaking brittle old tests.

Imaginary numbers: ‘Better Explained’ does it again

Learning Maths suddenly got easier when Kalid entered the scene. Better Explained has a new guide, 'A Visual, Intuitive Guide to Imaginary Numbers', that's every bit as penny-droppingly fantastic as his guide to exponential functions and e.

Trigonometry is great, but complex numbers make ugly calculations simple (like calculating cosine(a+b) ). This is just a preview; later articles will give you the full meal...

Kalid's excitement for maths is infectious and I think this is because he gives off the impression that he's only just got it himself and is desperate to share this enlightenment with the world.

Tidying up factor code

A little utility I wrote for clearing up code: words-not-used. You give it a word and a vocabulary and it tells you all the definitions in the vocab that aren't used by the execution of the word. It's handy for clearing up old code that is no longer used.

(n.b. the word doesn't have to be defined in the vocabulary)

Usage: e.g. "phil.vocab" \ doit words-not-used . { unused-word another-unused-word ... }

Here's the code. Is it useful enough to go in the factor distribution?


USING: kernel assocs definitions vocabs vars sequences namespaces ;
IN: prune-code

VAR: wordset

: add-word ( word -- )
  "" swap wordset> set-at ;

: (recursive-uses) ( defspec -- )
  dup wordset> key? 
  [ drop ] 
  [ dup add-word uses [ (recursive-uses) ] each ] if ;

: recursive-uses ( defspec -- hashtable )
  H{ } clone wordset [ (recursive-uses) wordset> ] with-variable ;

: list>hashtable ( list -- hash )
  H{ } clone tuck [ "" -rot set-at ] curry each ;

! returns all the words in the vocab that are not used, directly or recursively, by defspec
: words-not-used ( vocab defspec -- list ) 
  recursive-uses swap words list>hashtable diff keys ;

URIs are syntactically universal, not semantically universal

Thanks to all who commented to my previous post, it's made me rethink and clarify my position on the problems with scaling Semantic Web technologies. I boiled it down to this:

Semantic Web clients beware: URIs are syntactically universal, not semantically universal.

The rationale for this is that although it is practically impossible for two disconnected parties to 'mint' the same syntactic URI, it is much more likely that two connected parties will use the same URI to refer to two similar but differing concepts.

The conceptual difference may just be missing qualification: e.g. one document referring to somebody in 1995 and another in 2001. Or it could be genuine misunderstanding: for example, our unix server team includes the location of a server as part of its identity - when a server is moved to a different datacentre it effectively becomes a completely different server. Other teams assuming they understand the identity scheme get a surprise when the linked unix server data disappears.

Now this isn't a problem with URIs per se: any other universal identifier scheme would exhibit the same trait. It is however a problem for Semantic Web software which commonly treats URIs as semantically universal without qualification. The bottom line is that before merging RDF graphs one must first manually compare the contexts of each source document to confirm that the URIs in them are mutually compatible.

W3C Semantic Web = Global Ontology after all?

I only just read Jim Hendler's piece from last month "shirkying my responsibility", in which he states that the W3C Semantic Web vision was never about a global shared ontology at all:

"Get it - we are opposing the idea of everyone sharing common concepts."

This seems odd to me, because if that is the case and all communication on the semantic web is local then why is the basic system of identity the URI, a global identifier scheme?

On the contrary, I suspect that the W3C Semantic Web is predicated on global agreement: that all RDF documents containing a URI should use it to identify the same concept, otherwise the whole RDF inference stack breaks. A global ontology that's defined in lots of inter-connected pieces scattered around the web is still a global ontology.

Are you a doer or a talker?

Jeff Atwood's blog is a good source for collected quotes. He's at it again in his latest post: Are you a Doer or a Talker?:

'In software, some developers take up residence on planet architecture.'
'Working code attracts people who want to code. Design documents attract people who want to talk about coding.'

In my experience talkers gravitate towards company technical architecture teams and committees. Luckily DKIB IT doesn't have one, which I think makes it a refreshing place to work.

The Factor Attraction

It's been a couple of months and I'm still slower writing code in Factor than in Python or Scheme. So why am I still writing code in Factor? Well it turns out that the problem is also the attraction:

You can't hack in factor.

In python you can hack out code in multi-line functions, parking results in variables and combining them in subsequent lines. You can have a function that fills half the screen and manipulates >10 local variables and still trivially keep track of what's going on.

Factor on the other hand scales horribly both with respect to lines of code per word (function) and the amount of local state (number of variables). This is because local state is manipulated on a stack which you have to keep track of in your head. Anything exceeding 3 variables and a single line of code and the cognitive load starts to ramp up exponentially. Four variables and the mind is constantly distracted trying to keep track of the stack order. Five and you're spinning wheels, going nowhere fast.

So instead you're constantly having to come up with neat composable abstractions to fold up the state. This is the sort of thing that makes code elegant, loosely coupled and small in any language. However, whereas with other languages splitting your code into neat composable blocks is considered good practice, in factor you simply don't have any choice: you just can't keep parking state on the stack.

Not surprisingly factor is absolutely jam-packed with abstraction mechanisms: blocks, partial application, objects, generic functions, parsing words (macros), dynamic variables, namespaces, modules. You name it. And because the language forces tiny composable reusable functions the library is packed full of them.

Plus, factor has excellent tools for traversing code. This is super important in a language which encourages thousands of discrete one-liners and I think this is the first time I've seen this stuff done properly without tying you to an IDE. (In fact there's so much good here that I should probably devote a whole post to it.)

All this may actually make factor the ultimate teaching language. Why? Well, in other languages the novice programmer can write an awful lot of procedural code using just functional decomposition before feeling the pressure to look for other abstractions. With factor that pressure is there almost from the start and so the novice is forced to search the bigger picture for abstraction possibilities almost constantly. I think it's this abstraction pressure which helps people to 'get' language features in an intuitive, tangible way, and that's pretty much the point of a teaching language.

So will the factor attraction last with me? The interesting thing will be to see if I ever get up to 'scheme speed' with factor. At the moment I have to be really switched on and concentrating each time I code. I'm forced to step back all the time to see if there's a better way to do something, and I'm constantly re-writing stuff. I'm sorta hoping this is just early days - that at some point I'll get to a level of experience where I spot the bigger ideas much quicker. Either way, factor is a lot of fun to use and is compelling in the way an interesting puzzle is. And ultimately I'm sure this is making me a better programmer.

Frequent code checkpointing with git

A nice feature of distributed version control is that you can commit into your repository more often because you aren't impacting others with each commit. Recently I've been taking this to the extreme using git and performing a commit almost at every save. I have an emacs key wired for the job: it saves the buffers and then runs 'git commit -a -m "checkpoint"'. During a coding session I hammer it frequently.
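
A minimal sketch of what such a key could run, assuming a POSIX shell and a repository that already has at least one commit. The function name checkpoint is hypothetical; the commit command is the one quoted above, and the guard just skips the commit when no tracked file has changed.

```shell
#!/bin/sh
# checkpoint: commit all changes to tracked files as a throwaway checkpoint.
# (Hypothetical helper name; assumes HEAD already points at a commit.)
checkpoint() {
    # `git diff --quiet HEAD` exits 0 when tracked files match HEAD
    if git diff --quiet HEAD --; then
        echo "nothing to checkpoint"
    else
        git commit -q -a -m "checkpoint"
    fi
}
```

Like 'git commit -a' itself, this only picks up files git already tracks; new files still need a 'git add' first.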

I've found this approach particularly useful when I'm programming with unfamiliar tools (as I am at the moment with factor). I tend to re-write stuff a lot and occasionally change my mind mid-rewrite, wanting some old code back. Using git in the above manner makes it painless to checkpoint my work continuously and pull back changes from previous revisions when appropriate.

However there's a problem when I want to share my repository with others: nobody else is interested in the 100s of incremental checkpoint commits with no meaningful commit messages; they want to see commits in functional units which tally with the changelog. To make things more legible for them I need to be able to roll up the checkpoints into functional commits with full commit messages.

Originally I assumed that the best way to do this was to have two branches: one for checkpoint commits whilst developing, and a second public one for containing the larger functional commits.

o--o--o--o--o--o--o--o--o  <--- checkpoint (dev) branch
  /  ___/ ______/
 /  /  __/
|  |  |
v  v  v
o--o--o                    <--- public branch ('proper' commits)

I tried various approaches to merging multiple commits from the checkpoint branch into single commits in the public branch, but couldn't find anything that worked.

I think the main problem with the 2 branch idea is that once you've rolled up the commits from one branch into a single commit in the other (e.g. with git merge --squash), they no longer match and so the branches don't have a recent common ancestor. This means git is unable to track the histories by checksum and so each subsequent merge results in conflicts that must be resolved by hand.
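
The ancestry part of this can be checked in a scratch repository. The sketch below is only an illustration under assumed names (branches dev and public, a file notes.txt, a function squash_demo): it squashes two checkpoint commits from dev into public and then asks git for the merge base, which is still the original commit because the squashed commit records no link back to dev.

```shell
#!/bin/sh
# squash_demo: show that `git merge --squash` leaves the merge base untouched.
# Runs in a subshell and a temporary directory, so nothing else is affected.
squash_demo() (
    set -e
    repo=$(mktemp -d)
    cd "$repo"
    git init -q .
    git symbolic-ref HEAD refs/heads/public   # name the first branch 'public'
    git config user.email "demo@example.com"
    git config user.name "Demo"

    echo "base" > notes.txt
    git add notes.txt
    git commit -q -m "base"
    base=$(git rev-parse HEAD)

    # two checkpoint commits on the dev branch
    git checkout -q -b dev
    echo "checkpoint 1" >> notes.txt
    git commit -q -a -m "checkpoint"
    echo "checkpoint 2" >> notes.txt
    git commit -q -a -m "checkpoint"

    # roll them up into one commit on public
    git checkout -q public
    git merge --squash dev >/dev/null
    git commit -q -m "rolled-up checkpoints"

    # the squashed commit has a single parent, so dev is not in its history
    if [ "$(git merge-base public dev)" = "$base" ]; then
        echo "merge base unchanged"
    else
        echo "merge base moved"
    fi
    cd /
    rm -rf "$repo"
)
```

Because the merge base stays put, the next merge from dev replays the already squashed changes against public, which is where the hand-resolved conflicts come from.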

I found the easiest way to get round this was to just copy the content over from one branch to the other rather than merging it. This could be done either by creating patches or by simply checking out the contents of one into the other with 'git checkout branch .' (i.e. checkout branch <path>). This removes the need to resolve conflicts, but the branches still don't share common ancestors. Ultimately you have to manage the branches separately - for example you have to pull external changesets into each branch individually.

In the end the best way turned out to be to dispense with the 2nd public branch altogether and just operate in one. My method is as follows:

I tag the most recent public commit in the branch, and then perform lots of checkpoint commits as I code. When I'm ready to roll up the checkpoint commits into the next 'proper' commit I go back to the previous public commit with:

% git checkout public

Then, assuming master is the branch I've been working in, I check out the contents of the tip of that branch (i.e. all the checkpointed changes) into the working directory and index, but without moving HEAD off the public commit:

% git checkout master .   

Then I move the HEAD of the master branch to this point. I do this by deleting and recreating the branch again:

% git branch -D master
% git checkout -b master

Finally I commit the changes and tag this as the new latest public commit:

% git commit -a
% git tag -f public

And that's it. Now I haven't been using this technique for long, so there's a good chance that something might trip me up in the future - If anybody can see a problem with this (or a better way) then I'd really appreciate a comment.
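
For convenience, the roll-up steps above could be collected into one shell function. This is a sketch rather than a tested tool: the name rollup is hypothetical, it assumes the branch is called master and the tag public as in the commands above, and it expects to be run from the top level of the working tree.

```shell
#!/bin/sh
# rollup MESSAGE: replace every commit made since the `public` tag with a
# single commit on master, then move the tag to it. (Hypothetical helper.)
rollup() {
    msg=$1
    git checkout -q public &&          # detached HEAD at the last public commit
    git checkout master -- . &&        # stage master's latest content; HEAD stays put
    git branch -q -D master &&         # drop the old tip with its checkpoint commits
    git checkout -q -b master &&       # recreate master at the public commit
    git commit -q -a -m "$msg" &&      # one 'proper' commit holding all the changes
    git tag -f public >/dev/null       # mark it as the new latest public commit
}
```

Usage would be something like: rollup "Add words-not-used utility". One thing to watch: this form of checkout only adds or updates files, so a file deleted during the checkpoints would still be present in the rolled-up commit and would need removing by hand.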