Solving the BicycleRepairMan ‘you have to save before you query’ problem

BicycleRepairMan operates by searching and modifying python files on the filesystem, and thus has always required that you save your work before you do a query or a refactoring. I've never felt this to be a big deal before, but more recently I've been using its functionality more aggressively within emacs and I've started to see this as a bit of a pain.

Moreover, as BRM (hopefully) develops 'autocomplete' functionality, IDEs are going to want to pass partially completed (unsaved) code to BRM. I originally thought this problem could be solved by passing a copy of the unsaved buffer through to BRM, but this proved trickier than I thought: the pymacs python bridge for emacs doesn't cope well with large chunks of unescaped text, and even if I fix that I can't expect that other IDEs will be problem free.

The best solution came in the form of emacs 'autosave' files. For those not familiar: emacs periodically saves the contents of unsaved buffers into temporary files (just in case the power goes or something). By default the filename is the original filename wrapped in # characters (e.g. #foo.py#). All I had to do was make emacs auto-save all the modified buffers prior to the query, and then have BRM load these files if they existed and were newer than the 'real' python ones. I can't think of any reason why this shouldn't work with other IDEs - any ideas?
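As a rough sketch of the 'load the autosave file if it exists and is newer' rule (this is not BRM's actual code; the function names are made up for illustration, and the '#name#' naming is the emacs default, which other IDEs won't share):

```python
import os


def autosave_path(path):
    """Hypothetical helper: emacs-style auto-save name, '#name#' next to the file."""
    directory, name = os.path.split(path)
    return os.path.join(directory, "#" + name + "#")


def path_to_read(path):
    """Return the auto-save file if it exists and is newer than the real file."""
    candidate = autosave_path(path)
    if not os.path.exists(candidate):
        return path
    if not os.path.exists(path):
        # Buffer has never been saved: the auto-save file is all there is.
        return candidate
    if os.path.getmtime(candidate) > os.path.getmtime(path):
        return candidate
    return path
```

The comparison is on modification time only, so a stale autosave file left behind after a normal save is ignored.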

BazaarNG and Mercurial and Git

I've been using BazaarNG (bzr) for bicyclerepairman version control recently, but I've also got a close eye on Mercurial (hg) and Git.

Git is Linus' implementation of a distributed SCM tool (sort of) for managing the decentralized development of the Linux kernel. Mercurial is a project started at much the same time, when Linux development steered away from the commercial BitKeeper.

Here are the differences according to my very limited experience:

  • Both mercurial and git feel more snappy and responsive than bzr. The hg command returns immediately, and most operations are O(1) or O(number of files).
  • Bzr and Mercurial are cross-platform and work on Windows. Git only works on Unix (slowly on Cygwin, apparently).
  • Bzr and Mercurial implement a single branch in a working directory. For multiple branches, you need multiple copies of the working directory. Git provides multiple branches in the same working directory, and you change between them with 'git checkout <branchname>'.
  • Bzr is python only, Mercurial is python with a bit of C, Git is C only. Git is currently more of a pain to build - no autoconf.
  • Tailor 0.9.21 can convert both to and from bzr and git repositories, but only to mercurial.
  • Bzr is maintained by a commercial company, which always makes me a little wary - does the development community disappear when the company goes bust?

BicycleRepairMan performance tricks: Masking strings and comments in the source

I didn't have a weblog when I originally wrote the bulk of the bicyclerepairman querying and refactoring functionality, which I think is a shame because it meant that the design decisions never really got documented. In an attempt to rectify this I'm (hopefully) going to sling these up onto my weblog now in the hope that they'll be of use/interest to somebody.

Basically, BRM's parsing and querying engine is stateless - it parses the code fresh off the filesystem for each query or refactoring. AFAIK this is in contrast to other refactoring engines (for other languages), which build a detailed model of the source code up front and operate on that*. Perhaps the most surprising thing is the speed at which it's able to do this - especially if you've ever used the python compiler package, which parses at a snail's pace.

The key to BRM's query speed is careful use of the re regular expression module; basically BRM text-searches the code first to narrow down the search space before embarking on detailed parsing of the interesting sections. In order to do this accurately BRM employs a simple but effective technique - it masks out the contents of strings and comments with '*' characters. This means that occurrences of keywords and variable names in strings and comments won't be found by a regex search.

I've pasted the code of the 'maskStringsAndComments()' function below for reference (it's currently part of the bike.parsing.parserutils module).


import re
import string

escapedQuotesRE = re.compile(r"(\\\\|\\\"|\\\')")

stringsAndCommentsRE =  \
      re.compile("(\"\"\".*?\"\"\"|'''.*?'''|\"[^\"]*\"|\'[^\']*\'|#.*?\n)", re.DOTALL)

allchars = string.maketrans("", "")
allcharsExceptNewline = allchars[: allchars.index('\n')]+allchars[allchars.index('\n')+1:]
allcharsExceptNewlineTranstable = string.maketrans(allcharsExceptNewline, '*'*len(allcharsExceptNewline))


# replaces all chars in a string or a comment with * (except newlines).
# this ensures that text searches don't mistake comments for keywords, and that all
# matches are in the same line/comment as the original
def maskStringsAndComments(src):
    src = escapedQuotesRE.sub("**", src)
    allstrings = stringsAndCommentsRE.split(src)
    # every odd element is a string or comment
    for i in xrange(1, len(allstrings), 2):
        if allstrings[i].startswith("'''") or allstrings[i].startswith('"""'):
            allstrings[i] = allstrings[i][:3]+ \
                           allstrings[i][3:-3].translate(allcharsExceptNewlineTranstable)+ \
                           allstrings[i][-3:]
        else:
            allstrings[i] = allstrings[i][0]+ \
                           allstrings[i][1:-1].translate(allcharsExceptNewlineTranstable)+ \
                           allstrings[i][-1]

    return "".join(allstrings)
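The listing above is Python 2 (string.maketrans and xrange). Here's a sketch of a Python 3 port together with the kind of text-search pre-filter described earlier. The snake_case names and the lines_mentioning() helper are mine, purely for illustration - they're not part of bike.parsing.parserutils.

```python
import re

# Escaped backslashes and escaped quotes: \\  \"  \'
ESCAPED_QUOTES_RE = re.compile(r"(\\\\|\\\"|\\')")

# Same alternatives as the original pattern, one per line.
STRINGS_AND_COMMENTS_RE = re.compile(
    "(" + "|".join([
        r'""".*?"""',   # triple double-quoted string
        r"'''.*?'''",   # triple single-quoted string
        r'"[^"]*"',     # double-quoted string
        r"'[^']*'",     # single-quoted string
        r"#.*?\n",      # comment up to the end of the line
    ]) + ")",
    re.DOTALL)


def _mask(text):
    # Replace every character except newline with '*'.
    return re.sub(r"[^\n]", "*", text)


def mask_strings_and_comments(src):
    src = ESCAPED_QUOTES_RE.sub("**", src)
    parts = STRINGS_AND_COMMENTS_RE.split(src)
    # Every odd element is a string or comment.
    for i in range(1, len(parts), 2):
        part = parts[i]
        n = 3 if part.startswith(("'''", '"""')) else 1
        parts[i] = part[:n] + _mask(part[n:-n]) + part[-n:]
    return "".join(parts)


def lines_mentioning(src, name):
    """Hypothetical pre-filter: 1-based line numbers where `name` occurs in real code."""
    masked = mask_strings_and_comments(src)
    pattern = re.compile(r"\b%s\b" % re.escape(name))
    return [lineno for lineno, line in enumerate(masked.splitlines(), 1)
            if pattern.search(line)]


example = 'x = foo(1)\n# foo is mentioned here\ns = "foo"\ny = foo\n'
print(lines_mentioning(example, "foo"))  # prints [1, 4]
```

Because masking swaps characters one-for-one and leaves newlines alone, offsets and line numbers in the masked text line up with the original source, so a hit can be handed straight to the detailed parser.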

* N.B. BRM used to work that way in the early days, but I found that building a model was far too slow (taking minutes to build), and maintaining the model was cumbersome.

Decentralized Version Control

Have been experimenting with BazaarNG, a decentralized version control system. I find decentralized source control intriguing, mainly because I've been using CVS for so long that it's odd not to have a central coordinating server. Anyway, I've converted the bicyclerepairman CVS tree into a Bazaar branch using Tailor, so you could say I'm sort of committed now.

A benefit of decentralized version control is the low barrier to entry - you don't need to set up and manage a central server, so getting something version controlled is just a case of doing 'bzr init' in the root directory of the project you want versioned. This creates a local branch - note that the working copy is the branch - there's no checking out of a separate copy to develop on. Now, I never got round to creating a subversion repo for my tagtriples stuff, mainly because the overhead didn't seem worth it for a one-man project. I suspect I would have created a bzr one from the word go, and that would have had a number of advantages.

From an opensource perspective the interesting thing is that everyone effectively has their own branch by default. They can publish this branch on the web by just sticking it somewhere (e.g. using rsync to keep it up to date) and then merge and cherry pick updates from other branches trivially - the powerful decentralized tracking algorithms carefully track the provenance and history of each changeset.

The appealing thing about this for me is the ease with which people can join in project development. With CVS or Subversion the project maintainer must approve somebody and give them write access to the source repository for versioned development to happen. As a developer this usually requires creating some un-versioned patches here and there to prove your worth. With Bazaar you just create your own branch from somebody else's public one and away you go - do some versioned changes and then mail the maintainer pointing them at your branch.

Refactoring the singleton state out of Bicyclerepairman

My recent work on protest has rekindled latent interest in bicyclerepairman - my (aging) python refactoring toolkit project. BRM has a number of useful facilities for parsing and querying python source trees, and I think this functionality is just the sort of thing Protest needs for inter-package dependency tracking.

The problem is that these facilities are heavily tangled into the bicyclerepairman codebase and are proving tricky to prise out. The main reason for this is state and singletons.

In the early days BRM worked by constructing a big up-front abstract syntax tree model of the code from which to do its code traversal and manipulations. This central tree was maintained by BRM throughout the refactoring/development session. The 'big-AST' design was informed at the time by the then state-of-the-art refactoring toolkits from the Smalltalk and Java worlds, but the approach turned out to be a dead end: Python is too dynamic to be able to do this sort of up-front code inspection accurately. And building the AST was slooooowww...

Later I migrated the design to a completely different approach which involved dynamically inspecting the code base. This design didn't physically require a big stateful tree to be maintained, but the legacy of long-term state left its hooks in the design well past its sell-by-date. Migrating the design over a period of time meant leaving the illusion of a central AST (so that the old code still worked), but now generated dynamically on the fly in a 'lazy' fashion. Unfortunately this notion of centralized structure still provided implicit hooks for hidden global state (like a common 'pythonpath').

As an aside: I always find it interesting when I come back to code I've written a few years previously - if you'd asked me to list the things I'd learnt about programming between then and now I probably wouldn't have been able to come up with anything tangible, but looking at this I can easily spot old habits and subsequent hard-learnt lessons. These days I have a strong urge to break things into small chunks, and that generally means minimising state wherever possible. I think this is informed by my coding at work, where I sit in an infrastructure team in which everybody has their own programming skills and favourite languages. I write my stuff mostly in Python, whereas other team members prefer Ruby, PHP and Perl and occasionally Java. I think having a successful team with diverse language skills without stifling productivity means that one must minimise the amount of maintenance time spent on any component - essentially building applications to be replaced rather than maintained. Thus all our stuff comes in small packages; big web applications and workflows end up being lots of smaller components knitted together with HTTP, RSS and databases.

Anyway, back to the main picture: I spent a few hours yesterday refactoring the central state out of bicyclerepairman. I've been off work recently with food poisoning and was feeling pretty weak, so a hot cup of tea and a braindead refactoring session sounded like a palatable idea. Actually it turned out to be a game of wits lasting many hours: teasing out the global state, propping up the old algorithms with stubs and hacks, removing them one by one.

Going through this exercise reminded me of a couple of things: 1) Unittests are the only way you can do this sort of thing productively. To go fast you need to make big educated bets about how things work and then be cavalier about changing and simplifying them: a unittest framework tells you when you're wrong, and sooner rather than later.

2) It reminded me why emacs is the king of editors. I've been doubting this a bit recently, but I think this would have taken twice as long without emacs' powerful keyboard macro support. Somebody ought to do an emacs screencast.

Anyway, the code is now clean. Although the external API is the same, internally BRM is a loosely coupled lean machine ripe for further beautification and cleanup. It should be trivial now to use the BRM query and parsing functionality within the protest project. In return, I hope protest will repay bicyclerepairman in the form of testable documentation and a new lease of life.

The moral of the story: I'm now more convinced than ever that singletons and 'central' state are the root of all programming evil. The last 15 years in IT have demonstrated numerous general trends where moving from the central to the decentralized has created a network effect of value - I'm not sure code is any different.

Protest rocks! (generate documentation from tests)

I've been working a bit with Nat Pryce on his 'Protest' project recently. It's a python unit test framework which generates documentation from the tests. E.g. you write test cases like:


class myFunctionTests:    
    ''' myfunction is a function belonging to me '''

    def does_something_cool():
        ... code which asserts that something cool is done ...

    def does_something_else_as_well():
        assert(something_else)

and the test framework will generate web documentation along the lines of:

myFunction
myfunction is a function belonging to me
Features of function myFunction:
  • does something cool
  • does something else as well
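To illustrate the idea (and only the idea - this is not Protest's implementation or API, and describe() is a name I've made up), here's a small sketch that turns a test class's docstring and method names into documentation text of that shape. The methods take self here so that the example runs as ordinary Python.

```python
import inspect


def describe(test_class):
    """Hypothetical helper: render a test class as plain-text documentation."""
    subject = test_class.__name__
    if subject.endswith("Tests"):
        subject = subject[:-len("Tests")]
    lines = [subject,
             (test_class.__doc__ or "").strip(),
             "Features of function %s:" % subject]
    # Class attributes keep their definition order in Python 3.6+.
    for name, member in vars(test_class).items():
        if inspect.isfunction(member) and not name.startswith("_"):
            lines.append("  - " + name.replace("_", " "))
    return "\n".join(lines)


class myFunctionTests:
    ''' myfunction is a function belonging to me '''

    def does_something_cool(self):
        assert True

    def does_something_else_as_well(self):
        assert True


print(describe(myFunctionTests))
```

Each feature line comes straight from a test method name, which is why the documentation can't drift away from what is actually tested.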

It then goes on to show the tests that confirm these statements all nicely marked up and also does some nifty graphviz diagramming stuff - all pretty rinky dinky. I'm hoping to get round to using it to document bicyclerepairman before my motivation runs out.

Anyway, the really interesting thing is seeing how the documentation informs which tests I write. In general I'm testing stuff that I wouldn't have bothered with before just so that I get some doc for it. It also ensures that documentation doesn't rot since each piece of documentation is tested against the codebase. Sweet!

There's no proper 'release' as such yet, but if you're interested in partially working software then Nat's got the subversion repository in his xspecs sf project - just do a

svn checkout https://svn.sourceforge.net/svnroot/xspecs/protest-python/trunk

(hope that's ok Nat!)

BicycleRepairMan 0.9

I've released BRM 0.9 - this is a release containing all the bug fixes from the last few months.

I did intend this to be a much bigger feature-rich release, but I've not been spending much time on BRM recently. I thought it was about time I released something so that people didn't need to go to CVS to get the latest stuff.