Clustering triplestores

I've been thinking a bit about scaling triplestores recently. My mysql based tagtriples store is working well at work but is ultimately limited by the amount of memory you can cram into a single machine. I've recently become seduced by the idea of putting all the bank's data (world's data?) into one massive triplestore, and that really requires clustering somehow.

So how can you utilise a cluster of machines to manage triple-oriented data?

AFAICS the simplest approach would be to just evenly spread the triple indexes across the machines in the cluster. An agent coordinating a query would then need to break the query into indiviual triple patterns, and then send each one out in turn to the members of the cluster (holding the appropriate piece of index), joining the results with that of previous requests.

Assuming that network is the overriding performance bottleneck, this would result in query performance scaling roughly linearly with the number of pattern joins in the query. (assuming that the query join order was optimised appropriately)

One optimisation to this scheme would be to split the data indexes across the cluster nodes according to data source/provenance (i.e. by triple 'graph'). Experience with tagtriples shows that a large proportion of query joins are intra-graph* and so by doing this you could limit the roundtrips to just inter-graph joins.

(* Probably moreso with tagtriples than with RDF since tagtriples doesn't have precise identifiers, so you need to join more to ensure precision)

Server length limitations on HTTP GET URLs

I just did a little bit of testing at work and thought I'd dump it here for future googlers:

We've got an internal app team building some REST webservices and they wanted to know the practical limits of GET URL length. Now browser limits are well known, but since the clients will be programatic they wanted to know what the server limits were.

They're using java and so I tested apache+ajp+tomcat and apache+ajp+jboss, followed by tomcat and jboss standalone. (apache 2.0.49, jboss 4.0.0, tomcat 5.5.9)

In all cases url lengths of 8000 characters were processed correctly, but 8100 characters consistently caused failures in the webservers generating either a 500 error or just closing the socket.

Having thought about it, this could be a tomcat thing since jboss 4.0.0. embeds tomcat as the internal container. If I get the chance I might get round to trying a standalone apache (e.g. using CGI or something). (The http spec puts no limits on URLs btw)

Bzr Vs Mercurial (again)

Ok - for me it boils down to: hg's speed vs bzr's renames

Mercurial is much much quicker. That's not an imperical measurement using a large source tree - it's an anecdotal observation on a tiny one. For example, just typing 'hg' returns in 80ms, 'bzr' takes more than half a second: That's just to dump the help text. Actual commits, logs etc.. appear to take similar order-of-magnitude differences in time.

This shouldn't matter to me; with a small source repository its just a small user experience thing, but for some reason it niggles. Actually I can't help thinking this is an easy fix. Maybe I'll crank up the profiler tonight if I have a spare moment.

On the other hand, afaics mercurial doesn't do directory renames. If I move my source directory structure about I lose the per-file history. That sucks, especially for new projects that haven't quite got their file structures sorted yet.

Solving the Bicyclerepairman ‘you have to save before you query’ problem

BicycleRepairMan operates by searching and modifying python files on the filesystem, and thus has always required that you save your work before you do a query or a refactoring. I've never felt this to be a big deal before, but more recently I've been using its functionality more aggressively within emacs and I've started to see this as a bit of a pain.

Moreover, as BRM (hopefully) develops 'autocomplete' functionality IDEs are going to want to pass partially completed (unsaved) code to BRM. I originally thought this problem could be solved by passing a copy of the unsaved buffer through to BRM, however this proved to be more tricky than I thought - the pymacs python bridge for emacs doesn't cope well with large chunks of unescaped text and even if I fix that I can't expect that other IDEs will be problem free.

The best solution came in the form of emacs 'autosave' files: For those not familiar: emacs periodically saves the contents of unsaved buffers into temporary files (just in case the power goes or something). The filenames are the same as the original filename but prefixed with a # or a dot. All I had to do was make emacs auto-save all the modified buffers prior to the query, and then have BRM load these files if they existed and were newer than the 'real' python ones. I can't think of any reason why this shouldn't work with other IDEs - any ideas?

Ruby and Python

I've been playing with Ruby a bit recently. The big question: is the language better than python ? For me it comes down to a punchup between 2 killer features: Ruby's blocks vs Python's whitespace-indent magic.

Ruby's smalltalk style blocks are great - they neatly support looping, closures and resource management idioms in one simple powerful construct. Much better than the 3 or 4 python alternatives required to support them (for, with, def etc..).

On the other hand, Ruby's superflous 'end' statements everywhere make me balk. As Tim Bray mentioned on a recent podcast: Python is just right about the whitespace thing. The problem is that I've been spoilt by python's killer indents-seperate-blocks for too long to want to go back now.

Will I convert in the end? Hmm... dunno.

BazaarNG and Mercurial and Git

I've been using bazaarNG (bzr) for bicyclerepairman version control recently, but I've also got a close eye on Mercurial(hg) and Git.

Git is Linus' implementation of a distributed SCM tool (sort of) for use with managing the decentralized development of the linux kernel, Mercurial is a project started at much the same time as Linux steered away from the commercial bitkeeper.

Here's the differences according to my very limited experience:

  • Both mercurial and git feel more snappy and responsive than bzr. The hg command returns immediately, most operations are O(1) or O(files).
  • Bzr and Mercurial are x-platform and work on windows. Git only works on unix (slowly on cygwin apparently).
  • Bzr and Mercurial implement a single branch in a working directory. For multiple branches, you need multiple copies of the working directory. Git provides multiple branches in the same working directory, and you change between them with 'git checkout <branchname>'.
  • Bzr is python only, Mercurial is python with a bit of C, Git is C only. Git is currently more of a pain to build - no autoconf.
  • Tailor 0.9.21 can convert both to and from bzr and git repositories, but only to mercurial.
  • Bzr is maintained by a commercial company, which always makes me a little wary - does the development community disappear when the company goes bust?

Cups are better than Metric when cooking

I tried a couple of american recipes recently, and I have to say: the cups system of measuring is so much better than metric for cookery.

Basically, rather than saying 125g or 50 ml or whatever, the recipes say 1/2 cup, or 3/4 cup etc.. Once you've seen what a 'cup' looks like, you can easily visualize how much of each ingredient you need without having to weigh or measure it out.

Decentralized source control allows more frequent committing

One benefit of using the decentralized 'every programmer has their own branch' approach to source control that I hadn't thought of until I tried it: You can commit more frequently.

This is because you can commit without impacting others. With CVS and subversion I usually keep a little commit text file where I pile up the commitlog notes until I have a sourcecode tree stable enough to commit (i.e. all unit tests succeeding). With BazaarNG I just commit away as much as I want, and then rsync to the public branch when I'm done.

N.B. I'm frequently away from a net connection (travelling on trains and buses), and so the times I get to commit aren't always the times when my source tree is stable enough to.

BicycleRepairMan performance tricks: Masking strings and comments in the source

I didn't have a weblog when I originally wrote the bulk of the bicyclerepairman querying and refactoring functionality, which I think is a shame because it meant that the design decisions never really got documented. In an attempt to rectify this I'm (hopefully) going to sling these up onto my weblog now in the hope that they'll be of use/interest to somebody.

Basically, BRM's parsing and querying engine is stateless - it parses the code fresh off of the filesystem for each query or refactoring. AFAIK this is in contrast to other refactoring engines (for other languages) which build up an upfront detailed model of the source code and operate on that*. Perhaps the most surprising thing is the speed in which it's able to do this - especially if you've ever used the python compiler package which parses at a snails pace.

The key to BRM's query speed is in careful leverage of the re regular expression module; basically BRM text-searches the code first to narrow down the search space before embarking on detailed parsing of the interesting sections. In order to do this accurately BRM employs a simple but effective technique - it masks out the contents of strings and comments with '*' characters. This means that occurances of keywords and variable names in strings and comments won't be found by a regex search.

I've pasted the code of the 'maskStringsAndComments()' function below for reference (it's currently part of the bike.parsing.parserutils module).


import re
import string

escapedQuotesRE = re.compile(r"(\\\\|\\\"|\\\')")

stringsAndCommentsRE =  \
      re.compile("(\"\"\".*?\"\"\"|'''.*?'''|\"[^\"]*\"|\'[^\']*\'|#.*?\n)", re.DOTALL)

allchars = string.maketrans("", "")
allcharsExceptNewline = allchars[: allchars.index('\n')]+allchars[allchars.index('\n')+1:]
allcharsExceptNewlineTranstable = string.maketrans(allcharsExceptNewline, '*'*len(allcharsExceptNewline))


# replaces all chars in a string or a comment with * (except newlines).
# this ensures that text searches don't mistake comments for keywords, and that all
# matches are in the same line/comment as the original
def maskStringsAndComments(src):
    src = escapedQuotesRE.sub("**", src)
    allstrings = stringsAndCommentsRE.split(src)
    # every odd element is a string or comment
    for i in xrange(1, len(allstrings), 2):
        if allstrings[i].startswith("'''")or allstrings[i].startswith('"""'):
            allstrings[i] = allstrings[i][:3]+ \
                           allstrings[i][3:-3].translate(allcharsExceptNewlineTranstable)+ \
                           allstrings[i][-3:]
        else:
            allstrings[i] = allstrings[i][0]+ \
                           allstrings[i][1:-1].translate(allcharsExceptNewlineTranstable)+ \
                           allstrings[i][-1]

    return "".join(allstrings)

  • N.B. BRM used to work in that way in the early days, but I found that building a model was far to slow (taking minutes to build), and maintaining the model was cumbersome