BicycleRepairMan performance tricks: Masking strings and comments in the source
Apr 11th, 2006 by Phil Dawes
I didn’t have a weblog when I originally wrote the bulk of the bicyclerepairman querying and refactoring functionality, which I think is a shame because it meant that the design decisions never really got documented.
In an attempt to rectify this I’m (hopefully) going to sling these up onto my weblog now in the hope that they’ll be of use/interest to somebody.
Basically, BRM’s parsing and querying engine is stateless: it parses the code fresh off the filesystem for each query or refactoring. AFAIK this is in contrast to other refactoring engines (for other languages), which build a detailed model of the source code up front and operate on that*. Perhaps the most surprising thing is the speed at which it’s able to do this, especially if you’ve ever used the Python compiler package, which parses at a snail’s pace.
The key to BRM’s query speed is careful use of the re regular expression module: BRM text-searches the code first to narrow down the search space before embarking on detailed parsing of the interesting sections. In order to do this accurately BRM employs a simple but effective technique: it masks out the contents of strings and comments with ‘*’ characters. This means that occurrences of keywords and variable names in strings and comments won’t be found by a regex search.
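To make the narrowing step concrete, here is a minimal sketch of a text search run against raw and masked source. It is an illustration only, not BRM's code: the helper name `candidate_lines` and the sample strings are hypothetical, and the masked text is written out by hand on the assumption that masking keeps every character position and line break intact.

```python
import re


def candidate_lines(src, name):
    """Return the 1-based numbers of lines where `name` occurs as a whole word.

    Hypothetical helper for illustration; it is not part of BRM.
    """
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    return [lineno
            for lineno, line in enumerate(src.splitlines(), 1)
            if pattern.search(line)]


# The same three lines of source, before and after masking (masked by hand).
original = 'x = "foo"  # foo here\nfoo = 1\nprint(foo)\n'
masked = 'x = "***"  #*********\nfoo = 1\nprint(foo)\n'

raw_hits = candidate_lines(original, "foo")   # includes the string/comment line
real_hits = candidate_lines(masked, "foo")    # only lines with real code uses
```

Because the masked text has the same length and line structure as the original, any hit found in it maps straight back to the same line in the real file, which is the only part that then needs detailed parsing.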
I’ve pasted the code of the ‘maskStringsAndComments()’ function below for reference (it’s currently part of the bike.parsing.parserutils module).
# Python 2 code (uses string.maketrans and xrange)
import re
import string

# matches an escaped backslash or an escaped quote (two characters each)
escapedQuotesRE = re.compile(r"(\\\\|\\\"|\\\')")

# matches triple-quoted strings, single-line strings and comments
stringsAndCommentsRE = \
    re.compile("(\"\"\".*?\"\"\"|'''.*?'''|\"[^\"]*\"|'[^']*'|#.*?\n)", re.DOTALL)

# translation table mapping every char except newline to '*'
allchars = string.maketrans("", "")
allcharsExceptNewline = allchars[:allchars.index('\n')] + allchars[allchars.index('\n')+1:]
allcharsExceptNewlineTranstable = string.maketrans(allcharsExceptNewline, '*'*len(allcharsExceptNewline))

# replaces all chars in a string or a comment with * (except newlines).
# this ensures that text searches don't mistake comments for keywords, and that all
# matches are in the same line/comment as the original
def maskStringsAndComments(src):
    # mask escaped quotes first so they cannot end a string early
    src = escapedQuotesRE.sub("**", src)
    allstrings = stringsAndCommentsRE.split(src)
    # every odd element is a string or comment
    for i in xrange(1, len(allstrings), 2):
        if allstrings[i].startswith("'''") or allstrings[i].startswith('"""'):
            # keep the triple-quote delimiters, mask what is between them
            allstrings[i] = allstrings[i][:3] + \
                allstrings[i][3:-3].translate(allcharsExceptNewlineTranstable) + \
                allstrings[i][-3:]
        else:
            # keep the first and last char (quotes, or '#' and newline)
            allstrings[i] = allstrings[i][0] + \
                allstrings[i][1:-1].translate(allcharsExceptNewlineTranstable) + \
                allstrings[i][-1]
    return "".join(allstrings)
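The listing above is Python 2 (string.maketrans on byte strings, xrange). For anyone wanting to try the idea on a current interpreter, here is a rough Python 3 sketch of the same technique. It is an adaptation rather than BRM's code: the names `mask_strings_and_comments` and `_mask_token` are made up for this example, and it swaps the split-and-translate loop for a substitution callback, which should give the same result.

```python
import re

# An escaped backslash or an escaped quote: two characters each.
ESCAPED_QUOTES_RE = re.compile(r"(\\\\|\\\"|\\\')")

# Triple-quoted strings, single-line strings, and comments up to the newline.
STRINGS_AND_COMMENTS_RE = re.compile(
    "(\"\"\".*?\"\"\"|'''.*?'''|\"[^\"]*\"|'[^']*'|#.*?\n)", re.DOTALL)


def _mask_token(match):
    """Mask the inside of one string or comment, keeping its delimiters."""
    tok = match.group(0)
    # Triple-quoted strings keep three delimiter chars at each end; other
    # tokens keep one (a quote, or '#' at the start and '\n' at the end).
    n = 3 if tok.startswith(("'''", '"""')) else 1
    return tok[:n] + re.sub(r"[^\n]", "*", tok[n:-n]) + tok[-n:]


def mask_strings_and_comments(src):
    """Replace string and comment contents with '*', preserving newlines."""
    # Mask escaped quotes first so they cannot terminate a string early.
    src = ESCAPED_QUOTES_RE.sub("**", src)
    return STRINGS_AND_COMMENTS_RE.sub(_mask_token, src)


sample = 'x = "foo"  # foo here\nfoo = 1\n'
masked_sample = mask_strings_and_comments(sample)
```

Every replacement is the same length as the text it replaces, so offsets and line numbers in the masked text line up with the original. As with the regex it is based on, a comment on the very last line is only masked if the file ends with a newline.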
* N.B. BRM used to work that way in the early days, but I found that building a model was far too slow (taking minutes to build), and maintaining the model was cumbersome.

I’m delighted to see signs of your renewed interest in BRM. I’ve become somewhat addicted to Eclipse’s refactoring support for Java and have started working with BRM to get some of the same support when I do Python development. I’ve been astonished at just how effective BRM is. Astonished because the software doesn’t seem to get an enormous amount of attention given how useful it is.
I use SPE which doesn’t have refactoring support but is otherwise a very nice Python IDE. I’ve taken to using BRM’s integration with IDLE — treating that combination as a “refactoring browsing” sidecar to SPE. Not ideal but still *very* useful.
You might want to know that the BRM 0.9 tarball contains a corrupt BicycleRepairMan_IDLE.py — bunch of non-whitespace characters around line 303. I got around it by downloading the HEAD of that file from Sourceforge CVS.
Best regards,
Keith
It’s not public knowledge yet, but SPE will support refactoring through BRM soon.