BicycleRepairMan performance tricks: Masking strings and comments in the source

I didn't have a weblog when I originally wrote the bulk of the BicycleRepairMan (BRM) querying and refactoring functionality, which is a shame because it meant that the design decisions never really got documented. To rectify this, I'm (hopefully) going to sling them up onto my weblog now, in the hope that they'll be of use or interest to somebody.

Basically, BRM's parsing and querying engine is stateless: it parses the code fresh off the filesystem for each query or refactoring. AFAIK this is in contrast to other refactoring engines (for other languages), which build a detailed model of the source code up front and operate on that*. Perhaps the most surprising thing is the speed with which it's able to do this, especially if you've ever used the Python compiler package, which parses at a snail's pace.

The key to BRM's query speed is careful use of the re regular expression module; basically, BRM text-searches the code first to narrow down the search space before embarking on detailed parsing of the interesting sections. To do this accurately, BRM employs a simple but effective technique: it masks out the contents of strings and comments with '*' characters. This means that occurrences of keywords and variable names in strings and comments won't be found by a regex search. Newlines are left alone and every other character is replaced one-for-one, so a match in the masked text sits at the same line and offset as in the original source.
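
To make the "text-search first, parse later" idea concrete, here is a minimal Python 3 sketch. It is not BRM's code: the function name find_function_defs and the dict-of-sources input are made up for illustration, and the standard ast module stands in for BRM's own parser.

```python
import ast
import re


def find_function_defs(sources, name):
    """Return {filename: [line numbers]} for definitions of function `name`.

    `sources` maps a filename to its source text (a hypothetical input
    shape chosen for this sketch, not BRM's API).
    """
    quick = re.compile(r"\bdef\s+" + re.escape(name) + r"\b")
    results = {}
    for filename, text in sources.items():
        # Phase 1: cheap regex scan. Files that cannot contain the
        # definition are rejected here without being parsed.
        if not quick.search(text):
            continue
        # Phase 2: detailed parse, only for the remaining candidates.
        tree = ast.parse(text)
        lines = [node.lineno for node in ast.walk(tree)
                 if isinstance(node, ast.FunctionDef) and node.name == name]
        if lines:
            results[filename] = lines
    return results


sources = {
    "a.py": "def spam():\n    pass\n",
    "b.py": "x = 1\n",
    "c.py": "s = 'def spam'\n",
}
print(find_function_defs(sources, "spam"))
```

In this example b.py is never parsed, but c.py gets through the regex filter only because the text "def spam" appears inside a string literal. That wasted parse is the kind of false candidate that masking strings and comments is meant to eliminate.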

I've pasted the code of the 'maskStringsAndComments()' function below for reference (it's currently part of the bike.parsing.parserutils module).


# Written for Python 2: string.maketrans and xrange are used below.
import re
import string

escapedQuotesRE = re.compile(r"(\\\\|\\\"|\\\')")

stringsAndCommentsRE =  \
      re.compile("(\"\"\".*?\"\"\"|'''.*?'''|\"[^\"]*\"|\'[^\']*\'|#.*?\n)", re.DOTALL)

allchars = string.maketrans("", "")
allcharsExceptNewline = allchars[: allchars.index('\n')]+allchars[allchars.index('\n')+1:]
allcharsExceptNewlineTranstable = string.maketrans(allcharsExceptNewline, '*'*len(allcharsExceptNewline))


# Replaces every char inside a string or a comment with * (except newlines).
# This ensures that text searches don't mistake the contents of strings and
# comments for keywords or names, and that every match is on the same line
# as in the original source.
def maskStringsAndComments(src):
    src = escapedQuotesRE.sub("**", src)
    allstrings = stringsAndCommentsRE.split(src)
    # every odd element is a string or comment
    for i in xrange(1, len(allstrings), 2):
        if allstrings[i].startswith("'''") or allstrings[i].startswith('"""'):
            allstrings[i] = allstrings[i][:3]+ \
                           allstrings[i][3:-3].translate(allcharsExceptNewlineTranstable)+ \
                           allstrings[i][-3:]
        else:
            allstrings[i] = allstrings[i][0]+ \
                           allstrings[i][1:-1].translate(allcharsExceptNewlineTranstable)+ \
                           allstrings[i][-1]

    return "".join(allstrings)
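
For readers on Python 3, where string.maketrans and xrange are not available, the same approach could be sketched roughly as follows. This is an adaptation under assumptions, not the code shipped in bike.parsing.parserutils: the snake_case names are invented here, and a small regex substitution replaces the translation table.

```python
import re

# Escaped backslashes and escaped quotes. They are masked first so that
# an escaped quote cannot be mistaken for the end of a string literal.
ESCAPED_QUOTES_RE = re.compile(r"(\\\\|\\\"|\\')")

STRINGS_AND_COMMENTS_RE = re.compile(
    '(""".*?"""'      # triple double-quoted string
    "|'''.*?'''"      # triple single-quoted string
    '|"[^"]*"'        # double-quoted string
    "|'[^']*'"        # single-quoted string
    "|#.*?\n)",       # comment, up to and including the newline
    re.DOTALL)

NON_NEWLINE_RE = re.compile(r"[^\n]")


def mask_strings_and_comments(src):
    """Replace the contents of strings and comments with '*'.

    Newlines and delimiters are kept, so the result has the same length
    and the same line structure as the input.
    """
    src = ESCAPED_QUOTES_RE.sub("**", src)
    parts = STRINGS_AND_COMMENTS_RE.split(src)
    # Because the pattern has one capturing group, every odd element of
    # the split result is a string or a comment.
    for i in range(1, len(parts), 2):
        part = parts[i]
        n = 3 if part.startswith(("'''", '"""')) else 1
        parts[i] = part[:n] + NON_NEWLINE_RE.sub("*", part[n:-n]) + part[-n:]
    return "".join(parts)


src = 'x = "foo bar"  # foo here\nfoo = 1\n'
masked = mask_strings_and_comments(src)
print(masked)
print(len(re.findall(r"\bfoo\b", src)))     # 3 hits in the raw source
print(len(re.findall(r"\bfoo\b", masked)))  # 1 hit once masked
```

Because the masked text has the same length as the original, the offset of the one remaining match can be used directly against the unmasked source. One limitation visible in the pattern itself: a comment on the last line of a file is only masked if that line ends with a newline.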






* N.B. BRM used to work that way in the early days, but I found that building a model was far too slow (taking minutes to build), and maintaining the model was cumbersome.