Literal searching - grep vs tablescan
Published Oct 29 2004 by Phil Dawes
One of the most useful bits of veudas' functionality is being able to search the knowledgebase for labels (and subproperties) containing a particular string. Unfortunately this is also the slowest operation. Currently veudas' ifp store does a table scan over the nodes table for regex searches - very slow (mysql doesnt support a general regex index).
I'm currently only handling a few million triples in my store, so I'm not ready to invest a large amount of time writing e.g. a suffix tree implementation. I do need a performance boost though, since some text searches take a number of seconds; Given this, I thought I'd compare database performance with flat-file searching using the Wordnet data (222439 literals). Here's the comparison data between (1) mysql table scan and (2) grepping a flat file.
1) put all the literals in 1 table and do a regex table scan
select id from labelcache where value LIKE "%football%" . > 146 matches in 0.32 seconds
2) put all the literals in a file and do a grep and pull results into python
mysql -D ifptest1 -u root -e 'select id,value from nodes where literal = "1"' > nodes
and then...
t = time.time() f = os.popen("grep -i football nodes") ids = [l[:l.index('t')] for l in f] print len(ids),"matches in",time.time() - t,"seconds" . > 146 matches in 0.042 seconds
I found this quite surprising - grepping a file and piping to python is an order of magnitude faster than mysql doing a table scan(!)
The rest of the query (once the results have been reduced to ids) takes 0.02 seconds. It looks like grepping a file may be a quick and easy way of getting a ~x10 performance boost (at least in the short term).
