Suffix array performance problems

Hit some bad performance hotspots at work yesterday with the new suffix stuff.

The main problem occurs when you do text searches with multiple words, some of which are very popular. The query optimiser frequently makes a bad decision, which boils down to not being able to compute the permutations of the popular word, as it crosses 2 joins (suffixes->suffixlinks->triples) before exploding. The biggest problem case was the word 'drkw' (i.e. the name of the company), which occurs in few suffixes (resulting in a good match on the index), but translates to many thousands of literals when linked through to the triplestore. E.g. In a search for 'drkw dam', the optimiser favours using 'drkw' first, since it has fewer suffix index matches, whereas a search for dam would return just a few hundred eventual literals.

Claire's gone to see a friend for a couple of hours, so have got some time to play with this...

Go Spaminator!

Installed Spaminator a few days ago to combat the rise in spam comments on this blog. So far it's correctly identified 49 items of spam with no false positives!

RDF Literal searching using a suffix array

Text searches were becoming a big bottleneck in the knowledge base deployment at work. The grep solution allows speedup of pure text searches (such as a label search), but doesn't work so well when the text search is part of a bigger query.

Ideally I would have used mysql's fulltext indexing capability to index the literals; the problem is that this just indexes full words, and a lot of the searches require substring matches within the data (e.g. when searching for server names, the client often knows the server number 'e.g. 859', but not the full code 'ln3p0859app').

I had some hacking time over the xmas break, and I got round to implementing a basic suffix array within the veudas db schema that maps word suffixes to literals via a links table. This enables the suffixes to be indexed (I index the first 4 characters at the moment) for rapid matching of substrings to literals. The tables are like this:

CREATE TABLE suffixes ( id integer NOT NULL AUTO_INCREMENT, suffix text, KEY (suffix(4)), UNIQUE KEY (id) ) TYPE=MyISAM CREATE TABLE suffixlinks ( suffix integer, len integer, literal integer, KEY (suffix,len) ) TYPE=MyISAM

The length field in the suffixlinks is used in order to reduce the number of suffixes. E.g. the suffix 'aced' matches the word 'places' for 3 letters, and the word 'placed' for 4. The generated SQL snippet for a text search (in this case for substring '859') linking up to the core triples table looks something like this:

SELECT ...WHERE... AND suffixes.suffix LIKE '859%' AND suffixes.id = suffixlnk.suffix AND suffixlinks.len >= 3 AND suffixlinks.literal = triples.object

In order to reduce the number of suffixes I split the literals into words before generating suffixes from them. This means that text searches only match over words rather than whole literals (e.g. a search for 'phil dawes' matches 'dawes,phil' and 'phil dawes' and 'philip dawes'), but this is fine for my knowledge base application.

I would be interested in knowing if there is a better way to index text substrings in a relational database - I couldn't find much information about this on the web.

Veudas and veudastore packages containing this implementation can be downloaded from the veudas sourceforge project page.

veudastore standalone rdf store

I keep meaning to perfect these packages, but it doesnt happen so here's another 'release-early' package. Veudastore is a standalone version of the mysql RDF backend store that runs the veudas knowledge manager. It can be downloaded from the veudas sf project page. From the readme:

INSTALL: -------- - Install mysql and create an empty database USAGE: ------ import veudastore ctx = veudastore.init() ctx.setDatabaseConnectParams(USER,PASSWD,DB,HOST) ctx.createTables() ctx.importFromURISource(sourceuri, graphuri) q = """ prefix foaf: <http://xmlns.com/foaf/0.1/> prefix wnet: <http://xmlns.com/wordnet/1.6/> select ?p where (?p foaf:nick "danbri") (?p rdf:type wnet:Person) """) cols, rows = ctx.querySPARQL(q) for p, in rows: . print p

rdfwrapper

I'm in danger of underusing this blog - I wrote the stuff for this post a while back, but didn't get round to posting it:

Have put together a package of wrapper code I've been developing as I write python RDF applications. It wraps the rdflib and sparta behaviour to present a nice (readable) api. well niceish.

e.g.

s = Store() s.set_prefix_mapping("ex","http://example.com/") phil = s["ex:Phil"] phil.foaf_name = "Phil Dawes" print s.serialize() <?xml version="1.0" encoding="UTF-8"?> <rdf:RDF xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:ex="http://example.com/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" > <rdf:Description rdf:about="http://example.com/Phil"> <foaf:name>Phil Dawes</foaf:name> </rdf:Description> </rdf:RDF>

Note that assigning to a property overwrites any existing property statements against the subject. If you don't want to overwrite, you can use the add() method.

e.g.

phil.add("foaf:nick","Phil") phil.add("foaf:nick","George Dawes - the man with the scores")

gives phil 2 foaf nicknames

You can get an iterator to all the values of foaf:nick by using the get() method:

for nickname in phil.get("foaf:nick"): . print nickname

Resource objects work interchangably with the rdflib search functions

for person in s.subjects(TYPE,s['foaf:Person']): . print person.foaf_name

As an added bonus, I've also added basic n3 parsing support by bolting Sean Palmer's yapps n3 parser to rdflib.

s.load("http://myrdf.com/foo.n3","n3") or s.parse(myN3,"n3")

Hope this helps someone!

wordpress installation fixed

Have fixed my wordpress after somebody deleted all the public writable files on the (shared) server. Was impressed by how easy it was to fix - just unpacked a new version of wordpress and ran the upgrade script.

Thanks Dan for pointing out that it wasn't working.

BTW, Merry Christmas everyone!

Literal searching - grep vs tablescan

One of the most useful bits of veudas' functionality is being able to search the knowledgebase for labels (and subproperties) containing a particular string. Unfortunately this is also the slowest operation. Currently veudas' ifp store does a table scan over the nodes table for regex searches - very slow (mysql doesnt support a general regex index).

I'm currently only handling a few million triples in my store, so I'm not ready to invest a large amount of time writing e.g. a suffix tree implementation. I do need a performance boost though, since some text searches take a number of seconds; Given this, I thought I'd compare database performance with flat-file searching using the Wordnet data (222439 literals). Here's the comparison data between (1) mysql table scan and (2) grepping a flat file.

1) put all the literals in 1 table and do a regex table scan

select id from labelcache where value LIKE "%football%" . > 146 matches in 0.32 seconds

2) put all the literals in a file and do a grep and pull results into python

mysql -D ifptest1 -u root -e 'select id,value from nodes where literal = "1"' > nodes

and then...

t = time.time() f = os.popen("grep -i football nodes") ids = [l[:l.index('t')] for l in f] print len(ids),"matches in",time.time() - t,"seconds" . > 146 matches in 0.042 seconds

I found this quite surprising - grepping a file and piping to python is an order of magnitude faster than mysql doing a table scan(!)

The rest of the query (once the results have been reduced to ids) takes 0.02 seconds. It looks like grepping a file may be a quick and easy way of getting a ~x10 performance boost (at least in the short term).

sparta & my ideal python/rdf api

I've been using sparta quite a bit recently, and have found it to be a really cool way of handling rdf in python. For me it's about 80% there.

The main problem is that fundamentally there's a mismatch between the way python represents objects and rdf constrains resources: In python, an object has at most 1 value for each property. In RDF a resource can have multiple values of the same property (corresponding to multiple subject,property,value statements).

Sparta currently handles this mismatch by using a generator interface for each property allowing you to get each value by doing:

  for value in myresource.property:
      # .. do something with value

This is great for properties that do have multiple values, but I find that most properties in my data don't follow this rule and I have a lot of:

  a = myresource.property.next() 

lines littering my code. Also when setting values on a resource I have to delete the original value before setting it (in order to overwrite it).

  del(myresource.property)
  myresource.property = "newvalue"

Sparta does support a feature that if there exists a statement setting the 'owl:maxCardinality' of a property is '1', it treats the property as a pythonic attribute, allowing you to do

myresource.property = "newvalue" # overwrites the old value print myresource.property # prints "myvalue"

Unfortunately I can see 2 problems with this feature:

  • Using it requires you to effectively declare your properties before use (which inhibits my python coding habits)
  • The addition of an owl:maxCardinality statement can cause existing code to break

Instead, I was wondering if the maxCardinality=1 behaviour could be the default. Something like the following:

.
  print resource.property              # get the value of 'property'. If there are 
                                       # multiple values then arbitrarily pick one 
                                       # (or maybe raise an exception)
.
  for value in resource.property:      # iterate over all values of 'property'
      print value
.
  resource.property = "foo"            # remove any existing 'property' values 
                                       # and assert "foo"
.
  resource.add('property',value)       # add an additional property/value statement 
                                       # for resource

Seems quite natural and doable to me - can anybody see any problems with this?