More import optimisation
Sep 23rd, 2004 by Phil Dawes
Claire’s out tonight, so another evening spent on bulk RDF importing. I’ve managed to get the import of the original 120705-statement dataset down to 77.6 seconds - that’s ~1500 triples a second!
The extra speed was mainly due to removing the need for database URI-to-id lookups by taking an in-memory copy of the hashes table. The problem wasn’t really the lookups of known URIs (which were cached), but rather the need to check each new URI to see if it had already been used (which thwarts the cache each time, since a URI that hasn’t been seen before is never in it). I suspect that Bloom (or DeGan) filters would be really useful here, but I plumped for a Python dictionary of 64-bit md5 hashes (3store style) since it was easy for me to code quickly.
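For illustration, the dictionary approach might look something like the sketch below. This is not the actual import code: the names `uri_hash64` and `UriIdCache`, the use of the first 8 bytes of the md5 digest as the 64-bit key, and the shape of the rows loaded from the hashes table are all assumptions, and hash collisions are simply ignored.

```python
import hashlib


def uri_hash64(uri):
    """Return the first 64 bits of the URI's md5 digest as an unsigned int.

    Assumption: the 64-bit key is a truncated md5 digest.
    """
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class UriIdCache:
    """Hypothetical in-memory copy of the hashes table: hash -> id."""

    def __init__(self, existing_rows=()):
        # existing_rows: (hash, id) pairs read once from the database
        self._ids = dict(existing_rows)
        self._next_id = max(self._ids.values(), default=0) + 1
        # New rows collected here so they can be bulk-inserted later
        self.pending = []

    def get_or_assign(self, uri):
        """Return the id for a URI, allocating one if it is new.

        No database round trip is needed in either case.
        """
        key = uri_hash64(uri)
        uri_id = self._ids.get(key)
        if uri_id is None:
            uri_id = self._next_id
            self._next_id += 1
            self._ids[key] = uri_id
            self.pending.append((key, uri_id, uri))
        return uri_id
```

The point of the design is that the "have I seen this URI before?" check becomes a single dictionary lookup, at the cost of holding every hash in memory for the duration of the import.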
Anyway, working on proprietary data is not much use for benchmarking purposes, so I ran the new import code over the WordNet 1.6 RDF files. I managed to get all 473589 statements into my store in 315 seconds - still around the 1500-per-second mark.
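As a quick sanity check on those rates, the arithmetic can be reproduced from the figures quoted in this post (the function name `triples_per_second` is just a label for this sketch):

```python
def triples_per_second(statements, seconds):
    """Average import throughput over a whole run."""
    return statements / seconds


# (statement count, wall-clock seconds) as quoted in this post
runs = {
    "original dataset": (120705, 77.6),
    "WordNet 1.6": (473589, 315),
}

for name, (statements, seconds) in runs.items():
    rate = triples_per_second(statements, seconds)
    print(f"{name}: {rate:.0f} triples/s")
```

That works out at roughly 1555 and 1503 triples a second respectively.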
For anyone wanting to compare with other stores, the order of import matters - I imported in the following order:
- wordnet_hyponyms-20010201.rdf.xml
- wordnet_nouns-20010201.rdf.xml
- wordnet_glossary-20010201.rdf.xml
- wordnet_similar-20010201.rdf.xml
which appeared to produce the fastest results, although I’ve not looked into why. The import time includes indexing, removing duplicates, and doing some forward-chaining (FC) inferences (although not many).
Oh yeah - I used my work laptop to do the test - it’s a PowerBook G4 with half a gig of RAM running Gentoo Linux. MySQL v4.0.20.
