More import optimisation
Sep 23rd, 2004 by Phil Dawes
Claire’s out tonight, so another evening spent on bulk RDF importing. I’ve managed to get the original 120705-statement dataset import down to 77.6 seconds - that’s ~1500 triples a second!
The extra speed was mainly due to removing the need for database URI-to-id lookups by taking an in-memory copy of the hashes table. The problem wasn’t really the lookups of known URIs (which were cached), but rather the need to check each new URI to see if it’s already been used: a new URI is by definition not in the cache, so every such check fell through to the database. I suspect that Bloom (or DeGan) filters would be really useful here, but I plumped for a Python dictionary of 64-bit hashes derived from md5 (3store style) since it was easy for me to code quickly.
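To make the idea concrete, here is a minimal sketch of that kind of in-memory hash dictionary. It is not the actual import code: the names `uri_hash64` and `HashCache` are made up for illustration, taking the first 8 bytes of the md5 digest is an assumption about how the 64-bit value is derived, and hash collisions are simply ignored.

```python
import hashlib


def uri_hash64(uri):
    """Return the first 8 bytes of the md5 digest of `uri` as an unsigned 64-bit int.

    Assumption: which 8 bytes are used is a guess made for this sketch.
    """
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashCache:
    """Hypothetical in-memory copy of the hashes table."""

    def __init__(self, existing_rows=()):
        # existing_rows: (hash, uri) pairs read once from the database up front.
        self.known = dict(existing_rows)
        # Rows seen for the first time during this import, to be written in bulk later.
        self.pending = []

    def intern(self, uri):
        """Return the id for `uri`, recording it as new if it has not been seen."""
        h = uri_hash64(uri)
        if h not in self.known:  # dictionary lookup, no database round trip
            self.known[h] = uri
            self.pending.append((h, uri))
        return h


cache = HashCache()
ids = [cache.intern(u) for u in (
    "http://example.org/a",
    "http://example.org/b",
    "http://example.org/a",  # repeat: already known, so not queued again
)]
print(len(cache.pending))
```

The point of the design is that the "have I seen this URI before?" question is answered entirely in memory, and only the genuinely new rows in `pending` ever need to go to the database.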
Anyway, working on proprietary data is not much use for benchmarking purposes, so I ran the new import code over the WordNet 1.6 RDF files. Managed to get all 473589 statements into my store in 315 seconds! - still around the 1500 per second mark.
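As a quick check on those rates, the arithmetic can be sketched as below, using only the statement counts and timings quoted in this post. The helper name `triples_per_second` is just for illustration. Both runs come out slightly above 1500 triples a second.

```python
def triples_per_second(statements, seconds):
    """Average import rate over the whole run."""
    return statements / seconds


# Figures quoted in this post.
proprietary_rate = triples_per_second(120705, 77.6)
wordnet_rate = triples_per_second(473589, 315)

print(round(proprietary_rate), round(wordnet_rate))
```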
For anyone wanting to compare with other stores, the order of import matters - I imported in the following order:
- wordnet_hyponyms-20010201.rdf.xml
- wordnet_nouns-20010201.rdf.xml
- wordnet_glossary-20010201.rdf.xml
- wordnet_similar-20010201.rdf.xml
This order appeared to produce the fastest results, although I’ve not looked into why. The import time includes indexing, removing duplicates, and doing some forward-chaining (FC) inferences (although not many).
Oh yeah - I used my work laptop to do the test - it’s a PowerBook G4 with half a gig of RAM running Gentoo Linux. MySQL v4.0.20.

Triple Loading
The one with benchmark results for loading triples into a Redland/MySQL store, including a somewhat pretty diagram.
If you want to test your system with a really large data set (150M triples), have a look at http://www.isb-sib.ch/~ejain/rdf/data/
I believe the only way to load such amounts of data within reasonable time on reasonable hardware is to make use of the underlying database’s bulk loading facilities - I gather you chose a similar approach. We can load 6,000 triples per second, and most of the time goes into building all the indexes…
Hi Eric,
When you say 6000 triples a second, is this from RDF/XML, or from data already parsed into some sort of optimized format?