Feed on
Posts
Comments

I’ve made some modifications to the table layout used by my experimental smushing store. In particular, I’ve ditched the hashes table, and lumped node hashing in with the literals table. This improves smushing performance as it reduces the number of tables to update (the current biggest overhead when smushing resources on ifp).

Here’s the new layout:

CREATE TABLE `triples` (
`graph` integer NOT NULL default ‘0′,
`subject` integer NOT NULL default ‘0′,
`predicate` integer NOT NULL default ‘0′,
`object` integer NOT NULL default ‘0′,
`literal` tinyint(1) NOT NULL default ‘0′,
`inferred` tinyint(1) NOT NULL default ‘0′,
UNIQUE KEY `spog` (`subject`,`predicate`,`object`,`graph`),
KEY `posg` (`predicate`,`object`,`subject`,`graph`),
KEY `ospg` (`object`,`subject`,`predicate`,`graph`),
KEY `gspo` (`graph`,`subject`,`predicate`,`object`),
KEY `gpos` (`graph`,`predicate`,`object`,`subject`),
KEY `gosp` (`graph`,`object`,`subject`,`predicate`)
) TYPE=MyISAM

CREATE TABLE nodes (
id integer NOT NULL AUTO_INCREMENT,
hash bigint(20) NOT NULL,
value text,
PRIMARY KEY (hash),
KEY (id)
) TYPE=MyISAM

CREATE TABLE graphs (
id integer NOT NULL,
UNIQUE KEY (id)
) TYPE=MyISAM

N.B. I added a lot more indexing to the triples table a while back. This pretty much copies the kowari indexing from Paul’s post in May. (note that he wrongly assumes that mysql doesnt support use of compound indexes in his post. I’m sure somebody must have corrected him by now).

The cool thing about this is that all the relational queries can now be done by mysql solely in the indexes (no need crossref with the data file), which speeds things up a bit. Haven’t done any profiling with really large datasets yet, but performs well with a couple of million statements or so.

Most queries (once optimised by the mysql query engine) boil down to:

  1. Translate the literals/uris into IDs via the nodes table (using an md5 hash to rapidly look up the values).
  2. Do the triple pattern joins on the triples table (all within the indexes)
  3. translate the result ids back into values (uris or literals) by joining back to the nodes table.

Ideally I would have got mysql to index the ‘value’ column in the nodes table directly, and lost the md5 hash column altogether. Unfortunately mysql doesnt appear to have hash index functionality for text/blobs. (It indexes the first n bytes of the field - this doesnt work very well with URIs, and means I can’t put a UNIQUE constraint on the index.)

Any comments/suggestions for improvement gratefully received

Viewing 1 Comment

    • ^
    • v
    Nope, nobody corrected me. :-) I tend to operate in a vacuum, and so a lot of the time I could just be plain wrong in my assertions.
    In some ways using MySQL would have been the way to go, since it provides all the transaction support, remote APIs, etc, that we had to build into Kowari. On the other hand, we could never get it to load data very quickly, even without extensive indexing. I've also found that joining a table onto itself numerous times starts to make most relational databases slow down (including MySQL). Tracing a transitive predicate is an extreme example of this.
    Basically It would be great to get all the MySQL features for free, but having a datastore that is purpose built for triples (or quads) seems to be the only way to get really high performance on both loads and queries. Depends on how big you need to scale I suppose.
    Out of curiosity, have you considered different models in your system? Are you doing this with different "databases", tables, or something else?
close Reblog this comment
blog comments powered by Disqus