Feed on
Posts
Comments

The new triplestore is coming along. It can do substring text searches (using a suffix array) and has a basic relational query engine. It doesn’t optimise the query plans yet, but if you enter the queries in a good order (most selective clauses first) then you get good performance.

A few things have changed in my thinking since my last post about indexing. Although using hashes to represent tokens is really useful when joining datasets from different nodes in a cluster (no coordination overhead), I’m now thinking that they’re not such a good idea for when laying out the actual triple indexes in memory (or on disk).

My reasoning is:

In order to get the performance I want (100 row results from relational queries in ~half second) I’m either going to have to keep the entire set of indexes in memory, or at the very least minimise the disk activity to a small number of sequential reads. Disk seeks are the order of ~10ms so 50 of them and I’m shot. If I end up aiming for the all-in-memory approach then I want to cram as many triples in to memory as possible. If I do use disk reads then locality of reference will be fundamental to the layout of the indexes.

Either way, I’m going to need to use compression on the indexes to achieve optimal storage or read efficiency. The problem with hashes is that they introduce a lot of randomness into the mix which reduces the ability to do delta compression (and then run length encoding of the deltas). I suspect that controlling allocation of identifiers could also be very useful in optimising locality of reference. All this is theoretical at the moment as I haven’t actually implemented any index compression, but I hope to do this soon.

Viewing 3 Comments

    • ^
    • v
    re: optimal storage / read efficiency - have you tried reiser4? it does a wonderful job of not wasting disk space. a 'du -k' inside a dir used roughly the same amount of total space as a n3 serialization of the same data. said ~30 mb of data took up 230 mb on ext3. and, about one in every 5 triples is a blog post / news story text where theres a 5K chunk of text - the difference would be even more absurd if not for that. also read back is much faster than your numbers would suggest - its nowhere near 10 ms per call. what kind of drive are you using a 423 mb thing you found in a discared PC on the street?

    as for 'in memory' - the kernel disk cache is a great for 'in memory' - especially in the concurrency department - 10 mongrels can all benefit from it w/o a seperate memcached..

    as for indexing - i havent thought about it much yet - my query engine takes about 0.1 seconds for a basic 'fetch the content, title, author, date, abstract of ___ resources sorted by ascending date'.. hopefully that can be shaved down once i learn some stuff, and your previous post is my jumping off point - thanks!

    oh ya. wheres your source? mines http://whats-your.name/yard
    • ^
    • v
    I've had good experience storing my data in columns. If I sort the data in each column, I'll get very good compression.
    • ^
    • v
    @pigalle: thanks for the comments - I'll take a look at reiser4.
    The ~10ms latency is for a disk seek not a read. What sort of timings are you getting?

    @Seth: Cool - I'm planning on doing the same thing (have you read the research papers for cstore?).
close Reblog this comment
blog comments powered by Disqus

generic acomplia purchase cialis overnight delivery cheap acomplia online buy generic clomid buy cialis low price viagra without prescription where to buy cialis lowest price levitra where to buy propecia cheap cialis from canada lasix no prescription viagra without rx cheap accutane tablets viagra online without prescription viagra no rx buying cialis online zithromax viagra in uk free cialis cialis us where to buy acomplia find cialis online buy viagra lowest price accutane prescription buy cheap accutane online cialis buy buy generic cialis online acomplia order propecia online lowest price synthroid synthroid without a prescription synthroid online buy propecia online cheap levitra online where to buy levitra cialis online review synthroid prices cialis generic cialis buy drug buy viagra on line viagra pharmacy cialis for order price of levitra zithromax online where to buy synthroid soma generic generic clomid propecia online stores viagra cheap drug cheap generic soma cialis cheap zithromax online cheap order accutane online purchase zithromax online purchase viagra online buy cheap clomid cheap generic propecia zithromax pharmacy online pharmacy cialis cheapest acomplia cost of cialis no prescription viagra free viagra purchase lasix online cialis from india viagra from india order discount cialis soma online stores find no rx cialis cialis no rx required find viagra without prescription approved cialis pharmacy lasix discount