Indexing structured data (again)
Apr 29th, 2007 by Phil Dawes
Now that I’m up and running and starting to get productive on Gambit-C, I’ve turned my attention back to indexing structured data.
I’ve modified the tagtriples idea a bit to reflect my experience on importing data in other formats. I still think the most effective approach is not to try and define an interchange format, but instead to create an indexing scheme that can index data from lots of different structured data formats/models.
So the end goal is a system that can easily aggregate and index static structured data from a variety of systems and formats, and provide relational joining across graphs/datasets. Oh, and it’s got to scale.
The data model I’m working with is similar to my tagtriples format except each subject is now system generated, local to the a particular graph/dataset, and opaque (i.e not exposed to the user). RDF officiandos will notice this is similiar to BNodes - think of it as RDF triples where the subject can only be a BNode and properties and objects are values.
#24221 name "Phil Dawes"
#24221 age 31
#24221 likes cheese
I’ve adopted this approach mainly to ease translation from other formats into this one. Automating the creation of good human-readable identifiers for subjects is often impossible when translating data from other formats that don’t contain a natural subject identifier.
Having opaque subjects also means that the concept of identity internal to the graph is specific, but joins across graphs must be achieved by matching properties and objects. This coincides with my ideas about global identity - i.e. that a universal identity scheme is very difficult (/impossible) to implement and coordinate usefully on a large scale. I think the best way of handling identity on a large scale is through a combination of shared context and description using terms from already understood local vocabularies (e.g. english, email addresses, URLs, post codes etc..).
What this means in practice is that the person/agent querying the data must know something of the terms and structures used in the datasets to be able to join them. It also means that they must include additional clauses in the query to provide enough disambiguation to uniquely identifiy terms in the dataset (in the absence of URIs).
E.g. I want the “Phil Dawes” with an email address ‘phil@example.com’.
In practice this disambiguation tends to happen anyway even with RDF data (e.g. witness smushing of FOAF data). This is simply because coordinating artificial identifiers on a large scale is so difficult that people naturally revert to matching commonly understood terms from pre-existing vocabularies.
(hence my previous posts about URIs being a bad tradeoff and not providing much value for the problems they create).
Anyway, from the indexing point of view I’ve ditched a relational database as a backend and am now working with memory mapped files. This provides more flexibility for indexing strategies. I’m going to begin by using sorted arrays for triple indexes as I’m working with relatively static data and this sounds like a good tradeoff of space to search speed. More about ideas on indexing in another post.
From the point of view of internal symbol identifiers, I’m leaning towards using hashes internally to represent symbols rather than incrementally allocated ids. This is similar to the approach adopted by 3store(PDF).
N.B. Personally I don’t think it’s a good tradeoff for 3store. 3store isn’t (wasn’t?) a federated store and so coordinating incremental allocated IDs is easy and would halve the index space (32bits for an ID vs 64 bits for a hash). Index size vs memory is the biggest performance bottleneck in my experience.
For a distributed parallel database however, hashes pay dividends. You don’t need to worry about allocating unique IDs between parallel processes, and more importantly you don’t need to coordinate IDs between nodes for data joining.
Anyway, I’ve got some ideas about distributed indexing so hopefully I’ll get a chance to write about this soon.

I’d consider using an n-ary structure for your indexes instead of just a sorted list: Sort your index, then take every nth item (sized so the number of items returned just fits in a single disk block), and create a node from them. Recurse for each subdivision thus created. This will give you _much_ better locality of reference when you access the index, so for an infrequently used index, you won’t have to read nearly as much into memory.
are there any stores that don’t support named-graphs/quads/contexts and instead focus on providing niceties for a single ’shared context’ as you put it.
my biggest problem has been stuff like - how do you grab all the children of http://mysite/cars without either bypassing the RDF lib and doing “select * from resources where id like ‘http://mysite.com/cars%’” since doing the same in SPARQL is just waaaay to slow, and doing a triple like:
is pretty insane. don’t get me started on count/group_by etc..
[…] wrote a bit about representing structured data in the last post. Here’s some ideas for how I plan to index the […]
[…] Indexing structured data (again) Now that I’m up and running and starting to get productive on Gambit-C, I’ve turned my attention back to indexing structured data. I’ve modified the tagtriples idea a bit to reflect my experience on importing data in other formats. I still think the most effective approach is not to try and … […]
check out www.coppereye.com
already doing this…
… or SAND Technology. Their DNA Access product not only indexes every value, it also deduplicates, compacts and compresses - in some instances the resulting footprint is less than 2% of original data (so 100 Terabytes reduced to 2 Terabytes - but still queryable with SQL).