Feed on
Posts
Comments

In determining the meaning of tokens used in communication there are two widely used approaches to disambiguate that I’ll charactise as ‘namespacing’ and ‘context’.

When humans communicate amongst themselves they use the context of the communication to narrow down the range of possible meanings of terms used in the exchange, and human language doesn’t employ namespaces at all. On the other hand computer identifier schemes typically use namespaces to prevent term clash, and don’t use context at all.

The mechanisms operate differently:

Namespaces:

  • Every use of the namespaced term refers to the same concept.
    (or at least if it doesn’t this is considered an error)
  • Deterministic

Context:

  • the concept denoted to by the term depends on the context in which it is used
  • Statistical ( is that the right word? )

My thinking is that namespaces work well in a closed environment because coordination overhead is low and deterministic programs are easy to write. Namespaced schemes do however require a management mechanism to ensure that each use of the same term denotes exactly the same thing. This works well if the terms are grounded in the system - e.g. on the www a URL is used to fetch a document, and thus its use as identifier for that document is grounded.

However the semantic web is an open environment with little grounding, which means that holistic term coordination and management isn’t practical. Thus web-scale semantic web systems need to employ some degree of context based disambiguation anyway - i.e. the system can’t globally merge statements together without considering issues of provenance and consistency. I wrote about this issue here and at present this consideration is usually handled manually by the person operating the RDF store or software, but as these systems grow and scale more of these issues will need to be addressed by software.

Note that it is important to distinguish between this and the issue of trust in the content of the communication - here I am purely talking about interpreting the meaning of the communication, specifically measuring term consistency between documents from disconnected sources.

Now if you take this this inevitable use of context at web scale as given, my question is: Could the semantic web bootstrap and scale better with a system that disambiguated entirely based on context and didn’t employ namespaces at all? (i.e. like human language communication).

So I’ve been thinking along the lines of a scheme where literals and bnodes are used in place of URIs in RDF documents. Vocabularies use literal terms in place of URIs, and the combination of terms are used to infer meaning in aggregate.

Non-determinism issues aside this approach does have a central advantage: it removes the coordination and bootsrap overhead associated with use of namespaced identifiers, and particularly with issues peculiar to URIs:

  • artificial namespaces mean there’s little term match serendipity between disconnected uncoordinated clients
  • pre-existing identifier schemes are commonly not valid URIs, making reuse difficult
  • URIs introduce unnecessary term ownership, authority and lifecycle issues
  • Other URI proprietary issues add to cognitive overhead: hash v slash, uri denotes document vs thing it describes

One particular advantage of the literals-in-combination approach is that data can be lifted from existing sources without the requirement to invent and translate identifiers into URI schemes. Currently translation of data into traditional RDF consists of two challenges:

  • converting the structure of the data into a triple graph
  • translating the identifiers into a URI scheme

Whereas the former is a one-shot deal for each data format, the latter frequently requires manual input for each document and is IMO the single biggest hurdle to putting data onto the semantic web.

Of course the downside of the approach is that software consuming the data needs to take a non-deterministic approach to term meaning. There is no globally correct answer to ‘does this term in this document mean the same as this one?’ - instead it is a function of both the context under which the documents were written and of the requirements of the querying client.
Unfortunately I suspect that as people try to get traditional w3c semweb technologies to scale up in web scale environments they’re going to find themselves in the same non-deterministic boat.

I’m experimenting with a literals and bnodes approach in my own software and will post updates to my blog.

I like listening to the talis podcasts because they motivate me to think about semantic-web issues. Unfortunately I usually spend the entire session muttering to myself because I disagree with so much that is said.

The issue for me is that speakers often paint a rosy view of a merged data world where ‘if only’ people would adopt RDF and share their ontologies, systems would be able to communicate and share data. OWL is commonly painted in a broad-brushed way as the mechanism that would then enable semantic web interoperability. I have my doubts - deterministic ontologies get complicated and brittle very quickly.

Here’s a test:

If atom were an RDF format (i.e. same data structure, just in RDF), could OWL realistically be used to allow an RSS1.0 app to interpret atom data?

I wrote this post partly as an advocacy piece and partly to put down a bunch of things I’ve learnt about the factor compiler over the last few weeks. I should point out that I’m no expert in this area and so there are probably inaccuracies and omissions - hopefully Slava or one of the factor gurus will point them out. With that in mind, check this out!:

Factor has an optimising compiler which generates machine code as you type code into the REPL. If you have gdb on your system you can see this in action by firing up factor and using the tools.disassembler vocab:

( scratchpad ) : hello "hello" print ;           ! defines the word hello
( scratchpad ) USING: tools.disassembler ;
( scratchpad ) \ hello disassemble

Using host libthread_db library “/lib/tls/i686/cmov/libthread_db.so.1″.
[Thread debugging using libthread_db enabled]
[New Thread -1213614400 (LWP 25688)]
0xffffe410 in __kernel_vsyscall ()
Dump of assembler code from 0xb0f694b0 to 0xb0f694c7:
0xb0f694b0: mov    $0xb0f694b0,%ecx
0xb0f694b5: mov    0xb0f694e4,%ebx
0xb0f694bb: mov    %ebx,0×4(%esi)
0xb0f694be: add    $0×4,%esi
0xb0f694c1: jmp    0xb123e7d0
…
End of assembler dump.

This is very cool in itself, but for me the real beauty of the factor compiler is the very modular design, composed of small pieces that you can pull apart and tinker with in isolation. This makes the compiler accessible to people both as a learning tool and for those wanting to generate highly optimized code for tight loops.

The three stages of the compiler are

  1. Parsing the code and generating a ‘dataflow’ abstract syntax tree. (Also called ‘IR’ - intermediate representation)
  2. Optimizing the dataflow tree
  3. Generating machine code from the dataflow tree

I’ll dig into each of these steps in order:

Stage 1: Parsing factor code to dataflow IR

The first step parses factor code into a dataflow datastructure. You can run and inspect the results of this yourself using the dataflow word:

USE: inference
[ "hello" print ] dataflow pprint

=> T{
    #push
    T{
        node
        f
        f
        f
        V{ T{ value f “hello” 673850 f } }
        f
        f
        f
        f
        f
        f
        T{
            #call
            T{
                node
                f
                print
                V{ T{ value f “hello” 673850 f } }
                V{ }
                f
                f
                f
                f
                f
                f
                T{
                    #return
                    T{ node f f V{ } f f f f f f f f f }
                }
                f
            }
        }
        f
    }
}

Obviously inspecting this datastructure manually is pretty cumbersome, so fortunately there’s some dataflow inspection functionality in the ‘optimizer.debugger’ vocab. The dataflow>quot word renders the dataflow structure back into quotations (code blocks) that you can print and inspect. I use it here to define some words for dataflow pretty-printing:

USE: optimizer.debugger
: print-dataflow f dataflow>quot pprint nl ;
: print-annotated-dataflow t dataflow>quot pprint nl ;

So now we can turn quotations into dataflow graphs and back again:

[ "hello" print ] dataflow print-dataflow
=> [ “hello” print ]

(N.B. there are already words in the optimizer.debugger vocab for displaying optimized dataflows, but for this post I wanted to be able to print dataflows prior to optimisation)

This also works for pre-defined words using ‘word-dataflow’ :

: print-hello "hello" print ;
USE: generator
\ print-hello word-dataflow print-dataflow
=> [ “hello” print ]

In most cases the output quotation will be the same as the input quotation, however there are a couple of expansions that happen at this stage. The first is that words marked ‘inline’ are inlined directly into the dataflow:

: inlinedword "this" "is" "an" "inlined" "word" ; inline
[ inlinedword ] dataflow print-dataflow
=> [ “this” “is” “an” “inlined” “word” ]

Also any compiler-transforms (macros) are evaluated at this stage.

USE shuffle
[ 1 2 2 ndup ] dataflow print-dataflow      ! ndup is a macro
=> [ 1 2 2 drop 2 drop >r dup r> swap 2 drop >r dup r> swap ]

Stage 2: Dataflow Optimisation

Here’s where the fun starts. You can get a feel for how this stage works by looking at ‘optimizer.factor’. Here it is in its entirety:

! Copyright (C) 2006, 2008 Slava Pestov.
! See http://factorcode.org/license.txt for BSD license.
USING: kernel namespaces optimizer.backend optimizer.def-use
optimizer.known-words optimizer.math optimizer.control
optimizer.inlining inference.class ;
IN: optimizer

: optimize-1 ( node -- newnode ? )
    [
        H{ } clone class-substitutions set
        H{ } clone literal-substitutions set
        H{ } clone value-substitutions set
        dup compute-def-use
        kill-values
        dup detect-loops
        dup infer-classes
        optimizer-changed off
        optimize-nodes
        optimizer-changed get
    ] with-scope ;

: optimize ( node -- newnode )
    optimize-1 [ optimize ] when ;

‘optimize’ iteratively calls ‘optimize-1′ until nothing changes in the output graph - i.e. that it has reached a fixed point and no more optimizations can be performed. If you dig into the words used by optimize-1 (try executing them individually and inspecting the dataflow result) you’ll find that optimize-1 performs a number of inferences and optimizations:

  • It tracks the types (classes) of stack elements created within the code block
  • It inlines specific generic word implementations (methods) when it can deduce the class instance on the stack
  • It prunes unused literals and flushable words. (this is actually more useful than it sounds, since other optimisations can generate unused code)
  • It performs branch analysis, marking tail calls in loops and pruning branches that can’t be executed
  • It evaluates ‘foldable’ words at compile time if the values of arguments are known
  • It executes any custom inference code attached to words, allowing words to evaluate their results at compile time if inputs are known

Examples:

evaluating foldable words at compile time

‘+’ is a foldable word (see help for ‘+’), so the optimizer evaluates it at compile time if the values of both arguments are known. Here’s the dataflow before and after optimization:

[ 2 3 + ] dataflow print-dataflow            ! before optimization
=> [ 2 3 + ]

[ 2 3 + ] dataflow optimize print-dataflow   ! after optimization
=> [ 5 ]

type inference and inlining generic word implementations

To illustrate this we first create two tuples (classes) with constructors, and a generic word

TUPLE: classa ;
C: <classa> classa
TUPLE: classb ;
C: <classb> classb 

GENERIC: dosomething ( obj -- val )

Now we create an implementation of the generic word specialised for each class:

M: classa dosomething drop "something for class a" ;
M: classb dosomething drop "something for class b" ;

Finally, some code which calls ‘dosomething’ with an instance of ‘classa’ on the stack, before and after optimization:

[ <classa> dosomething ] dataflow print-dataflow            ! before optimization
=> [  classa drop  classa 2 <tuple -boa> dosomething ]

[ <classa> dosomething ] dataflow optimize print-dataflow    ! after optimization
=> [  classa 2 <tuple -boa> drop “something for class a” ]

It’s a little messy because of the inlined tuple creation, but you can see that prior to optimization ‘dosomething’ is a word call in the dataflow, and afterwards the optimizer has inlined the implementation of ‘dosomething’ specialized on ‘classa’. (if you look you can also see an example of pruning literals here, as the first dataflow has resulted in a superflous ‘\ classa drop’).

This is easier to see if I cheat a bit and use factor’s ‘declare’ word, which declares that elements on the top of the stack are instances of specific classes. So this quotation assumes that top stack element before it is called is of type ‘classa’:

[ { classa } declare dosomething ] dataflow optimize print-dataflow
=> [ drop “something for class a” ]

N.B. you wouldn’t normally use ‘declare’ in user code, but it could be really handy for optimizing performance sensitive tight loops where the results of an external word call are known to the programmer but not the compiler.

conditional folding

The compiler optimizes out conditional branches when it can deduce the outcome of the conditional at compile time:

[ 1 0 =  [ "do if true" ] [ "do if false" ] if ] dataflow optimize print-dataflow
=> [ “do if false” ]

1 isn’t equal to 0 so it optimizes this whole block into the contents of the false quotation.
This is a simple example, but it turns out to be really cool in performance sensitive code (e.g. tight loops) because you can use a generic library function whose behaviour depends on a conditional, specialize it with a hardcoded ‘f’ and the compiler will optimize the conditional branch right out of the resulting code. You get the elegance of the generic combinator with the speed of a hand coded loop.

Stage 3: Machine Code Generation

Code generation is implemented by the ‘generate’ word. This iterates through the nodes calling ‘generate-node’ on each.

: generate-nodes ( node -- )
    [ node@ generate-node ] iterate-nodes end-basic-block ;

: generate ( node word label -- )
    [
        init-generate-nodes
        [ generate-nodes ] with-node-iterator
    ] with-generator ;

‘generate-node’ is a generic word with specialized implementations for each type of dataflow node.

As described in this post from Slava’s excellent Factor blog there are a number of dataflow node types, the important ones being:

* #push - push literals on the data stack
* #shuffle - permute the elements of the data or call stack
* #call - call a word
* #label - an inlined recursive block (loop, etc)
* #if - conditional with two child nodes
* #dispatch - jump table with multiple nodes; jumps to the node indexed by a number on the data stack

The generate-node implementations invoke lower level words in the ‘architecture’ vocabulary, which in turn are generic words that write out small pieces of machine code specialized for each CPU architecture.

The machine code generation code is particularly cool and easy to follow because factor has an assembler DSL for each cpu architecture it supports. The assembler words match the commands and registers of the target cpu architecture and evaluate to their corresponding machine code.

You can even try this out in the REPL using ‘make’ to collect the results into an array. I’m on x86 so I load the cpu.x86.assembler vocabulary:

USE: cpu.x86.assembler

[ EAX 35 MOV ] { } make .   ! postfix assembler evaluates to machine code!

=> { 184 35 0 0 0 }

The assembler DSLs make code generation easy to follow because you can see the assembler in the generation code and then check it against the disassembled machine code using the ‘disassemble’ word we used at the start of the post. When following the code it helps to know which registers are used for what purpose. I found this information in assembler files in the factor VM source - I’m an x86 so for me the declares are in the ‘cpu-x86.32.S’ file:

#define ARG0 %eax
#define ARG1 %edx
#define XT_REG %ecx
#define STACK_REG %esp
#define DS_REG %esi
#define RETURN_REG %eax

#define CELL_SIZE 4

#define PUSH_NONVOLATILE 
	push %ebx ; 
	push %ebp

#define POP_NONVOLATILE 
	pop %ebp ; 
	pop %ebx

register CELL ds asm("esi");
register CELL rs asm("edi");

So lets check this against some generated code for a really basic word that just pops the number 42 on the stack:

: myfunc 42 ;
 myfunc disassemble
=>
Using host libthread_db library “/lib/tls/i686/cmov/libthread_db.so.1″.
[Thread debugging using libthread_db enabled]
[New Thread -1213696320 (LWP 32499)]
0xffffe410 in __kernel_vsyscall ()
Dump of assembler code from 0xb136b230 to 0xb136b247:
0xb136b230: mov    $0xb136b230,%ecx
0xb136b235: mov    $0×150,%ebx
0xb136b23a: mov    %ebx,0×4(%esi)
0xb136b23d: add    $0×4,%esi
0xb136b240: ret 

The first assembler line puts the address of the word into the XT_REG, which is %ecx. For some reason the start of each function puts it’s address into this register - not quite sure why.
The second line puts the number 42 into the ebx register. Note that factor uses the first 3 bits of a value (’cell’) to store its type (called a tag - see layouts.h). In this case it’s a fixnum which is 000. 42 shifted left 3 bits is 336, which in hex is 0×150.
The third line puts the number onto the stack, and the forth updates the stack pointer to point to the new top of the stack.

code generation optimizations

Factor has another couple of tricks up its sleeve during the code generation stages:

optimizing shuffle words

The first is that stack shuffle words (e.g. dup, swap, tuck etc..) don’t get translated into machine code. Instead the compiler has a compile time ‘phantom stack’ which records the positions of items in the stack. When it generates the machine code values are accessed from the stack out of order (the runtime stack is after all a random access piece of memory). This makes stack shuffling words and the retain stack effectively ‘free’ within a code block. A #merge node in the dataflow signifies a code boundary (usually before a subroutine call) which causes the compiler to output instructions which synchronise the physical runtime stack with its phantom stack.

word intrinsics

The second trick is word ‘intrinsics’. Word intrinsics are essentially blocks of open-coded assembler that are output in place of a subroutine call. They are associated with the word via a ‘word-property’, which is a nifty feature of factor that allows meta information to be attached to each word. For example the ‘fixnum+fast’ word has intrinsics which you can see using ‘word-prop’:

 fixnum+fast "intrinsics" word-prop pprint
=>
{
    {
        [ “x” operand “y” operand ADD ]
        H{
            { +output+ { “x” } }
            { +input+ { { f “x” } { [ small-tagged? ] “y” } } }
        }
    }
    {
        [ “x” operand “y” operand ADD ]
        H{
            { +output+ { “x” } }
            { +input+ { { f “x” } { f “y” } } }
        }
    }
}

This tells the compiler to inline the x86 ADD instruction instead of making a subroutine call to the implementation of fixnum+fast. You can add assembler intrinsics to existing words with ‘define-intrinsics’; Here’s a description from the help for the define-intrinsics word:

Defines a set of assembly intrinsics for the word. When a call to the word is being compiled, each intrinsic is tested in turn; the first applicable one will be called to generate machine code. If no suitable intrinsic is found, a simple call to the word is compiled instead.

What I particularly like about this feature is that it neatly provides the ability to specialize a highly optimized implementation for a particular hardware set, and then fall back gracefully on other architectures.

That concludes my ad-hoc tour of the factor compiler. I’ve skipped over a number of things and no doubt there are bits I haven’t discovered yet and some inaccuracies, but I hope I’ve supplied enough information to spark interest in this excellent compiler.

As I mentioned in a previous post I got interested in factor as a direct result of tinkering with jonesforth, which takes you through the entire forth bootstrap process starting with raw assembly. I’ve been delighted to find that factor retains a lot of the ‘right-down-to-the-metal’ accessibility of its low level cousin.

I finally got my F5D7132 wireless repeater working. The trick was to ignore all the auto-negotiate ‘one push setup’ rubbish and actually connect to the web interface of the device.
I don’t use windows, so the installation CD was no good to me, but I found some docs from Belkin that the NIC for the device is on IP 192.168.2.254 by default. The trick was to connect my laptop via ethernet to the repeater and configure the laptop nic to be another address on that range

ifconfig eth0 192.168.2.2

Then I was able to point my browser at http://192.168.2.254/ to configure the repeater, adding the right SID and connection details and manually finding the right wireless network - Yay!

Motivated by Tom Moertel’s ‘A coders guide to coffee’, I’ve been experimenting with roasting my own coffee on the cheap. Here’s my equipment bought to date:

- Bodum 5679 C-Mill Electric Coffee Grinder
- Rival popcorn popper (cost me a fiver from ebay).
- Aeropress brewer
- Digital oven Thermometer 100169 E19

N.B. I didn’t start off roasting: a couple of months ago I bought a grinder and started ordering roasted beans from hasbean. If you live in the UK then HasBean comes highly recommended - the coffee has usually been roasted on the morning of dispatch and the beans arrive through the door the next day in a vacuum sealed bag.

I happily made filter coffee for a couple of weeks before Jay at work encouraged me to get an Aeropress. This is a cheap device that allows you to force hot water through the grinds under pressure to create shots. It’s not the same as a 500 quid espresso maker but I’ve been really happy with the quality of brew I get from this contraption; I don’t think you can get a better cup of coffee for the price and it’s a lot better than the filter in my mind.

Anyway, it was only a matter of time before the lure of subverting cheap consumer appliances to roast coffee proved too tempting. I found a wealth of info on the net about roasting coffee in popcorn poppers and bought a cheap popper for a fiver off of ebay. The best guides I found were:

There’s also a guide to modding your popper on engadget, although I’m not convinced this is strictly necessary. Ed Spiegel’s guide has numerous tips to change the roasting speed and profile without modifying the innards simply by varying bean mass, the tilt of the popper, using an extension cable (which adds resistance and slows the roast) and stirring the beans.

Also the Sweet maria’s site has loads of material on coffee roasting and brewing in general. In particular I found this colour chart really helpful. If you live in the US then consider getting your beans from them as people rave about them all over the internet.

I’m certainly no coffee expert, but I really recommend trying fresh coffee you’ve ground yourself. I understand that ground coffee goes stale in a matter of hours, so pretty much anything ground you buy from a supermarket is already well off by the time you brew it regardless of the vacuum packaging it comes in. Roasted whole beans go stale in a couple of weeks apparently, so it’s best to order them from a supplier that indicates the date of roast. In the UK this means getting them from the internet. If you need more persuading, check out ‘A coders guide to coffee‘ for some top quality coffee advocacy.

At this stage the benefit of home roasting for me is mainly that I can roast what I need for the next few days and since we only drink a couple of cups a day that’s not very much. Also the green beans are pretty cheap even for the gourmet stuff and keep well for years. As I get better at roasting I’m hoping I’ll develop the technique and a palette that’ll give me pleasure experimenting with varieties and varying roasts.
I’m not at that level of precision yet, but I’m certainly getting better tasting coffee than I pick up at starbucks.

Yesterday I was musing over why it took me so much longer to get going with Factor compared to the other languages I’ve learnt over the years. I came up with this:

Factor’s base language is very fundamental and primitive in nature. Programming it is similar to programming with assembler language: you have to keep track of the computation machinery in your head.

Where factor differs from assembler is that careful combination of primitives allows the level of abstraction to be ramped right up. This means that the real productivity action happens not in the base language but in the additional language components built on it in the extensive packaged libraries. Unfortunately until you’re familiar with these libraries you’re stuck programming at a nut-and-bolts level of abstraction.

Contrast this to your typical applicative language, maybe python or ruby: here the core language includes some useful abstractions which allow you to be relatively productive without having up-front intimate knowledge of the libraries. A couple of hours learning the ideas and syntax is enough to get you up and running and feeling like you’re progressing.

So the result is: Factor’s initial learning curve is brutally steep (unless maybe if you’re already proficient in stack languages). Getting to productivity takes time and study, and unfortunately probably more than you’re prepared to give up front if you’re just checking out the language.

Uncovered an awesome ssh trick today. At work we use ssh extensively to run stuff on remote unix and windows servers, using SSH agents to handle batch job authentication. That’s all sweet because on unix you don’t need to install client software on each box, and you don’t need a root account - just copy a public key and you’re good to go. In a big multi-national company not installing stuff means less bureaucracy, and that counts for a lot.

Now unbeknown to me it appears that on top of all that good stuff you can also multiplex ssh sessions over existing connections using just vanilla openssh 4. This is tres-bien-cool for a couple of reasons:

1) Connection times are massively reduced. Like a few milliseconds rather than a couple of seconds.

2) Lots of connections to the same account only consume a single tcp connection and don’t load the cpu with handshakes.

OMG! Surely this makes ssh the killer app for a whole slew of things: monitoring remote processes, interactive remote applications, infact anything that you’d previously have installed client software for. Now why isn’t there a tool industry built around this awesome protocol?

Dan Ehrenberg completes a second short post on garbage collection, expanding on his excellent ‘Quick intro to garbage collection‘ post. Although Dan’s focus is on research prior to implementing a better collector for the factor language, the lucid explanations and chatty style make this a must-read for anybody with a passing interest in language runtime systems.

Is it just me or do whitebox unit tests really bog you down?

I do pretty much all my coding in a test-first stylee; it’s the only way to code if you’re snatching 20mins here and there for spare time projects. Much of the time these tests serve as scaffolding to keep me on the straight and narrow while I bootstrap up some functionality. Unfortunately after they’ve served this purpose they just sit there like a ball and chain round my leg slowing any future change in direction.

These days I’ve got into the habit of converting these tests into more stable blackbox functional tests once there’s enough actual functionality to support it. Or I just delete them. Life’s too short to be worrying about breaking brittle old tests.

Learning Maths suddenly got easier when Kalid entered the scene. Better Explained has a ‘A Visual, Intuitive Guide to Imaginary Numbers‘ that’s every bit as penny-droppingly fantastic as his guide to exponential functions and e.

Trigonometry is great, but complex numbers make ugly calculations simple (like calculating cosine(a+b) ). This is just a preview; later articles will give you the full meal…

Kalid’s excitement for maths is infectious and I think this is because he gives off the impression that he’s only just got it himself and is desperate to share this enlightenment with the world.

A little utility I wrote for clearing up code: words-not-used. You give it a word and a vocabulary and it tells you all the definitions in the vocab that aren’t used by the execution of the word. It’s handy for clearing up old code that is no longer used.

(n.b. the word doesn’t have to be defined in the vocabulary)

Usage: e.g.
“phil.vocab” \ doit words-not-used .
{ unused-word
another-unused-word
… }

Here’s the code. Is it useful enough to go in the factor distribution?


USING: kernel assocs definitions vocabs vars sequences namespaces ;
IN: prune-code

VAR: wordset

: add-word ( word -- )
  "" swap wordset> set-at ;

: (recursive-uses) ( defspec -- )
  dup wordset> key?
  [ drop ]
  [ dup add-word uses [ (recursive-uses) ] each ] if ;

: recursive-uses ( defspec -- hashtable )
  H{ } clone wordset [ (recursive-uses) wordset> ] with-variable ;

: list>hashtable ( list -- hash )
  H{ } clone tuck [ "" -rot set-at ] curry each ;

! returns all the words in the vocab not used by the recursively by defspec
: words-not-used ( vocab defspec -- list )
  recursive-uses swap words list>hashtable diff keys ;

Thanks to all who commented to my previous post, it’s made me rethink and clarify my position on the problems with scaling Semantic Web technologies. I boiled it down to this:

Semantic Web clients beware: URIs are syntactically universal, not semantically universal.

The rationale for this is that although it is practically impossible for two disconnected parties to ‘mint’ the same syntactic URI, it is much more likely that two connected parties will use the same URI to refer to two similar but differing concepts.

The conceptual difference may just be missing qualification: e.g. one document refering to somebody in 1995 and another in 2001. Or it could be genuine misunderstanding: For example our unix server team includes the location of a server as part of its identity - when a server is moved to a different datacentre it effectively becomes a completely different server. Other teams assuming they understand the identity scheme get a surprise when the linked unix server data disappears.

Now this isn’t a problem with URIs per se: any other universal identifier scheme would exhibit the same trait. It is however a problem for Semantic Web software which commonly treats URIs as semantically universal without qualification. The bottom line is that before merging RDF graphs one must first manually compare the contexts of each source document to confirm that the URIs in them are mutually compatible.

I only just read Jim Hendler’s piece from last month “shirkying my responsibility”, in which he states that the W3C Semantic Web vision was never about a global shared ontology at all:

“Get it - we are opposing the idea of everyone sharing common concepts.”

This seems odd to me, because if that is the case and all communication on the semantic web is local then why is the basic system of identity the URI, a global identifier scheme?

On the contrary, I suspect that the W3C Semantic Web is predicated on global agreement: that all RDF documents containing a URI should use it to identify the same concept, otherwise the whole RDF inference stack breaks. A global ontology that’s defined in lots of inter-connected pieces scattered around the web is still a global ontology.

Jeff Atwood’s blog is a good source for collected quotes. He’s at it again in his latest post: Are you a Doer or a Talker?:

In software, some developers take up residence on planet architecture.’

‘Working code attracts people who want to code. Design documents attract people who want to talk about coding.’

In my experience talkers gravitate towards company technical architecture teams and committees. Luckily DKIB IT doesn’t have one, which I think makes it a refreshing place to work.

The Factor Attraction

It’s been a couple of months and I’m still slower writing code in Factor than in Python or Scheme. So why am I still writing code in Factor? Well it turns out that the problem is also the attraction:

You can’t hack in factor.

In python you can hack out code in multi-line functions, parking results in variables and combining them in subsequent lines. You can have a function that fills half the screen and manipulates >10 local variables and still trivially keep track of what’s going on.

Factor on the other hand scales horribly both with respect to lines of code per word (function) and the amount of local state (number of variables). This is because local state is manipulated on a stack which you have to keep track of in your head. Anything exceeding 3 variables and a single line of code and the cognitive load starts to ramp up exponentually. Four variables and the mind is constantly distracted trying to keep track of the stack order. Five and you’re spinning wheels, going nowhere fast.

So instead you’re constantly having to come up with neat composable abstractions to fold up the state. This is the sort of thing that makes code elegant, loosely coupled and small in any language. However where as with other languages splitting your code into neat composable blocks is considered good practice, in factor you simply don’t have any choice: you just can’t keep parking state on the stack.

Not surprisingly factor is absolutely jam-packed with abstraction mechanisms: blocks, partial application, objects, generic functions, parsing words (macros), dynamic variables, namespaces, modules. You name it. And because the language forces tiny composable reusable functions the library is packed full of them.

Plus, factor has excellent tools for traversing code. This is super important in a language which encourages thousands of discreet one liners and I think this is the first time I’ve seen this stuff done properly without tying you to an IDE. (in fact there’s so much good here that I should probably devote a whole post to it)

All this may actually make factor is the ultimate teaching language. Why? - well in other languages the novice programmer can write an awful lot of procedural code using just functional decomposition before feeling the pressure to look for other abstractions. With factor that pressure is there almost from the start and so the novice is forced to search the bigger picture for abstraction possibilites almost constantly. I think it’s this abstraction pressure which helps people to ‘get’ language features in an intuitive tanguable way, and that’s pretty much the point of a teaching language.

So will the factor attraction last with me? The interesting thing will be to see if I ever get up to ’scheme speed’ with factor. At the moment I have to be really switched on and concentrating each time I code. I’m forced to step back all the time to see if there’s a better way to do something, and I’m constantly re-writing stuff. I’m sorta hoping this is just early days - that at some point I’ll get to a level of experience where I spot the bigger ideas much quicker. Either way, factor is a lot of fun to use and is compelling in the way an interesting puzzle is. And ultimately I’m sure this is making me a better programmer.

Stumbled on this intuitive explanation of ‘e’ via a reddit comment today. Absolute internet solid gold - this is the first time I’ve actually understood exponential functions rather than just plugging ‘e’ into a formula.
The rest of the better explained site looks promising too!

A nice feature of distributed version control is that you can commit into your repository more often because you aren’t impacting others with each commit. Recently I’ve been taking this to the extreme using git and performing a commit almost at every save. I have an emacs key wired for the job: it save the buffers and then runs ‘git commit -a -m “checkpoint”‘. During a coding session I hammer it frequently.

I’ve found this approach particularly useful when I’m programming with unfamiliar tools (as I am at the moment with factor). I tend to re-write stuff a lot and occasionally change my mind mid-rewrite, wanting some old code back. Using git in the above manner makes it painless to checkpoint my work continuously and pull back changes from previous revisions when appropriate.

However there’s a problem when I want to share my repository with others: Nobody else is interested in the 100s of incremental checkpoint commits with no commit messages; they want to see commits in functional units which tally with the changelog. To make things more legable for them I need to be able to roll up the checkpoints into functional commits with full commit messages.

Originally I assumed that the best way to do this was to have two branches: one for checkpoint commits whilst developing, and a second public one for containing the larger functional commits.

o--o--o--o--o--o--o--o--o  < --- checkpoint (dev) branch
  /  ___/ ______/
 /  /  __/
|  |  |
v  v  v
o--o--o                    <--- public branch ('proper' commits)

I tried various approaches to merging multiple commits from the checkpoint branch into single commits in the public branch, but couldn’t find anything that worked.

I think the main problem with the 2 branch idea is that once you’ve rolled up the commits from one branch into a single commit in the other (e.g. with git merge –squash), they no longer match and so the branches don’t have a recent common ancestor. This means git is unable to track the histories by checksum and so each subsequent merge results in conflicts that must be resolved by hand.

I found the easiest way to get round this was to just copy the content over from one branch to the other rather than merging it. This could be done either by creating patches or by simply checking out the contents of one into the other with ‘git checkout branch .’ (i.e. checkout branch <path>). This removes the need to resolve conflicts, but the branches still don’t share common ancestors. Ultimately you have to manage the branches separately - for example you have to pull external changesets into each branch individually.

In the end the best way turned out to be to dispense with the 2nd public branch all together and just operate in one. My method is as follows:

I tag the most recent public commit in the branch, and then perform lots of checkpoint commits as I code. When I’m ready to roll up the checkpoint commits into the next ‘proper’ commit I go back to the previous public commit with:

% git checkout public

Then, assuming master is the current branch I checkout the contents of the HEAD of the branch (i.e. all the checkpointed commits) into the working directory, but without moving the index:

% git checkout master .

Then I move the HEAD of the master branch to this point. I do this by deleting and recreating the branch again:

% git branch -D master
% git checkout -b master

Finally I commit the changes and tag this as the new latest public commit:

- git commit -a
- git tag -f public

And that’s it. Now I haven’t been using this technique for long, so there’s a good chance that something might trip me up in the future - If anybody can see a problem with this (or a better way) then I’d really appreciate a comment.

I’ve been playing with Factor for a couple of weeks. I’m finding that it takes me quite a bit longer to write stuff with factor than with other languages, but the process is enjoyable and I get the feeling that I’m learning something useful each time. The question is: will programming speed improve with experience. And more to the point, eventually, will I be able to write code faster in factor than in gambit scheme?

Here’s a task I’ve been trying to accomplish the past couple of evenings: converting tabular data into triples. I’ve written this functionality before in both python and scheme before so it’s a good exercise for developing factor experience.

First, here’s the code in python:


def row_to_triples(i,cols,row):
    return [ (i,col,cell) for col,cell in zip(cols,row) ]

def tabular_to_triples(startid,cols,rows):
    triples = []
    for i,row in zip(range(startid,startid+len(rows)),rows):
        triples += row_to_triples(i, cols, row)
    return triples

Which works like this:


>> print tabular_to_triples(0,
                ["col1","col2","col3"],
                [["a","b","c"],["e","f","g"]])
[(0, 'col1', 'a'), (0, 'col2', 'b'), (0, 'col3', 'c'), (1, 'col1', 'e'), (1, 'col2', 'f'), (1, 'col3', 'g')]

Here’s the equivalent in gambit scheme using srfi-42 eager comprehensions:


(include "srfi-42.scm")

(define (tabular->triples startid rows cols)
  (define (row->triples id cols row)
    (list-ec (:parallel (:list i cols) (:list j row))
             (list id i j)))

  (append-ec (:list row (index i) rows)
             (row->triples (+ i startid) cols row)))

Which yields:


>> (tabular->triples 0
                     '((a b c)(d e f))
                     '(col1 col2 col3))

((0 col1 a) (0 col2 b) (0 col3 c) (1 col1 d) (1 col2 e) (1 col3 f))

ASIDE: If you’re evaluating gambit scheme (or scheme in general) be sure to check out srfi-42 and also Alex Shinn’s ml-style pattern matching module from http://synthcode.com/scheme/ . Gambit scheme out-of-the-box is like abstraction assembly language - you need tools layered over the top for real world use.

Ok, so on to factor. My first attempt was:


: row>triples ( cols row rowid -- triples )
  [ -rot 3array ] curry 2map ;

: process-row ( rowid cols row -- rowid cols triples )
  rot 1 + swapd                           ! inc rownumber
  [ swapd row>triples ] 2keep swap rot ;

: tabular>triples ( start-rowid cols rows -- triples )
  [ process-row ] map concat 2nip ;

Which works like this:


>> { "c1" "c2" "c3" } { { "a" "b" "c" } { "d" "e" "f" } } 0 tabular>triples
>> .
{
    { 1 "c1" "a" }
    { 1 "c2" "b" }
    { 1 "c3" "c" }
    { 2 "c1" "d" }
    { 2 "c2" "e" }
    { 2 "c3" "f" }
}

This is reasonably compact, but looks horrible to me and is pretty difficult to follow due to all the stack shuffling. I’m starting to learn that the way to eliminate stack shuffling is to use combinators, and try to factor out the common code.

I did a bit of functional de-composition with the hope that creating words for the pieces would yield clearer code. The pieces of functionality needed were:

- creating a triple from 3 elements on the stack. Handled by 3array
- enumerating a variable whilst mapping through a sequence
- holding a variable whilst mapping through a sequence. Handled by ‘curry map’
- mapping two sequences in parallel (columns and rows). Handled by 2map
- concatenating sequences (rows) of triples together: concat

Actually once I’d worked out the which words I wanted I found that a most of them already existed. In fact all of them except ‘enumerating a variable whilst mapping through a sequence’, so I wrote ‘map-with-counter’ to provide this. It prepends enumerating code to the map quotation before calling map, then cleans up the index variable at the end:


: map-with-counter ( start seq quot -- newseq )
 [ [ dup 1+ swap ] dip ] swap compose map nip ;

And so few attempts later and I’ve got:


: row>triples ( rowid cols row -- triples )
  [ 3array ] curry** 2map ;

: tabular>triples ( start-rowid cols rows -- triples )
  [ row>triples ] curry* map-with-counter concat ;
 

Which I’m quite pleased with. However in addition to map-with-counter I’ve had to create a new partial application combinator: curry**, which is general but doesn’t exist in the standard library. A sure sign that I’m doing something wrong, or at least differently to other factor developers.
And while I was writing curry** I ended up creating same new stack shufflers:


: rotd ( a b c d -- b c a d )  >r rot r> ;
: -rotd ( a b c d -- c a b d ) >r -rot r> ;
: dupdd ( a b c -- a a b c ) >r dupd r> ;

! partial application of quot based on 3rd item of stack
! see curry, curry*
: curry** ( param obj obj quot -- obj obj curry )
  rotd [ -rotd call ] 2curry ; inline

The underlying theme to these extra words is dealing with the 3rd element in the stack and below. So does that mean that seasoned factor developers tend to switch to some other mechanism when there’s more than 3 stack variables in play? Sounds like a question for the mailing list…

I got round to writing my first factor module tonight: a csv parser.

Actually there’s already a csv parser written in factor by Daniel Ehrenberg, but it’s been removed from the latest factor releases. I found a copy of that code here but I don’t know for how long it’ll stay there.

Unfortunately I had two problems with this existing module: The first was that it ran pretty slowly (1M of csv took ~5 seconds to parse on my laptop) mainly because my copy of factor wouldn’t compile the state-parser module that it depends on so it ran un-optimized. The second was that I needed a parser that could parse a row at a time for reading huge csv files in chunks. I took that as an opportunity to write my own.

The code performs ok (~500ms for 1M csv) and parses all the examples on the wikipedia csv page, but I can’t help feeling that I’ve written it in a similar style to what I would have done if I were using scheme. If anybody has any hints on ways to make the code smaller or faster or more elegant then I’d be delighted.

USING: kernel sequences io namespaces combinators ;
IN: csvparser

DEFER: quoted-field

: not-quoted-field ( -- endchar )
  ",\"nst" read-until   #! "
  dup
  { { CHAR: s  [ drop % not-quoted-field ] } ! skip whitespace
    { CHAR: t  [ drop % not-quoted-field ] }
    { CHAR: ,   [ swap % ] }
    { CHAR: "   [ drop drop quoted-field ] }  ! "
    { CHAR: n  [ swap % ] }
    { f         [ swap % ] }       ! eof
  } case ;

: maybe-escaped-quote ( -- endchar )
  read1
  dup
  { { CHAR: "   [ , quoted-field ] }     ! " is an escaped quote
    { CHAR: s  [ drop not-quoted-field ] }
    { CHAR: t  [ drop not-quoted-field ] }
    [ drop ]
  } case ;

: quoted-field ( -- endchar )
  "\"" read-until                                 ! "
  drop % maybe-escaped-quote ;

: field ( -- string sep )
  [ not-quoted-field ] "" make swap ;

: (row) ( -- sep )
  field swap ,
  dup CHAR: , = [ drop (row) ] when ;

: row ( -- array[string] eof? )
  [ (row) ] { } make swap ;

: (csv) ( -- )
  row swap , [ (csv) ] when ;

: csv-row ( stream -- row )
  [ row drop ] with-stream ;

: csv ( stream -- rows )
  [ [ (csv) ] { } make ] with-stream ;

If anybody’s interested the module (inc tests and doc) is here.

Last year I embarked on somwhat of a journey to find a better language for my home projects after getting a bit frustrated by python’s lack of blocks and general cruftyness. After a couple of months of trying various different things I settled on Gambit Scheme for my spare-time data indexing project. A minimal core language with uniform syntax and macros. Lots of potentual for adapting and building language features that make sense to me. Sorted.

Then last week I got round to reading Richard Jones’s minimal forth code. A language compiler and runtime in a couple of pages of well documented X86 assembler; small enough to read and understand the whole system in a bus journey.

Like Scheme, Forth has the ability to construct new language features, allowing user code to get between the parser and the evaluator to modify the language itself. Also, like Scheme, it has a uniform syntax - words separated by spaces, which makes re-writing code on the fly practical. Both these features mean that a fully fledged programming environment can be bootstrapped up from a minimal core. And the hook is: that minimal core is so much smaller for Forth than it is for Scheme.

One thing lead to another and I got interested in stack languages. Now I’m looking at Factor and wondering…

Factor has ticks in all the right boxes: minimal core, machine code compiler, macros, continuations, lightweight threads and message passing concurrency. It’s also the tersest language I’ve seen - code to do something always seems a fraction of the size I expect it to be. On the other hand it’s so radically different to anything else I’ve programmed in and it makes my head hurt. Could it be a contender to knock Scheme off the top spot?

Older Posts »