enjoying salad since 1978.

Sunday, August 21, 2005

False Continuums

People like to categorize things into continuums, I've noticed. Take databases: embedded (Berkeley DB, Mnesia, Metakit, SQLite) vs. distributed (SQL Server, Kdb+, Matisse, FramerD). This distinction exists only because the authors have tended to specialize in one or the other, not because there is a fundamental reason that embedded and distributed can't be treated as independent axes, with both implemented in a single product.

In a moment we'll discuss what that would be like, but first let's talk about the present: SQL + Java. In this world, mismatches rule. You access your data through specific patterns (Active Record, so masterfully woven into Rails, comes to mind). You put application-agnostic caching layers (Memcached, a poor man's Linda) and application-specific ones (everybody who's written a buggy LRU implementation to cache something in an app you've written, please raise your hand) between your massively expensive relational databases and your cheap appservers. You rely on a thin, flimsy pipe between the host language and your data, drinking the ocean through a leaky straw. Ever have to recompile your Java app because you missed a semicolon or a parenthesis in your SQL? Ever wonder at night if your code is vulnerable to a SQL injection attack? Both are results of the mismatch I'm talking about.
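To make the leaky straw concrete, here is a minimal sketch. It uses Python's standard `sqlite3` module rather than the SQL + Java stack described above, and the `users` table and both helper functions are hypothetical names invented for the illustration. It shows the two symptoms mentioned: SQL built from strings can be hijacked by its input, and a missing parenthesis goes unnoticed until the statement actually runs.

```python
import sqlite3

# In-memory database with a hypothetical table, just for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "admin"), ("bob", "guest")])


def find_unsafe(name):
    # The query is just a string to the host language, so the input
    # can change the structure of the statement.
    sql = "SELECT name FROM users WHERE name = '" + name + "'"
    return conn.execute(sql).fetchall()


def find_safe(name):
    # A bound parameter is always treated as a value, never as SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)).fetchall()


payload = "nobody' OR '1'='1"
unsafe_rows = find_unsafe(payload)  # the injected OR clause matches every row
safe_rows = find_safe(payload)      # no user has that literal name

# A missing parenthesis is invisible to the host language until run time.
try:
    conn.execute("SELECT name FROM users WHERE (name = 'alice'")
    syntax_error_caught = False
except sqlite3.OperationalError:
    syntax_error_caught = True
```

Bound parameters close the injection hole, but the query is still an opaque string that the host language cannot check, which is the mismatch in question.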

Object databases like ZODB and embedded databases like Mnesia offer a type system unified with the host language, but in doing so they are tied to that specific language (Python and Erlang, respectively, in this case) and in some cases to a specific machine.

I want a database where I can store my objects on the network, with locality defined appropriately to the data's usage: in my main memory if I need it, or stored on a disk on a random machine somewhere otherwise. (Why do you think God invented the memory hierarchy? That reminds me: the network should replace the tape drive in the traditional memory hierarchy chart.) I want a unified type system so I can use the data structures already available to me: dictionaries in Python, structs/objects in Common Lisp, Collections in Java, all simultaneously on the same database. I want to be able to define query mechanisms appropriate to my problem space and to work over the data with similarly appropriate means. Whether through Prolog-style backtracking, declarative statements like in SQL, vectors like in K, dataflow variables like in Mozart or E, frames in FramerD, iterators in Java/C++, or even XQuery or RDF, it should be my choice since it's my problem. I should be allowed to make the tradeoffs since it's my butt on the line.
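As a rough sketch of the "my choice of query mechanism" part only, here is a toy in-process store in Python. Everything in it (`ObjectStore`, `scan`, `select`, `column`) is a hypothetical name invented for illustration, and it ignores the network, locality, and cross-language parts of the wish list entirely. It holds plain dictionaries and exposes the same data through three styles: an iterator, a declarative selection, and a column vector.

```python
from typing import Any, Dict, Iterator, List, Optional

Record = Dict[str, Any]


class ObjectStore:
    """Hypothetical toy store: plain Python dicts, several query styles."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def put(self, record: Record) -> None:
        self._records.append(record)

    def scan(self) -> Iterator[Record]:
        # Iterator style: the caller loops and filters by hand.
        return iter(self._records)

    def select(self, fields: List[str],
               where: Optional[Record] = None) -> List[Record]:
        # Declarative style: say what you want, not how to find it.
        where = where or {}
        return [
            {f: r[f] for f in fields}
            for r in self._records
            if all(r.get(k) == v for k, v in where.items())
        ]

    def column(self, name: str) -> List[Any]:
        # Vector style: pull one field out as a list to operate on whole.
        return [r[name] for r in self._records]


store = ObjectStore()
store.put({"name": "btree.lisp", "kind": "source", "size": 12})
store.put({"name": "notes.txt", "kind": "text", "size": 3})
store.put({"name": "query.lisp", "kind": "source", "size": 7})

big = [r["name"] for r in store.scan() if r["size"] > 5]
sources = store.select(["name"], where={"kind": "source"})
total_size = sum(store.column("size"))
```

The records stay ordinary dictionaries throughout, so nothing is translated between the host language and the store; each query style is just a different view over the same objects.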

This is totally buildable. You can write the prototype in Lisp, where you would already have a unified type system (code is data and all that jazz), easy ways to define new query mechanisms (hint: just use macros), fast I/O (most modern Common Lisp implementations compile to native code), and competitors too scared to follow you into battle (Blub paradox and whatnot).

I guess you can tell what I spent my one-day vacation working on. I wish I had more to show than just a buggy B+-tree written in Lisp.

1 Comment:

Blogger Kevin said...

WRT memory hierarchy. The distributed flat filesystem that I'm working on supports an LRU interface with two implementations. One is local (just an in-memory cache) and the other is remote (memcached or whatever you want).

This way you get the best of both worlds, as you can get much better performance for in-memory data than you can for remote data.

So you basically have four levels in the hierarchy:

local in-memory
local on-disk
remote in-memory
remote on-disk
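A minimal sketch of how a lookup might walk those four levels, assuming each level exposes the same `get`/`put` interface. The class names (`LRUCache`, `DictTier`, `TieredStore`) are hypothetical and are not taken from the filesystem described in this comment; a plain dictionary stands in for the disk and remote levels, and no real memcached client is involved.

```python
from collections import OrderedDict
from typing import Dict, List, Optional


class LRUCache:
    """Local in-memory level: evicts the least recently used key when full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the oldest entry


class DictTier:
    """Stand-in for a slower level (local disk, remote memory, remote disk)."""

    def __init__(self, name: str, data: Optional[Dict[str, bytes]] = None) -> None:
        self.name = name
        self._data = dict(data or {})

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value


class TieredStore:
    """Checks levels fastest-first and copies hits into the faster levels."""

    def __init__(self, tiers: List) -> None:
        self.tiers = tiers

    def get(self, key: str) -> Optional[bytes]:
        for i, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                for faster in self.tiers[:i]:
                    faster.put(key, value)
                return value
        return None


local_mem = LRUCache(capacity=2)
local_disk = DictTier("local on-disk")
store = TieredStore([
    local_mem,
    local_disk,
    DictTier("remote in-memory"),
    DictTier("remote on-disk", {"a": b"1", "b": b"2", "c": b"3"}),
])

first = store.get("a")   # found on the remote disk, then promoted
store.get("b")
store.get("c")           # local memory holds only 2 keys, so "a" is evicted
evicted_from_memory = local_mem.get("a") is None
still_on_local_disk = local_disk.get("a") == b"1"
```

Promoting on every hit is one possible policy among several; a real system would also need invalidation and write handling, which this sketch leaves out.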

There could be a slight duplication between local in-memory and remote in-memory if you configure a node to run both memcached and a storage node, but there's no solid way to connect to memcached via shared memory right now (which would be SWEET).

Of course implementing a memcached in Java wouldn't be too hard.

I'm getting ahead of myself though.

4:20 PM

 
