enjoying salad since 1978.

Friday, July 06, 2007

Disk vs. RAM. Round 1.

Kevin says:
I think more and more scalable compute infrastructures are going to cheat and ditch disk (or memory buffered disk) in favor of all in-memory data structures or SSD.

Putting all of your data in memory is not cheating; it's just another trade-off.

This is not a direct rebuttal to Kevin, it's just generic advice to the random engineer who might run across this. To any managers out there: I'm oversimplifying some issues. Your engineers will know which.

Let's look at what Hennessy and Patterson (4th Ed) tell us about memory access times. I've ranked them fastest to slowest.

  1. Register: 250ps
  2. L1 Cache: 1ns
  3. RAM: 100ns
  4. Disk: 10ms

We used to think of each layer of the memory hierarchy as being about an order of magnitude slower than the preceding layer, but that hasn't been true for some time. A disk access at 10ms is 100,000 times slower than a RAM read at 100ns. If you really care about user request latency then you don't want to be hitting disk per-request if you can afford not to.
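To make the gaps between layers concrete, here is a small sketch that divides each access time by the one before it. The numbers are the ones listed above; the function and variable names are just for illustration.

```python
# Access times from the list above, converted to seconds.
ACCESS_TIME_S = {
    "Register": 250e-12,
    "L1 Cache": 1e-9,
    "RAM": 100e-9,
    "Disk": 10e-3,
}

def slowdown_factors(times):
    """Return (faster, slower, factor) for each adjacent pair of layers."""
    names = list(times)
    return [
        (a, b, round(times[b] / times[a]))
        for a, b in zip(names, names[1:])
    ]

for faster, slower, factor in slowdown_factors(ACCESS_TIME_S):
    print(f"{slower} is about {factor:,}x slower than {faster}")
```

The last step, RAM to disk, is the one that matters for this post: it is five orders of magnitude, not one.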

So are you comfortable adding several hundred milliseconds to a request? You might be if you're a seriously resource-constrained startup with a very large amount of data, as a typical IR system such as Kevin's has. If you have a relatively small amount of data then keeping everything in RAM is a pretty amazing trade-off.

As a medium-sized startup, what if you could pay 30 grand and see that money have an immediate impact on those pesky, painful performance graphs you keep? Wouldn't you spend that?

But it's easy to think about those dollars and tell yourself: "well, disks also store bytes, why not stick our bytes in those instead and use this money to pay for an off-site to get away from our performance troubles?" I'm kidding, people don't actually say that out loud, but I've seen some eyes bug out of some sockets when looking at RAM prices. It's just natural to want to use disk for on-line storage, since disks store so much data and seem so much cheaper if you don't think about the latency difference.

But let's talk concretely. I've been talking about "small" and "large" amounts of data. Let's talk about a specific scenario.

You have a popular restaurant review site. Each review is less than half the size of this blog post. That's about 2k. A page has 20 reviews but because you don't have that much RAM, you need to read those 20 reviews from disk. It takes a quarter-second to just read the reviews from disk even with that fancy-pants storage array you bought.

You're very lucky: your site has 10 million reviews. Continuing in my oversimplification, that's like 20 gigabytes. ECC FB-DIMM RAM is going for about $100/gig so that is 20k plus the extra 10k for the machines to hold your new RAM. 30k to automatically drop a quarter of a second from each request seems pretty cheap to me. Not only is your site faster but it now has more capacity since it's not wasting a quarter-second per request just moving a disk head around and reading data into memory where it should have been to begin with.
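Here is the back-of-envelope version of that scenario, as a sketch. It assumes the worst case of one independent access per review and ignores transfer time, queueing, and any caching, which is the same oversimplification the prose makes. The constant and function names are made up for illustration.

```python
DISK_ACCESS_S = 10e-3    # one disk access, from the access-time list
RAM_ACCESS_S = 100e-9    # one RAM access, from the access-time list
REVIEWS_PER_PAGE = 20
REVIEW_BYTES = 2_000     # "about 2k" per review
TOTAL_REVIEWS = 10_000_000

def page_read_seconds(per_access_s, reviews=REVIEWS_PER_PAGE):
    """Time to fetch one page of reviews, one access per review."""
    return per_access_s * reviews

def dataset_gigabytes(reviews=TOTAL_REVIEWS, review_bytes=REVIEW_BYTES):
    """Total size of all reviews in (decimal) gigabytes."""
    return reviews * review_bytes / 1e9

print(f"disk: {page_read_seconds(DISK_ACCESS_S) * 1e3:.0f} ms per page")
print(f"RAM:  {page_read_seconds(RAM_ACCESS_S) * 1e6:.0f} us per page")
print(f"data: {dataset_gigabytes():.0f} GB")
```

Twenty disk accesses come to 200 ms, which is where the rough quarter-second figure comes from. The same page read from RAM costs a couple of microseconds.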

One last point: It only feels like cheating because you have less to worry about and it is more expensive. Unfortunately, there's a certain class of naive engineer who will think that they can build a system without the expense and with 99.9% of the gains. (Naive engineers love to throw 99.9% around!) They'll try to convince you that you just need another layer of LRU caches or some more memcached installations. They'll be a hero! They'll get a Founder's Award! Their company's VCs will take notice and promise to fund anything they start! Fire that guy. Use his salary to buy more RAM.

9 Comments:

Blogger Signee said...

Some things to keep in mind with regard to this seeming cost vs. performance home-run.

If you do choose to build a large memory cache, the questions still remain:

1. How to get all the bytes from the disk to the chip? i.e. what mechanism will you use to get the relatively cold disk to the warm or hot cache?

2. Do you have the expertise in-house or at the very least in-framework (rails, django, spring, struts...) to get your memory cache full of the data that is most relevant to your users, and further, to use it in the most advantageous ways for your application?

3. What are the factors that might impact the performance gained by going from disk to ram? Context switching in the CPU on a machine dedicated to a memory cache can be a killer. (to be fair, the latest version of memcached proved to reduce this issue by over 60% !!! on my servers)

12:24 AM

 
Blogger Steve Jenson said...

Great questions. I was assuming that whatever fancy database you're using would just continue using as much RAM as you can throw at it. That's been true in my experience with every database I've used (mysql, oracle, postgres, bigtable, sql server).

My fake numbers were made assuming you left every other cache you have alone, so no new expertise is needed.

12:32 AM

 
Blogger burtonator said...

Actually I think you're wrong here on some things (sorry).

It would be easier (and cheaper) to build a MySQL/PostgreSQL and memcached install than it would be to throw your whole DB into memory.

Most DBs don't really scale into cluster environments right now. You'd basically have to build your own sharded/federated database and this would cost a lot more in engineering time than it would to just buy the hardware.

Your example of course was with 20GB. You could buy a single machine image with 32G of ram for say 20k-30k.

You could buy a slave backup for another 30k to prevent system downtime in a failed master.

The problem really starts to come into play when you start to hit say 100G. You're going to want to shard this data so that you can buy cheaper machines with 1G DIMMs.

The sweet spot is probably 16 1G DIMMs right now, so you'd buy 7 16G boxes and dedicate them to memcached. Then you'd buy 1-4 DB boxes to store your database on disk.

You'd then buffer 100% of your database in memory.

This works well for a LOT of scalable sites like Digg, Livejournal, Facebook, etc.

I think the problem you're pointing to is that some engineers assert that with an LRU you only need to keep 10% of your data in memory and get 99.9% hit rate.

Of course with memcached you can put 100% of your data into memory and get a 100% hit rate.

Your mileage may vary with the LRU efficiency though.

.... of course maybe I didn't understand your original post.

5:40 PM

 
Blogger Steve Jenson said...

Yo Kevin!

I'm asserting that there is no difference (except operational and system overhead incurred with using memcached) in putting 100% of your data in RAM in memcached vs putting 100% of your data in your database's RAM.

Database federation is completely orthogonal to my argument. Yes, at some point you have to shard your data. Knowing when to shard and keeping the shards in memory are orthogonal.

9:04 PM

 
Blogger burtonator said...

yo steve....

OK. I guess I was confused by this statement then:

"They'll try to convince you that you just need another layer of LRU caches or some more memcached installations. They'll be a hero!"

... more memcached installations would yield 100% caching or the whole DB in memory.

12:03 AM

 
Blogger Steve Jenson said...

I'm confused. Why would you cache 100% of your database in memcached? Isn't that a giant red flag that you picked the wrong database or at least picked the wrong schema or are missing some important indices?

11:17 AM

 
Blogger burtonator said...

You'd cache 100% of the whole DB because of the items I posted before.

* most existing OSS DBs don't have distributed caching.

* you can buy cheaper boxes but more of them to scale your cache

You *could* of course scale your MySQL install by using partitioning but there aren't many OSS implementations of this now.

12:00 PM

 
Blogger Steve Jenson said...

Let's go back to the recipe example site I was describing earlier. I like it because it maps well to what a lot of contenty web companies are doing nowadays.

Let's say that you have so many reviews for recipes that they no longer fit on one machine. Here is a simple way to partition it.

You have a schema and you place it on each database. Hash(RecipeId) => Partition to store Reviews for that recipe. No fancy OSS library required; it's like 20 lines of code and no distributed caching is necessary in your database to do that.
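A minimal sketch of that Hash(RecipeId) => Partition idea, with assumptions called out: the partition count, the choice of CRC32 as the hash, and the `ReviewStore` class are all hypothetical, and a dict per partition stands in for a database per machine.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical number of database machines

def partition_for(recipe_id, num_partitions=NUM_PARTITIONS):
    """Map a recipe ID to a partition index in [0, num_partitions)."""
    # CRC32 gives the same answer on every machine and in every process.
    key = str(recipe_id).encode("utf-8")
    return zlib.crc32(key) % num_partitions

class ReviewStore:
    """In-memory stand-in: one dict per partition, not one database per machine."""

    def __init__(self, num_partitions=NUM_PARTITIONS):
        self.partitions = [{} for _ in range(num_partitions)]

    def _shard(self, recipe_id):
        return self.partitions[partition_for(recipe_id, len(self.partitions))]

    def add_review(self, recipe_id, text):
        self._shard(recipe_id).setdefault(recipe_id, []).append(text)

    def reviews_for(self, recipe_id):
        return list(self._shard(recipe_id).get(recipe_id, []))

store = ReviewStore()
store.add_review(42, "Too salty.")
store.add_review(42, "Would make again.")
print(store.reviews_for(42))
```

All reviews for one recipe land on the same partition, so a page of reviews is a single-partition read. The catch with plain modulo hashing is that changing the number of partitions remaps most keys, so pick the count with some headroom.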

12:23 PM

 
Blogger Amit said...

Memory can sometimes be cheaper than disk.

Memory has worse storage per dollar than disk. But it can have higher bandwidth per dollar than disk, especially for random access. You need to calculate what's constraining you. If it's bandwidth per item, then memory might be cheaper; if it's storage space, then disk is cheaper; if it's latency, then memory is probably cheaper than the array of disks you need to lower latency.

Flash drives are another interesting point in the tradeoff space.

10:26 PM

 
