Challenging the column store criteria
*The following is a response to Mike Stonebraker’s recent blog post,* Will the Real Column Stores Please Stand Up? *I originally left it as a comment on that post, but so far it has not been published. I can’t imagine they have a very long moderation queue or that Mike is shy about getting into the debate. I know that we at SAND are not scared of calling out our competition. Hopefully Mike gets my response to his post up soon. Until then you can read it below, and Mike is more than welcome to respond here.*
Reading the recent blog post on 6 criteria for a column store database, I felt like we were suddenly back in the 1980s. (I could feel the scratchy wool of my Prince of Wales double-breasted suit as if it were yesterday.) The post felt like a callback to the Ted Codd articles of yore, and I agree with Seth Grimes that vendor-driven “6 things we all must be” treatises need to be seen as marketing projects and not independent analysis.
While openly declaring myself COO of a rival company, I’d like to respond to the post while steering clear of marketing and trying to offer a balanced perspective. It’s not that I think there’s anything wrong per se in what was said. However, I do think there were a lot of omissions and a lot left to interpretation. Obviously we at SAND believe we have developed the right way to deliver column store and our customers agree. That makes us biased. Being biased doesn’t mean we’re wrong though; just biased.
First, defining the competition as Oracle, Greenplum and Aster Data is like deciding to compete in a synchronized swimming gala against Michael Vick, Peyton Manning and Ed Reed. It may be sport, Jim, but not as we know it. Row-based vendors slapping “column here, column here, column here too, oh and here” labels on their products are like pigs with lipstick in their handbags. If we are going to look at what makes a column store, we should compare ourselves with column store vendors. I am neither confused nor afraid to say who they are: Vertica, ParAccel, Infobright and of course SAND. So let’s get in the sprinting race with sprinters and not compare our 100m sprinting performance to that of NFL players. We do different things, and to win we do them differently. (See Curt Monash’s Columnar compression vs. column storage.)
Second, column store technology, despite being commercially available for over 25 years, is still not at full maturity. So, rather than trying to create a superset and identify the 25 things that will make a column store the best little column store it can be, I will address Mike’s comments and endeavor to open the debate.
>IO-1 (basic column store) – Every storage block contains data from only ONE column
This is great if you are just starting out developing a column store. Another way to do this is to go even more granular, and in the process use even less I/O. It is possible to separate the column information from the actual values by using domain tokenisation, as we do at SAND. Queries that do not need to work with the actual values are then satisfied using only the tokens in the column, leading to considerably less I/O than when every storage block simply contains the data values of one column.
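To make the idea concrete, here is a minimal Python sketch of generic dictionary-style tokenisation. It is an illustration under simple assumptions, not SAND's actual implementation; the names `tokenize`, `count_distinct` and `group_counts`, and the sample column, are hypothetical. The point is that some queries can be answered from the token vector alone, without ever reading the domain of actual values.

```python
from collections import Counter


def tokenize(values):
    """Split a column into a domain (distinct values) and a token vector."""
    domain = []   # token -> actual value
    index = {}    # actual value -> token
    tokens = []   # one small integer per row
    for value in values:
        if value not in index:
            index[value] = len(domain)
            domain.append(value)
        tokens.append(index[value])
    return domain, tokens


def count_distinct(tokens):
    """COUNT(DISTINCT col): needs only the tokens, never the values."""
    return len(set(tokens))


def group_counts(tokens):
    """GROUP BY col, COUNT(*): grouping happens on tokens alone."""
    return Counter(tokens)


# Hypothetical sample column.
city = ["Boston", "Montreal", "Boston", "London", "Montreal", "Boston"]
domain, tokens = tokenize(city)

print(count_distinct(tokens))
# The domain is consulted only at the end, to label the result.
for token, count in group_counts(tokens).items():
    print(domain[token], count)
```

In this sketch the value strings are touched once at load time and once when labelling the final result; everything in between works on small integers.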
>IO-2: Aggressive compression
This subject is very broad, a bit like asking your waiter, “Is the steak any good here?” Since the steak is usually the most expensive item on the menu, I always find the answer to that question is “Hell yes”. Well, do we want aggressive compression? What’s the alternative? Massive, unnecessary data overhead? I know I don’t want that.
Starting with a row-based system is like trying to compress concrete, so column-based compression is definitely the way to go. However, there are different routes to achieving aggressive compression. At SAND these include employing compressed and encoded BIT vectors, using multiple token sizes (1-, 2-, 4-, and 8-byte tokens), quasi-binary domain (QBD) optimization, LZW compression of DOMAIN value-sets, and dynamic compression of MAP vectors. Again, column store is the concept, but the implementation matters a great deal.
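As one small illustration of the "multiple token sizes" route, the sketch below picks the narrowest of 1, 2, 4 or 8 bytes that can represent every token in a domain. It is a simplified assumption about how such a choice could be made, not a description of SAND's internals; `token_width` and `pack_tokens` are hypothetical names.

```python
def token_width(cardinality):
    """Smallest of 1, 2, 4 or 8 bytes able to hold tokens 0..cardinality-1."""
    if cardinality < 1:
        raise ValueError("a domain needs at least one value")
    for width in (1, 2, 4, 8):
        if cardinality <= 1 << (8 * width):
            return width
    raise ValueError("domain too large for an 8-byte token")


def pack_tokens(tokens, cardinality):
    """Store each token in the narrowest fixed width the domain allows."""
    width = token_width(cardinality)
    packed = b"".join(t.to_bytes(width, "little") for t in tokens)
    return packed, width


# A column with only three distinct values needs one byte per row,
# however long the underlying values are.
packed, width = pack_tokens([0, 1, 0, 2], cardinality=3)
print(width, len(packed))
```

A low-cardinality column (country codes, status flags) therefore costs one byte per row before any further compression is applied, while a high-cardinality column automatically moves to wider tokens.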
(See our SAND Analytic Database Performance for more.)
>CPU-5: Executor runs on compressed data
Because compression has been retrofitted onto most DBMSs, they gain none of the benefits of working “within” the confines of the CPU’s L1, L2, and L3 data caches. Column stores should have cache-aware algorithms developed explicitly to exploit the greater memory bandwidth of the L2 and L3 data caches.
>CPU-6: Executor can process columns that are key sequence or entry sequence
The argument for reducing CPU activity by having the individual database pages contain sorted values is correct, and sorting the data during the load process is important, but for what purpose? When employing DOMAINs (in SAND we employ tokens in lieu of values where possible), our purpose for the sort is token assignment rather than placement on a database page. As with compression, it’s not just what you call it but how you actually do it.
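A hedged sketch of why sorting for token assignment can matter: if tokens are handed out in sorted value order, token order mirrors value order, so a range predicate can be translated once into a token range and then evaluated on tokens alone. This is a generic order-preserving dictionary, offered as an assumption about the general technique rather than as SAND's algorithm; `tokenize_sorted` and `rows_between` are hypothetical names.

```python
from bisect import bisect_left, bisect_right


def tokenize_sorted(values):
    """Assign tokens in sorted value order, so token order == value order."""
    domain = sorted(set(values))
    index = {value: token for token, value in enumerate(domain)}
    return domain, [index[value] for value in values]


def rows_between(domain, tokens, low, high):
    """Row positions with low <= value <= high, comparing tokens only."""
    first = bisect_left(domain, low)    # first token with value >= low
    last = bisect_right(domain, high)   # one past the last token <= high
    return [row for row, token in enumerate(tokens) if first <= token < last]


# Hypothetical sample column; rows stay in entry sequence.
prices = [30, 10, 20, 10, 40]
domain, tokens = tokenize_sorted(prices)
print(rows_between(domain, tokens, 15, 30))
```

The rows themselves are never reordered here. Only the domain is sorted, and the two binary searches against it are the only point at which actual values are compared.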
A lot of the above 6 criteria are based on deep technical aspects. A database is about getting data into it then getting it out, but not in abstraction. I would add users into this mix. How do you get multiple users to connect to the database? How does it scale? What does it scale for? Data types? Data volume? Data volatility? How fast do we need to move the data through the database? What is the mix of queries? Should zero indexes be a requirement? Do we need domain based organization allowing for tokenized operation? What about specialized join algorithms (semi-joins, entity joins and match joins)? Generation Based Concurrency Control (GBCC)? How about full spectrum transaction isolation modes?
There are a lot of technical ways to approach these problems. We need to let the analysts and the market work out what they see as the key criteria. (At SAND we believe we know what they are; we went out and built them.)
SAND meets all six criteria for a true column store listed in the post. However, there is more to Vertica than these 6 criteria, and obviously more to SAND.
Column store is the future for data warehousing, and no amount of lipstick on the row-based pig will fix that. Rather than trying to mix sprinting and NFL, however, I’d welcome an open debate between peers. SAND is not concerned about stating who our competition is in the column store market (Vertica and ParAccel, for example) because we think our technology for column stores is superior. Of course we do. We’re biased, but we’re not wrong.