Standing up for the real column stores
I am delighted to see Curt Monash has entered the debate on Mike Stonebraker’s *Real Column Store* blog post. I’ve already responded to Mike but Curt’s post brings up some other things worthy of consideration:
>There are some good things about [Mike Stonebraker’s] post, and some not-so-good. The worst paragraph is probably [the list of row-store vendors] which I question on two levels. First, the vendors cited don’t actually claim to be selling a column store; thus, the whole premise of Mike’s post is incorrect. Second, neither those vendors nor Mike are really correct. What Mike is really doing is differentiating, in his opinion,* good column stores from bad or mediocre ones.
I agree the debate needs to be clarified. At SAND we don’t see Oracle, DB2, or SQL Server as the competition. They’re great for transaction processing but we would argue just not great for analytic purposes. (In point of fact, we provide Nearline ILM extensions to help all of them work better.) We see vendors like Vertica, Infobright and now SAP’s HANA as our column store competition. Frankly we are glad they finally joined the party. SAND have been here for a long time and here’s some of what we’ve discovered…
Just as row-level locking was essential to on-line transaction processing (where it was required to give every user consistent access to data and to ensure both data integrity and performance in highly volatile environments), at SAND we believe you need an approach for highly concurrent user access to analytic data. You need to spread the intelligence to as many people as possible and make it as usable as possible.
SAND believes generation-based concurrency control (GBCC) is the right approach to ensure scalable performance for thousands of users. GBCC is a lock-less concurrency control scheme well suited to the read-mostly environments of data warehousing and analytics. Unlike in OLTP, conflicts in these environments are infrequent, so transactions proceed without incurring the overhead and loss of performance that lock-based management schemes impose. GBCC results in faster transaction throughput that is predictable and highly scalable, something a conventional lock-based concurrency control scheme cannot achieve in a read-mostly environment. GBCC also provides a full spectrum of transaction isolation levels, enabling applications to choose the isolation mode that best balances their requirements.
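To make the idea concrete, here is a minimal sketch of generation-style, lock-free reads in Python. It is not SAND's implementation: the `GenerationStore` class, its method names and the optimistic commit check are all assumptions made for illustration. The point is only that a reader pins an immutable generation and is never blocked, while a writer publishes a new generation rather than changing data in place.

```python
class GenerationStore:
    """Toy generation-based store (hypothetical, for illustration only)."""

    def __init__(self, rows):
        # Generation 0 is the initial, immutable snapshot of the data.
        self._generations = [tuple(rows)]

    def current(self):
        """Return the number of the newest published generation."""
        return len(self._generations) - 1

    def read(self, generation):
        """Readers take no locks: a published generation never changes."""
        return self._generations[generation]

    def commit(self, base_generation, new_rows):
        """Publish a new generation built on top of base_generation.

        Optimistic check: if another writer has published since we
        started, report a conflict (None) instead of blocking anyone.
        """
        if base_generation != self.current():
            return None
        self._generations.append(tuple(new_rows))
        return self.current()


store = GenerationStore([("east", 10), ("west", 20)])

# A long-running report pins the generation it started with.
reader_gen = store.current()

# Meanwhile a writer adds a row and publishes a new generation.
new_gen = store.commit(reader_gen, store.read(reader_gen) + (("north", 5),))

# The report still sees its original two rows; new queries see three.
report_rows = store.read(reader_gen)
fresh_rows = store.read(new_gen)

# A second writer that started from the stale generation gets a conflict.
stale_commit = store.commit(reader_gen, [("south", 1)])
```

In a read-mostly workload the conflict branch is rarely taken, which is why a scheme of this shape can avoid the bookkeeping cost of locks on the common path.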
Since SAND have been in business for over 20 years, we have innovative and creative ways to solve the complex problems that others in the data warehouse and analytic database market have yet to encounter. Getting a lot of users connected to SAND with great performance was one such challenge. We believe one day all analytic data warehousing databases will use GBCC. (And if someone has come up with other alternatives to deal with thousands of users while maintaining performance, we would be happy to hear about them.)
Performing operations on the smallest unit of work makes processing data more efficient and faster. While I’ll avoid the clichés of quantum physics (no semi-dead cat here, or is it semi-alive?), it does help to use analogies to understand the power of what’s going on. If we need to move a boulder from New York to San Francisco, a lot of logistics, transportation, cost and energy are expended to do so. If we were moving a grain of sand, our options would be legion, including flying it from coast to coast. While the boulder may be made up of the equivalent of billions of grains of sand, if I only need to move one grain, why should I expend the effort of moving the entire boulder? It’s inefficient.
BVAT technology allows database algorithms to be expressed as operations on simple integer data rather than on much less efficient character string data or resource-hogging floating-point data. (Grains of sand rather than boulders.) Processor execution units are at their core integer processors, and therefore superior performance is achieved when database algorithms are written as operations on integer data. SAND’s BIT-vector technology provides a highly efficient means of representing and manipulating sets of integer data. The combination of set-oriented processing and integer-based algorithms enables extremely high database query performance.
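The general bit-vector idea can be sketched in a few lines of Python. This is a generic illustration of bitmap-style set processing, not SAND's BVAT code; the helper names `build_bit_vectors` and `rows_of` and the sample columns are invented for the example. Each distinct column value gets an integer whose bit *i* is set when row *i* holds that value, so predicates become integer AND and OR operations rather than string comparisons.

```python
def build_bit_vectors(column):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    vectors = {}
    for row, value in enumerate(column):
        vectors[value] = vectors.get(value, 0) | (1 << row)
    return vectors


def rows_of(bits):
    """Expand a bit vector back into an ascending list of row numbers."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Two hypothetical columns of the same five-row table.
region = ["east", "west", "east", "north", "west"]
status = ["open", "open", "closed", "open", "closed"]

region_bits = build_bit_vectors(region)
status_bits = build_bit_vectors(status)

# WHERE region = 'east' AND status = 'open' is a single integer AND.
east_open = region_bits["east"] & status_bits["open"]

# WHERE region IN ('east', 'west') is a single integer OR.
east_or_west = region_bits["east"] | region_bits["west"]

print(rows_of(east_open))     # [0]
print(rows_of(east_or_west))  # [0, 1, 2, 4]
```

The strings are touched once, when the vectors are built. After that, every filter in the example works on whole sets of rows at a time using plain integer arithmetic.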
Those are just two examples but if we’re discussing the real column stores, let’s discuss the real column stores. Again, I fully admit bias but that doesn’t mean we’re wrong.