Nearline 2.0 and Advanced Data Modeling
In my last post, I discussed the “Quick Check” method for identifying the benefits that a Nearline 2.0 implementation can deliver in the areas of operational performance, SLAs and TCO. Certainly, it is easy to see how it would be preferable to manage a database that is 1 TB rather than 20 TB in size, particularly when it comes to critical tasks like backup and recovery, disaster recovery, off-site backups and historical analytics. Today, however, I want to focus on another benefit of Nearline 2.0 that is less obvious but still very important: data modeling flexibility.
Nearline 2.0 permits organizations to keep a much greater amount of useful data accessible, without requiring compromises on SLAs, TCO and reporting performance. This in turn makes a variety of flexible data modeling options available.
The Physical Table Partitioning Model
The first of these new data modeling options is based on physical table partitioning. The largest tables in a data warehouse or data mart can be physically divided between an online component and a Nearline counterpart. This allows the existing data model to be maintained while introducing a “right-sizing” concept: only the regularly accessed data is kept online, and any data that does not require such an expensive, hard-to-manage environment is moved into the Nearline 2.0 solution. A typical rule of thumb for defining the partition boundary is the 90-day aging principle: any static data older than 90 days is migrated from the online warehouse to the Nearline 2.0 repository.
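To make the aging rule concrete, here is a minimal sketch of what a nightly archiving job could look like, assuming a generic SQL interface on both sides. The table and column names (sales_detail, sales_detail_nearline, event_date) and the two database connections are purely illustrative assumptions, not part of any specific Nearline 2.0 API.

```python
from datetime import date, timedelta

# Hypothetical sketch of a nightly "aging" job implementing the 90-day rule.
# Table/column names, the qmark parameter style, and the two connections
# are assumptions made for illustration only.

AGE_LIMIT_DAYS = 90

def archive_static_rows(online_conn, nearline_conn):
    """Move detail rows older than the aging boundary from the online
    warehouse partition to the Nearline repository."""
    cutoff = date.today() - timedelta(days=AGE_LIMIT_DAYS)

    online_cur = online_conn.cursor()
    nearline_cur = nearline_conn.cursor()

    # 1. Read the static rows that have crossed the aging boundary.
    online_cur.execute(
        "SELECT id, event_date, payload FROM sales_detail WHERE event_date < ?",
        (cutoff,),
    )
    rows = online_cur.fetchall()

    # 2. Copy them into the Nearline counterpart of the same table.
    nearline_cur.executemany(
        "INSERT INTO sales_detail_nearline (id, event_date, payload) VALUES (?, ?, ?)",
        rows,
    )
    nearline_conn.commit()

    # 3. Only then remove them from the expensive online partition.
    online_cur.execute("DELETE FROM sales_detail WHERE event_date < ?", (cutoff,))
    online_conn.commit()
```

A production job would of course move data in bulk at the partition level rather than row by row, but the copy-then-delete sequence captures the basic idea of keeping only “hot” data online.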
Now, many forms of enterprise data, such as CDR, POS, Web, Proxy or Log data, are static by definition, and are furthermore usually the main sources of data warehouse growth. This is very good news, because it means that as soon as the data is captured, it can be moved to the Nearline 2.0 repository (in fact, it is conceivable that this kind of data could be fed directly to Nearline 2.0 from the source system – but that is a topic for another post). Because of the large volumes involved, this kind of detail data has usually been aggregated at one or more levels in the enterprise data warehouse. Users generally query the summary tables to identify trends, and drill down into the detail records for a specific range only when a particular need or opportunity is identified. This data access technique is well known and has been in use for quite some time.
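As a rough illustration of this access pattern, the sketch below queries an assumed online summary table first and drills into the Nearline detail only for the range that stands out. The table names, columns, and the “worst day” selection rule are all hypothetical.

```python
def find_trend_then_drill(online_conn, nearline_conn, product_id):
    """Query the online summary table first, then drill into the
    Nearline detail rows only for the range that looks interesting."""
    cur = online_conn.cursor()
    cur.execute(
        "SELECT sales_date, total_amount "
        "FROM daily_sales_summary WHERE product_id = ? ORDER BY sales_date",
        (product_id,),
    )
    summary = cur.fetchall()

    # Pick the weakest day from the summary (any business rule could go here).
    worst_day = min(summary, key=lambda row: row[1])[0]

    # Drill down: fetch only that day's detail from the Nearline repository.
    detail_cur = nearline_conn.cursor()
    detail_cur.execute(
        "SELECT * FROM sales_detail_nearline "
        "WHERE product_id = ? AND event_date = ?",
        (product_id, worst_day),
    )
    return summary, detail_cur.fetchall()
```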
The Online Summary Table Model
This leads me to the second novel design option offered by Nearline 2.0: the ability to store all static detail data in the Nearline repository, use this as the basis for building online summary tables, and then quickly drill down to the detail in the Nearline 2.0 repository when required. More specifically, the Nearline 2.0 solution can be used to feed the online system’s summary tables directly. The advantage of this implementation is that it substantially reduces the size of the online database, optimizes its performance, and permits trend analysis over very long periods. This is particularly useful when looking for emerging trends (positive or negative) related to specific products or offerings, because it gives managers the chance to analyze and respond to issues and opportunities within a realistic time frame. A recent press release from Experian provides an excellent example of how business value can be produced by expert analysts using long-term historical data and the right set of tools.
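One way to picture this arrangement: the aggregates live online and are rebuilt directly from the Nearline detail data. The sketch below reuses the same illustrative table names as above and performs a simple full refresh for clarity; it is not a description of any particular product’s loading mechanism.

```python
def refresh_online_summary(nearline_conn, online_conn):
    """Rebuild the online daily summary directly from Nearline detail data,
    so the online database only carries the aggregates."""
    cur = nearline_conn.cursor()
    cur.execute(
        "SELECT product_id, event_date, SUM(amount), COUNT(*) "
        "FROM sales_detail_nearline GROUP BY product_id, event_date"
    )
    aggregates = cur.fetchall()

    out = online_conn.cursor()
    out.execute("DELETE FROM daily_sales_summary")  # full refresh for simplicity
    out.executemany(
        "INSERT INTO daily_sales_summary "
        "(product_id, sales_date, total_amount, txn_count) VALUES (?, ?, ?, ?)",
        aggregates,
    )
    online_conn.commit()
```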
Some organizations are already building this type of data hierarchy, using Data Marts or analytic cubes fed by the main Data Warehouse. I call this kind of architecture “data pipelining”. Nearline 2.0 can play an important role in such an implementation, since its repository can be shared among all the analytic platforms. This not only reduces data duplication, management and operational overhead, and the need for additional hardware and software, but also relieves pressure on batch windows and lowers the risk of data getting out of sync. Furthermore, this implementation can assist organizations with data governance and Master Data Management while improving overall data quality.
The Just-In-Case Data Model
Another important data modeling option offered by Nearline 2.0 relates to what we can call “just-in-case” data. In many cases, certain kinds of data are maintained outside the warehouse just in case an analyst requires ad hoc access to them for a specific study. Sometimes, for convenience, this “exceptional” data is stored in the data warehouse. However, keeping this data in an expensive storage and software environment, or storing it on tape or inexpensive disks as independent files, can create a data management nightmare. At the same time, studies demonstrate that a very large portion of the costs associated with ad hoc analysis is concentrated in the data preparation phase. As part of this phase, the analyst needs to “shop” for the just-in-case data to be analyzed, meaning that he or she needs to find, “slice”, clean and transform it, and then use it to build a temporary analytic platform, sometimes known as an Exploration Warehouse or “Exploration Mart”.
Nearline 2.0 can play a very important role in such a scenario. Just-in-case data can be stored in the Nearline repository, and analysts can then query it directly using standard SQL-based front-end tools to extract, slice and prepare the data for analytic use. Since much less time is spent on data preparation, far more time is available for data analysis — and there is no impact on the performance of the main reporting system. This acceleration of the data preparation phase results from the availability of a central catalog describing all the available data. The Nearline repository can be used to directly feed the expert’s preferred analytic platform, generally resulting in a substantial improvement in analyst productivity. Analysts can focus on executing their analyses, and on bringing more value to the enterprise, rather than on struggling to get access to clean and reliable data.
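As a rough sketch of what this “shopping” step might look like when the Nearline repository is reachable through standard SQL, the example below slices the required subset with a query and loads it into a lightweight local exploration mart. SQLite is used here purely for illustration, and the query, table names and campaign filter in the usage note are assumptions.

```python
import sqlite3

def build_exploration_mart(nearline_conn, slice_sql, slice_params, mart_path):
    """Extract a slice of just-in-case data from the Nearline repository
    and load it into a local exploration mart for ad hoc analysis."""
    cur = nearline_conn.cursor()
    cur.execute(slice_sql, slice_params)
    columns = [d[0] for d in cur.description]
    rows = cur.fetchall()

    # Create a throwaway analytic table with the same column names.
    mart = sqlite3.connect(mart_path)
    col_defs = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    mart.execute(f"CREATE TABLE IF NOT EXISTS study_data ({col_defs})")
    mart.executemany(f"INSERT INTO study_data VALUES ({placeholders})", rows)
    mart.commit()
    return mart

# Hypothetical usage: pull web log records for one campaign into a local study file.
# mart = build_exploration_mart(
#     nearline_conn,
#     "SELECT * FROM web_log_nearline WHERE campaign_id = ? AND hit_date >= ?",
#     ("SPRING_PROMO", "2007-01-01"),
#     "campaign_study.db",
# )
```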
At SAND, we have developed a new analytical offering based on this approach: “Database on Demand” or DBOD. Using DBOD, an analyst can specify their data requirements and have a 100% indexed RDBMS built directly from these specifications, ready for querying by front-end tools such as SAS, SPSS, Business Objects, MicroStrategy and so on. I’ll touch on that more in a future post.
Richard Grondin