In the beginning there was the ODS, or operational data store: typically a collection of tables in a relational database. The ODS contained the company's raw data, which was used to run reports on past transactions. Then came the data marts (DMs), essentially snapshots of the ODS. The BI analysts, who were highly skilled statisticians and mathematicians, ran queries against the DM, extracted data from it and performed their analyses. But all the analyses in the world could not deliver critical information in time to help management make better operational decisions, improve customer service and help the bottom line.
The good news is that DMs and Business Intelligence (BI) have evolved significantly over the past decade. It is no longer acceptable to have DMs that are just time-stamped snapshots of the ODS. The raw data in the ODS has to be cleansed, massaged and distilled...and I mean distilled...into useful information stored in formats such as cubes, dimensional entities and aggregates in columnar structures, at various levels of granularity and recency, so that analysts can visualize, manipulate and study the data from different angles.
What has changed in BI in the past decade?
Data Frequency - real-time or close to real-time
Daily or weekly snapshots of the ODS are no longer acceptable for making decisions. We need intra-day data marts, as close to real-time as possible, that contain only the subset of data needed to surface recent customer transactions and activities very quickly. Intra-day snapshots of data are trickle-fed through data filters into real-time data marts, allowing operational personnel to analyze events on the day they occur. As you can imagine, the volume of data being stored increases substantially. The main ODS and OLAP data mart now hold tens to hundreds of terabytes (even petabytes!) to support all forms of BI. Note also that the data for operational BI must be kept at the lowest level of detail. All this means that not only does the ODS and OLAP DM infrastructure have to handle faster, more frequent loads of data, but seamless scalability is mandatory, whether for storing and processing increased volumes of data or for maintaining the integrity of the environment (backups, failovers, etc.).
Performance - instant gratification
While performance in traditional BI environments has always been important, it is now critical that responses to queries against the intra-day real-time data mart be almost instantaneous. In many instances, requests originate from, and responses are received on, handheld wireless devices. Communication is no longer one-way: users can send back comments and feedback containing valuable information (unfortunately, often as unstructured text) that must be captured, parsed, stored and fed into the ETL process. Add to this the fact that the environment must still support traditional BI users. A mixed environment with the ODS, the traditional data mart (OLAP) and the real-time data mart means the software must be able to prioritize and route queries, not only according to their importance to the enterprise but also based on their data and response requirements.
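The prioritize-and-route idea can be sketched roughly as follows. This is only an illustration under assumptions: the store names, the `Query` fields and the one-day recency window are hypothetical and are not taken from any particular BI product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical store names; the article only names the three kinds of store.
REALTIME_MART = "realtime_mart"
OLAP_MART = "olap_mart"
ODS = "ods"


@dataclass
class Query:
    name: str
    priority: int                   # assumption: lower number = more important
    oldest_data_needed: datetime    # how far back the query has to look
    needs_raw_detail: bool = False  # True if aggregates are not enough


def route(query, now, recency_window=timedelta(days=1)):
    """Pick a store based on the data the query needs."""
    if now - query.oldest_data_needed <= recency_window:
        return REALTIME_MART        # everything it needs is intra-day
    if query.needs_raw_detail:
        return ODS                  # historical and unaggregated
    return OLAP_MART                # historical, aggregates are fine


def schedule(queries, now):
    """Return (query name, target store) pairs, most important first."""
    ordered = sorted(queries, key=lambda q: q.priority)
    return [(q.name, route(q, now)) for q in ordered]
```

A real router would also weigh response-time requirements and current load on each store; this sketch separates only the two concerns named above, importance and data requirements.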
Number of users - scalability
There was a time when Business Intelligence was the domain of only statisticians and mathematicians, who extracted data from historical snapshots of the ODS to perform their magic. Those days are gone. Now BI and analytics are used by business users, and even customers want to get into the act, since they want access to their transaction history and past experiences at a particular store or web site. Here's another catch: not all users are human. Web bots roam in and around the data marts trying to detect patterns, run simulations and project outcomes. All of this puts a tremendous strain on the network, servers and infrastructure, which must be horizontally and vertically scalable to support the onslaught of requests, queries and analytics being thrown at them.
Unstructured Data - Facebook, YouTube, blogs and other social media
Unstructured data is increasingly being used to make critical decisions. The ETL process has to be retooled to handle different forms of textual, audio and video data. The technical challenges are steep, since most BI environments are simply not set up to handle unstructured data. The space requirements of the ODS and data marts are expected to grow exponentially as a result, and the processing needed to parse and convert unstructured data into a structured format puts a tremendous strain on the infrastructure, which in turn has a direct impact on scalability and performance.
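As a minimal sketch of what "parse and convert into a structured format" can mean for the simplest case, free-text customer feedback, the snippet below turns a comment into a fixed-shape row that an ETL step could load. The keyword lists, the `FeedbackRow` fields and the order-number pattern are all assumptions made for illustration; real pipelines use far richer text analytics, and audio or video would need entirely different tooling.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword lists, for illustration only.
POSITIVE = {"great", "good", "love", "fast", "helpful"}
NEGATIVE = {"bad", "slow", "broken", "late", "rude"}

# Assumed convention: customers write something like "order #1234".
ORDER_ID = re.compile(r"\border\s*#?\s*(\d+)", re.IGNORECASE)


@dataclass
class FeedbackRow:
    customer_id: str
    order_id: Optional[int]  # None when no order number is mentioned
    sentiment: int           # positive keyword count minus negative count
    word_count: int


def parse_feedback(customer_id, text):
    """Convert one free-text comment into a structured row."""
    words = re.findall(r"[a-z']+", text.lower())
    match = ORDER_ID.search(text)
    sentiment = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    order_id = int(match.group(1)) if match else None
    return FeedbackRow(customer_id, order_id, sentiment, len(words))
```

Even this toy version shows why the processing cost grows: every comment has to be tokenized and scanned before a single row reaches the data mart.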
The following diagram illustrates a typical BI system of today.
While the "Daily ETL" process takes in data from all data sources to perform incremental updates on the ODS and OLAP data mart, the "Real-time ETL", also called an intra-day ETL, accepts transactions through a filter configured to pass only transactions that meet the "recency" criterion. This essentially limits the data to the transactions and customer behavior needed to make decisions based on events occurring during the day. Unstructured data coming back from a customer's handheld device is staged in a daily customer tracking DB and fed into the "Daily ETL" process along with data from all the regular data sources. It is also fed into the "Real-time ETL" process so that it is immediately available in the real-time data mart.
Data marts and BI systems of the future must be completely rules- and metadata-driven, capable of self-learning, and able to modify their behavior based on usage patterns and data types that are as yet unknown.
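The two feeds described above can be sketched as a single pass over incoming transactions. This is a simplified illustration: the dictionary layout with a `timestamp` key and the "same calendar day" recency rule are assumptions, since the actual criterion would be whatever the filter is configured with.

```python
from datetime import datetime


def same_day(txn, now):
    """Assumed recency criterion: the transaction happened today."""
    return txn["timestamp"].date() == now.date()


def split_feeds(transactions, now, is_recent=same_day):
    """Return (daily_batch, realtime_batch).

    Every transaction goes to the daily ETL; only those passing the
    recency filter are also sent to the real-time (intra-day) ETL.
    """
    daily_batch = []
    realtime_batch = []
    for txn in transactions:
        daily_batch.append(txn)
        if is_recent(txn, now):
            realtime_batch.append(txn)
    return daily_batch, realtime_batch
```

Passing the criterion in as `is_recent` keeps the filter configurable, so a different window (say, the last four hours) can be swapped in without touching the feed logic.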