In recent years, I’ve been seeing clients less interested in building data warehouses, and more interested in analytics tools—like Tableau and Qlik—or big data solutions like data lakes.
The reasons people are steering away from data warehouses are often the same. They tried building data warehouses in the past, and it took too long—and was too expensive. The requirements often changed by the time the warehouse was finally complete, and it was too time-consuming and complicated to make changes at that point.
Business users are particularly frustrated that every time they want a change to the data warehouse, they have to file change requests and wait for approvals. A change can take three months to complete, by which point users have often already pulled the data themselves and built a solution in a spreadsheet. This recreates the very problem the data warehouse was created to solve: with a plethora of spreadsheet solutions scattered across local drives throughout the company, the data warehouse languishes as an unused, expensive experiment gone wrong.
Modern analytics tools try to get around this problem by providing feature-rich self-service tools. Tableau, for example, provides interactive data visualizations that enable users to easily view and analyze their data without the help of an IT team. But before the data can be placed into a usable dashboard, it needs to be massaged, manipulated, and merged—a specialized skill.
Data lakes, on the other hand, are designed to take in all your data in whatever format you desire. Getting the data into a data lake is pretty straightforward, as long as you understand the technology for loading it. But getting it out for analysis requires fairly specialized tools and often a degree of programming skill. Data lakes are very useful for large, unstructured data sets, but they don't solve the entire problem most businesses have: the need to combine various structured datasets into a common, business-friendly structure for easy reporting.
So, if modern analytics tools don't address data modeling and blending, and data lakes are overkill for most companies, what's the solution?
Whatever it is, it needs a fast turnaround time between new feature releases: four weeks at most, preferably two weeks or less. It needs to center on rapid prototyping and foster strong collaboration between IT and the business so insights are gained quickly. It can't be overly bureaucratic. It needs to be responsive to users' needs. And it needs to solve the problem of combining data from disparate sources and presenting it fully modeled and ready for analysis.
The solution is still a data warehouse—but a data warehouse built using an agile methodology.
For years, I built warehouses using a more waterfall approach, and I encountered the same problems every time. User requirements would change mid-project. Change requests and change orders were the order of the day. Testing was often left until the end of the project, meaning any changes the users needed could require entire rewrites of code. And delivery was often a mad rush as we tried to fit in all the changes at the end. Although we would usually get the data warehouse built within the timeframe, I always felt there had to be a better, more efficient approach for us and our users.
In the agile methodology, the emphasis is on collaboration and rapid prototyping. I now focus on one very small area and get something built as fast as possible. That way I can get back to the user quickly with results, and they can let me know if we’re going in the right direction. If we’re on the right track, I polish it off and deliver that portion to them right away. They get something to use right at the beginning of the project and start getting used to the warehouse as soon as possible. They also start thinking more broadly about what they want, and start providing me with richer user requirements as they realize the value of what we’re building. If we’re not going in the right direction, it's still early enough in the project that it’s easy to change course.
A helpful tool
One very important aspect of the agile data warehouse is choosing the right tool. While you can be agile with SSIS or Informatica, they’re really designed for the old way of building data warehouses. My tool of choice is now WhereScape RED. This ETL tool is designed for rapid prototyping and getting results out the door in days—not weeks. In fact, I’ve managed to complete an entire star schema in WhereScape in hours.
Not only is WhereScape quick to develop in, but it also automatically documents your code, saving time on one of the most hated tasks of any development project. It also derives your entity relationship diagram and provides full end-to-end lineage analysis.
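To make the star schema concrete, here is a minimal, illustrative sketch of the kind of dimensional model a tool like WhereScape RED generates: descriptive dimension tables surrounding a central fact table of measures. The table and column names (dim_customer, dim_date, fact_sales) and the sample data are hypothetical, and the schema is built by hand with Python's sqlite3 purely for demonstration; it is not WhereScape-generated code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold descriptive, business-friendly attributes.
cur.execute("""CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT,
    region        TEXT)""")
cur.execute("""CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,
    full_date TEXT,
    year      INTEGER,
    month     INTEGER)""")

# The fact table holds numeric measures plus foreign keys into each dimension.
cur.execute("""CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    amount       REAL)""")

# Hypothetical sample rows.
cur.execute("INSERT INTO dim_customer VALUES (1, 'Acme Ltd', 'West')")
cur.execute("INSERT INTO dim_date VALUES (20240115, '2024-01-15', 2024, 1)")
cur.execute("INSERT INTO fact_sales VALUES (1, 20240115, 250.0)")

# A typical report query: join the fact to a dimension, aggregate by attribute.
row = cur.execute("""
    SELECT c.region, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    GROUP BY c.region""").fetchone()
print(row)  # ('West', 250.0)
```

This shape is what makes dashboards easy to build downstream: the joins are always fact-to-dimension on surrogate keys, so report writers never have to re-derive how the source systems fit together.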
Once you're focused on rapid prototyping and short delivery cycles, and using a tool that gets results fast, users' impression of the data warehouse greatly improves. It changes from an overly expensive chore to a convenient place to get data for analysis.
Your dashboards become quicker to build, because your dashboard developers no longer have to figure out how to combine various data sources and massage the data before bringing it into the dashboard. Data trust also improves—everyone is pulling their data from a single source of truth in the organization.
Any additional requirements are built into a backlog, prioritized, and delivered in short sprint cycles. Users are always kept abreast of where their request sits in the queue and when it will be delivered, with the aim of delivering requests in two-week cycles.
With the combination of the right approach, agile methodology, and the right tool—WhereScape RED—I believe the data warehouse is far from dead and will continue to be the foundation of data analytics for years to come.
Sean Hayward originally published this post on LinkedIn Pulse.