Optimize data platforms with Amazon Redshift and Tableau
How to optimize your data and reporting platforms with Amazon Redshift and Tableau.
Sujay Mimani | December 10, 2015
We recently implemented a big data/data warehousing project for a major record label.
Big data projects are full of challenges—ranging from the sheer volume of data (literally billions of records) and data organization to effective data retrieval. One of the most challenging aspects of this project was reducing the overall dashboard load times.
Average dashboard load times were reduced from approximately three minutes to less than ten seconds.
Before we started the performance optimization exercise, dashboard load times were close to a minute—some as high as three minutes—with about twenty concurrent users. By the end, we reduced the average load time of the dashboards to less than ten seconds with the same number of concurrent users.
Here, I’ll discuss our areas of focus (or optimization domains) and our approach.
It’s worth noting the technology stack referred to here is Amazon Redshift, the backend, and Tableau, the front end. The key: to optimize the performance of data retrieval from Redshift through Tableau.
Our work involved three focus areas, or optimization domains: Amazon Redshift, Tableau, and what I call the “Handshake,” or the overlap between Redshift and Tableau.
Amazon Redshift. We focused on three key areas: data model, physical aspects of the data model, and environment.
- Data model: This is the foundation of any data warehousing project. Create a data model that helps you mitigate the challenge of data organization without compromising on performance. Star schema works well with Redshift, and we’ve seen great performance with Tableau on top of it. An added advantage of the star schema model: it helps separate out the dimensional data from the measures, making maintenance of slowly changing dimensional-data more efficient.
- Physical aspects of the data model: It’s all in the way you deploy the data model. Select optimum compression encoding for maximum performance, and select a distribution style that results in little or no skewing of data across nodes. If distributing on key skews the data by a huge factor then consider even distribution of the fact table with dimension tables distributed across all nodes. Identify the most optimal sort order for your dimension and fact tables. It’s important that you designate foreign keys in the fact table as sort keys. Every sort key defined on a foreign key will allow you to slice and dice the data across any dimensional attribute of that dimension, such as gender, user type, age group, etc. in the case of consumer dimension.
- Environment: Review the hardware configuration (including node type) of the cluster, keeping in mind overall data requirements, cost, and performance throughput. Perform multiple iterations of concurrency testing with various number of nodes and node types to benchmark performance.
Tableau. Here, we focused on server configuration, dashboard design, and data connection design:
- Server configuration: Determine the right number of VIZQL servers, Data Servers, and Cache servers to be deployed. We saw great performance with four VIZQL servers, four Data Servers, and four Cache servers to cater to the dashboard requirements with up to 40 concurrent users. The more servers, the better the ability to help cache more information, including faster rendering of views. We also discovered that publishing a workbook with live connection embedded resulted in better cache utilization than using a published data source.
- Dashboard design: Aim for four to five views within a workbook tab. This helps limit the number of queries being sent to the database at a given time—giving the database more resources to execute those queries simultaneously and improving performance.
- Data connection design: It should be sourced directly off the base data model in the database—here, a star schema. Do not custom code or denormalize the dimensional data completely, because it hurts performance when loading filters. Within quick filters, use “all values” where possible instead of “relevant values.” Using the latter causes the dimensional data to join with the fact table (which could have billions of records) to bring back only the relevant dimensional data, thereby hurting the performance of the overall dashboard load time. Using “all values” does not create the join with the huge fact table and returns a set of distinct dimensional attributes queried directly from the smaller dimension table.
The “Handshake.” The overlap between Redshift and Tableau, which I like to call the “Handshake,” focuses on query efficiency, concurrency setting, and aggregation.
- Query efficiency: Capture queries that take a long time to execute, and verify their “actual query execution” plans through the Redshift console. Identify the most optimal way to write the query and have Tableau generate it (which could be a simple or a complex task).
For example, we came across how Tableau was generating a “top 10” query. It was taking about three times longer than what an optimum query would have taken—but making Tableau generate the right query helped us boost performance by a factor of three.
Try these configuration changes through the TDC file to increase the overall throughput: Use a TDC file in the Tableau Server directory to turn off the unnecessary queries that Tableau fires to Redshift—this helps in limiting the number of queries to be run on the database; using this file you can also turn off cursors at the Redshift level.
- Concurrency setting: Modify the concurrency setting on both the Tableau and Redshift sides.
On the Tableau side, modify it to send more queries in parallel to Redshift. Tableau recommends a concurrency level of two, but we worked successfully with a concurrency level of ten. This means that each VIZQL server can send ten queries at a time to Redshift—or forty queries in total if you have four VIZQL servers—helping to increase throughput with multiple concurrent users.
On the Redshift side, modify the setting so that it can process more queries at any given time and reduce the average queue time of a query. Redshift recommends a maximum of 15 per queue (and maxes out at 50), but we worked successfully with up to 25. Maintaining separate queues for separate user types—such as dashboard users versus ad-hoc users—also helps process more queries at a time with dedicated database resources in terms of CPU and memory for each queue.
- Aggregation: Wherever possible, aggregate data to avoid churning huge amounts on the fly. Most importantly, aggregation should be done in a way that can be consumed most efficiently by Tableau. For example, aggregating across time dimension, such as week or month, or set time periods, such as the last six months, has a substantial impact on reducing query times. Bucket the long tail of dimensional values into ‘Other’ or roll up dimensions and create separate roll up dimension tables. Not only does this help further aggregate data on the fact table, it also helps limit the list of values displayed in quick filters.
An iterative approach is critical to the success of a thorough performance optimization exercise.
- Deploy the data model: Once the logical model is finalized, deploy the model in the database. Try various combinations of distributions, sort order, node types, and number of nodes. If needed, create a reduced dimension table from an existing huge dimension table – this helps in rolling up data in the fact tables, thereby, boosting performance; deploy aggregate tables to reduce “on the fly” aggregation by Tableau.
- Run dashboards against the data model: Identify longer-running queries and isolate them for further investigation. Also identify bottlenecks around Tableau server settings and concurrency on both Tableau and Redshift ends and try various options to identify the sweet spot.
- Optimize queries: Once you’ve identified the queries (in the previous step), they need to be scrutinized to determine the best way to be written. In many cases there are multiple ways to use Tableau to get the same results. Find the method that produces the most optimal queries. It’s key for the Data Architect and BI Architect to work closely together on understanding both optimizations and limitations. It’s also important to ensure that the tables the queries are going against are “vacuumed” and “analyzed” to update statistics. (Analyzed at a minimum if the table is a truncate and load.) Try aggregation in areas where query optimization isn’t possible.
- Baseline metrics: After performing a complete cycle of optimizations, baseline the metrics against single and multiple users. If numbers don’t look encouraging, then iterate again through the same approach—by considering aggregate tables, revisiting distribution style and sort order, revisiting queries, and so on.
To recap, focus on the following areas to get maximum performance benefits from such an ecosystem:
- Create a data model that thoroughly assesses business needs and architect the right foundation
- Deploy the model on your target database, keeping in mind the strengths and weaknesses of the backend
- Optimize the:
- Environment – Hardware configuration, concurrency settings, and data connection design
- Data structures – Physical aspects of the data model, query optimization, and aggregations
- Dashboard design – Front end design
Sujay Mimani is no longer with Slalom