Big Data Pitfalls for High-Growth Startups

As high-growth startups acquire customers at a rapid clip, their data volumes and complexity grow just as fast. At this stage, companies are aggregating more information about their customers and operations, and the flaws of organically developed ETL (Extract, Transform, and Load) pipelines and data warehouse schemas start causing real problems: unbearable query times, stale data, and skyrocketing infrastructure costs. The following are some of the common pitfalls I’ve observed that lead to these issues.

1. OLTP != OLAP

There are two primary database use cases, and different databases are optimized for each.

Online Transactional Processing (OLTP) is where a database is used primarily for transactional operations (think application databases), such as performing primary key lookups or doing simple filters and joins to return data to end users. Data is stored row-wise on disk, which makes it very efficient to look up particular objects or fetch lists of objects. These are the technologies we know and love, such as PostgreSQL and MySQL.

Online Analytics Processing (OLAP), on the other hand, is where a database is used primarily for analytical use cases, such as performing aggregations (SUM, AVG, GROUP, SORT, etc.) on extremely large datasets. Data is stored column-wise on disk, which enables (1) more efficient aggregations and calculations on a particular column, and (2) increased data compression to reduce storage costs. Many of the databases used for OLAP are considered data warehouses, and include technologies such as Redshift, HBase, Snowflake, and Greenplum.
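For intuition, here’s a minimal toy sketch (plain Python with made-up data) of the same “orders” table laid out row-wise versus column-wise, and why the columnar layout makes an aggregation cheap:

```python
# Toy illustration of row-oriented vs. column-oriented storage.
# The "orders" table has three columns: id, customer_id, amount.

# OLTP-style, row-oriented: each record is stored together,
# which makes single-record lookups by primary key cheap.
rows = {
    1: {"customer_id": 42, "amount": 19.99},
    2: {"customer_id": 7,  "amount": 5.00},
    3: {"customer_id": 42, "amount": 12.50},
}
order = rows[2]  # primary-key lookup touches one record

# OLAP-style, column-oriented: each column is stored contiguously,
# so an aggregation only scans the one column it needs, and the
# repetitive values in a column compress well.
columns = {
    "id":          [1, 2, 3],
    "customer_id": [42, 7, 42],
    "amount":      [19.99, 5.00, 12.50],
}
total_revenue = sum(columns["amount"])  # scans a single column
```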

This is probably the biggest source of trouble for startups. In the early stages of creating an ETL pipeline, many companies simply dump their transactional data schema into a data warehouse, thinking OLAP databases are a magic bullet. However, OLAP databases require very different schemas to operate efficiently. These schemas are commonly known as star and snowflake schemas, and they take advantage of a warehouse’s columnar storage to improve query times for analytics use cases. Absent an analytics-oriented schema, companies can see worse query performance on an analytics database than they would get running the same query on a transactional database.
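To show the shape of a star schema, here’s a small, hypothetical sketch: a narrow fact table surrounded by denormalized dimension tables, plus the kind of aggregation query this layout serves. The table and column names are made up, and SQLite is used only to keep the example runnable; the pattern, not the engine, is the point.

```python
import sqlite3

# Hypothetical star schema: one fact table of measures, surrounded by
# denormalized dimension tables that hold the attributes you group by.
ddl = """
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name         TEXT,
    segment      TEXT,
    region       TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,  -- e.g. 20240131
    day      INTEGER,
    month    INTEGER,
    year     INTEGER
);
CREATE TABLE fact_orders (
    order_key    INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    amount       REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)

# A typical analytics query: aggregate the fact table, grouped by
# attributes that live on the (small) dimension tables.
# (No rows are inserted here; the shape of the schema and query is the point.)
query = """
SELECT d.year, c.region, SUM(f.amount) AS revenue
FROM fact_orders f
JOIN dim_date d     ON f.date_key = d.date_key
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY d.year, c.region;
"""
print(conn.execute(query).fetchall())
```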

The other pitfall is failing to optimize the data warehouse for the use cases of the various departments within the organization. For example, the marketing team may be running very different queries from the engineering or finance teams. This is where the concept of “data marts” comes into play: separate schemas optimized for each department within an organization. Data marts help reduce query times for specific analytics use cases, and they also prevent departments from running into each other and dominating warehouse cluster resources.
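Building on the hypothetical star schema sketched above, a data mart can be as simple as a pre-aggregated table or view shaped for one team’s questions, for example a daily-revenue table for marketing. The names here are illustrative only:

```python
# Hypothetical example of carving a department-specific "data mart"
# out of the shared warehouse model: a pre-aggregated table (or
# materialized view) shaped for the marketing team's queries.
# Assumes the fact_orders / dim_date / dim_customer tables from the
# earlier star-schema sketch.
marketing_mart = """
CREATE TABLE mart_marketing_daily_revenue AS
SELECT d.year, d.month, d.day, c.segment, SUM(f.amount) AS revenue
FROM fact_orders f
JOIN dim_date d     ON f.date_key = d.date_key
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY d.year, d.month, d.day, c.segment;
"""
```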

2. Unnecessary and Large Batch Processing

When data volumes were small, it might have been feasible to run a nightly ETL job to aggregate data from various sources and dump everything into a data warehouse. I’ve even observed companies tear down the entire warehouse, reload ALL of their data from scratch, and hope the job finishes before people start work the next morning.

But as data volumes grow, heavy batch jobs take longer and longer to complete, resulting in stale data and increased infrastructure costs. Companies at this stage need to consider micro-batch and streaming approaches that process and store data in small increments at regular intervals throughout a 24-hour cycle. Although there is added complexity, this approach lets you process only the deltas between the source and destination. The result is more real-time loading and analysis of data, and it avoids long and costly batch jobs.
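As a rough sketch of the idea, an incremental (micro-batch) load keeps a “high-watermark” of the last record it processed and only pulls the delta on each run. The functions below are hypothetical stand-ins for your actual source database, warehouse loader, and state store:

```python
from datetime import datetime, timedelta, timezone

def load_watermark() -> datetime:
    """Return the timestamp of the last successfully loaded record."""
    # In practice this comes from a metadata table or state store.
    return datetime(2024, 1, 1, tzinfo=timezone.utc)

def save_watermark(ts: datetime) -> None:
    print(f"new watermark: {ts.isoformat()}")

def source_fetch(since: datetime) -> list[dict]:
    """Fetch only the rows created or updated after `since` (the delta)."""
    # e.g. SELECT * FROM orders WHERE updated_at > :since
    return [{"id": 1, "updated_at": since + timedelta(minutes=5)}]

def warehouse_load(rows: list[dict]) -> None:
    print(f"loaded {len(rows)} rows")

def run_micro_batch() -> None:
    watermark = load_watermark()
    delta = source_fetch(since=watermark)
    if delta:
        warehouse_load(delta)
        save_watermark(max(row["updated_at"] for row in delta))

# Scheduled every few minutes instead of one giant nightly job.
run_micro_batch()
```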

3. The Devil’s In the Details

Finally, as companies begin to get serious about their data pipelines, they often underestimate the challenges and nuances of the job. Data engineering can be extremely tedious when it comes to processing and cleaning data, and companies new to the task may underestimate the time it takes to create ETL pipelines. Additionally, it can be difficult to reach the desired level of performance, security, and cost optimization without considering the granular details of your data and the underlying storage and processing technologies. Some of the details to keep in mind are:

Topology: Properly sorting and distributing data across nodes in the data warehouse, and co-locating tables that are frequently joined, to optimize query performance. This can be a rather labor-intensive process, and it requires a deep understanding of the underlying database technologies.

Compression: Choosing the right compression algorithms to optimize storage costs and query performance (see the short sketch after this list).

Benchmarking: Regularly experimenting and measuring the performance of different schemas and technologies to achieve the best results.

Maintenance: Performing regular maintenance operations to catalog, re-sort, and re-distribute the data in a data warehouse.

Workload Management: Allocating the right amount of compute and memory resources to various users and groups, to prevent any one user or group from dominating a storage or processing cluster.

Query Optimization: Determining the proper order of operations in a query to reduce latency and resource requirements.

Security: Securing data in transit and at rest according to industry and compliance standards.

Lifecycles: Transitioning data to different storage media depending on age and frequency of access to reduce storage costs and meet data retention requirements.
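As a toy illustration of how a couple of these details interact, the snippet below compresses the same low-cardinality column unsorted and then sorted; the sorted version compresses far better, which is one reason sort order (topology) and compression settings are usually tuned together. This is a back-of-the-envelope sketch, not a substitute for your warehouse’s own encoding and analysis tools.

```python
import random
import zlib

# A fake low-cardinality column (e.g. a country code) with 100k values.
values = [random.choice(["US", "DE", "FR", "JP", "BR"]) for _ in range(100_000)]

# Compress the column as-is vs. sorted: long runs of repeated values
# in the sorted version compress much more effectively.
unsorted_bytes = ",".join(values).encode()
sorted_bytes = ",".join(sorted(values)).encode()

print("unsorted:", len(zlib.compress(unsorted_bytes)), "bytes")
print("sorted:  ", len(zlib.compress(sorted_bytes)), "bytes")
```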

Conclusion

This article should help you avoid the common pitfalls companies face as data volumes and complexity continue to grow. Remember to:

  1. Optimize data schemas for transactional versus analytical processing.

  2. Leverage micro-batch and streaming approaches when possible to improve data freshness and keep costs under control.

  3. Get familiar with the details underlying big data technology to reap the full benefits this ecosystem of tools has to offer. Or, get an expert involved!

I’ve mainly focused on data warehousing and analytics queries, but what other pitfalls have you observed? Be sure to add to the list in the comments!

Finally, if the problems I’ve mentioned accurately describe your organization, feel free to get in touch with me, and I’d be happy to help.