Article

Best practices for data transformation

Camille Lebrun

Overview

In today’s data-driven world, effective data transformation is essential for ensuring that your data is not only accurate but also consistent and ready for insightful analysis. Whether you’re developing dashboards, creating reports, or feeding machine learning models, a well-crafted data transformation strategy is key to deriving actionable insights and making informed decisions.

This guide explores best practices for data transformation and demonstrates how leveraging tools like dbt (data build tool) can significantly enhance your data workflows. By adhering to these practices, you can improve performance, streamline maintenance, and foster more effective collaboration within your data team.

Core Principles

Collaboration and Modern Stack

  • Version Control: Implement version control for all SQL code to facilitate reusability, traceability, and seamless maintenance. This practice allows team members to work together efficiently, track changes, and manage code versions effectively.
  • Modern Tools: Embrace cutting-edge data tools such as Google BigQuery for advanced SQL querying and dbt for robust ELT (as code) processes. These tools are designed to enhance performance, scalability, and overall efficiency, enabling you to handle large datasets with ease.

Query Optimization

  • BigQuery Functions: Leverage the powerful functions offered by BigQuery to optimize query performance. Utilizing built-in functions and best practices can significantly reduce processing time and resource consumption.
  • Minimize Data Processing: Optimize your queries by selecting only the fields you need and avoiding unnecessary data processing. This approach not only improves query performance but also reduces costs associated with data handling and storage.

We use Jaffle Shop dataset for the examples in this guide and other simple schema.

Best Practices

Use qualify clause

Utilize the qualify clause to filter the results of window functions, avoiding nested subqueries for cleaner, more readable queries.

Good practice example:

select
order_id,
customer_id,
order_total
from {{ ref('orders') }}
qualify row_number() over (partition by customer_id order by order_date desc) = 1

In this example :

  • Direct query: This query directly references the orders table (or a similar data source through the {{ ref(‘orders’) }} macro in dbt).
  • Row number function: The row_number() window function is used to assign a unique sequential integer to rows within a partition of customer_id, ordered by ordered_at in descending order. This ensures the most recent order gets the number 1.
  • qualify clause: The qualify clause filters the results of the window function. It selects only the rows where row_number() = 1, effectively returning the latest order for each customer.
  • Efficiency: By using the qualify clause, the query avoids the need for a common table expression (CTE), reducing complexity and potentially improving performance, as the filtering is done in one step.

Example to avoid:

with orders as (
select
order_id,
customer_id,
order_total,
row_number() over (partition by customer_id order by order_date desc) as rk
from {{ ref('orders') }}
)

select * from orders
where rk = 1

In this example :

  • Common Table Expression (CTE): This query first creates a CTE named cte_orders. The CTE includes all columns from the orders table and computes the row number (rk) for each row within partitions of customer_id, ordered by ordered_at in descending order.
  • Additional Step: The main query then selects from this CTE and filters rows where rk = 1, meaning only the most recent orders per customer are included.
  • Redundancy: Using a CTE in this context adds an extra step to the query process, making it slightly more complex and potentially less efficient, as it requires scanning and materializing the CTE before filtering.
  • Potential Performance Impact: While the use of CTEs is not inherently bad, in this scenario, the qualify clause in the Good Practice Example provides a more streamlined approach by eliminating the intermediate CTE, which could be more efficient.

Leveraging the QUALIFY clause with window functions results in cleaner, more efficient queries by avoiding unnecessary intermediate steps and enhancing readability.

Import only required fields

Always select only the fields you need. This practice reduces the amount of data processed and speeds up query execution.

Good practice example

with
cte_customers as (

select
customer_id,
customer_name
from {{ ref('customers') }}

),

cte_orders as (

select
customer_id,
order_total,
ordered_at
from {{ ref('orders') }}

)

select
customers.customer_name,
orders.order_total,
orders.ordered_at
from cte_customers as customers
left join cte_orders as orders on customers.customer_id = orders.customer_id

In this example :

  • We use two CTEs (cte_customers and cte_orders) to select only the necessary fields.
  • The cte_customers CTE selects customer_id, customer_name from the customers table.
  • The cte_orders CTE selects customer_id, order_total and ordered_at from the orders table.
  • The final query joins these two CTEs on customer_id.
Always select only the fields you need to optimize query performance and data handling.

Example to avoid:

with
cte_customers as (
select * from {{ ref('customers') }}
),

cte_orders as (
select * from {{ ref('orders') }}
)

select
customers.customer_name,
orders.order_total,
orders.ordered_at
from cte_customers as customers
left join cte_orders as orders on customers.customer_id = orders.customer_id

In this example :

  • Unnecessary import: Both CTEs use SELECT *, importing all fields from the customers and orders, even though only a few fields are needed.
  • Performance: This leads to unnecessary data processing and can slow down query performance.

Avoid Subqueries

CTEs improve the readability and maintainability of your SQL code. They simplify complex queries and make them easier to understand by breaking them into logical, reusable components.

Good practice example

with
cte_customers as (

select
customer_id,
customer_name
from {{ ref('customers') }}

),

cte_orders as (

select
customer_id,
order_total,
ordered_at
from {{ ref('orders') }}

)

select
customers.customer_name,
orders.order_total,
orders.ordered_at
from cte_customers as customers
left join cte_orders as orders on customers.customer_id = orders.customer_id

In this example :

  • The cte_customers CTE isolates customer information.
  • The cte_orders CTE isolates the orders data relevant to the query.
  • The final SELECT statement joins these CTEs and applies a filter to get the desired results.

Example to avoid:

select
customers.customer_name,
orders.order_total,
orders.ordered_at
from
(select
customer_id,
customer_name
from {{ ref('customers') }}) as customers
left join
(select
customer_id,
order_total,
ordered_at
from {{ ref('orders') }}) as orders
on
customers.customer_id = orders.customer_id

In this example:

  • The query uses nested subqueries to achieve the same result.
  • This approach can be harder to read and maintain, especially as queries become more complex.
  • It’s more difficult to reuse or debug individual parts of the query.

Using CTEs instead of nested subqueries enhances the readability and maintainability of SQL code. CTEs allow you to break down complex queries into manageable pieces, making them easier to understand and modify. Always aim to use CTEs for clarity and better structure in your SQL queries.

Avoid RIGHT JOIN

Replace right joins with left joins to maintain consistency and readability in your queries. Right joins can often be replaced with left joins, which are more intuitive and widely used in SQL queries.

Good practice example

Using a left join to maintain consistency and readability:

with cte_orders_items as (
select
order_id,
product_id
from {{ ref('stg_order_items') }}
),

cte_orders as (
select
order_id,
ordered_at,
order_total
from {{ ref('stg_orders') }}
),

cte_products as (
select * from {{ ref('stg_products') }}
)

select
orders.order_id,
orders.ordered_at,
products.product_name,
products.product_price
from cte_orders_items as order_items
left join cte_orders as orders on order_items.order_id = orders.order_id
left join cte_products as products on order_items.product_id = products.product_id
  • A left join is used to join cte_orders_items with cte_orders on the order_id field and cte_products on the product_id field.
  • This query retrieves all records from cte_orders_items and the matching records from cte_orders and cte_products. If there is no match, NULL values are returned for fields from cte_orders and cte_products.
  • Using a left join ensures that all rows from the left table (cte_orders_items) are included, which is often the intended logic when using joins.

Example to avoid

with cte_orders_items as (
select
order_id,
product_id
from {{ ref('stg_order_items') }}
),

cte_orders as (
select
order_id,
ordered_at,
order_total
from {{ ref('stg_orders') }}
),

cte_products as (
select * from {{ ref('stg_products') }}
)

select
orders.order_id,
orders.ordered_at,
products.product_name,
products.product_price
from cte_orders_items as order_items
right join cte_orders as orders on order_items.order_id = orders.order_id
right join cte_products as products on order_items.product_id = products.product_id
  • A right join is used to join cte_orders_items with cte_orders on the order_id field and cte_products on the product_id field.
  • This query retrieves all records from cte_orders and cte_products and the matching records from cte_orders_items. If there is no match, NULL values are returned for fields from cte_orders_items.

Replacing right joins with left joins enhances the consistency and readability of your SQL queries. Left joins are more intuitive and commonly used, making it easier to understand and maintain your queries. Whenever possible, opt for left joins to simplify your query logic and improve clarity.

Avoid Using Keywords for Field Names

Using SQL keywords as field names can lead to confusion and errors in your queries. It’s best to avoid naming fields with reserved keywords such as date, version, order, version. Instead, use more descriptive and unique names to ensure clarity and avoid potential conflicts.

Good practice example

Using descriptive and non-conflicting field names:

with cte_sales as (
select
sale_id,
date as transaction_date,
sale_amount
from sales_table
where sale_date >= '2023–01–01'
),

cte_customers as (
select
customer_id,
customer_name
from customers_table
)

select
s.sale_id,
s.transaction_date,
s.sale_amount,
c.customer_name
from cte_sales as s
inner join cte_customers as c on s.customer_id = c.customer_id
where s.sale_amount > 100

In this example:

  • The field date is renamed to transaction_date to avoid using the keyword date.
  • Descriptive names like transaction_date and sale_amount are used to enhance clarity and prevent ambiguity.

Example to avoid

Using SQL keywords as field names:

with cte_sales as (
select
sale_id,
date,
amount as sale_amount
from sales_table
where date >= '2023–01–01'
),

cte_customers as (
select
id as customer_id,
name as customer_name
from customers_table
)

select
s.sale_id,
s.date,
s.sale_amount,
c.customer_name
from cte_sales as s
inner join cte_customers as c on s.customer_id = c.customer_id
where s.sale_amount > 100

In this example:

  • The field names date and amount use SQL keywords, which can lead to potential issues or confusion.
  • Using reserved keywords as field names may cause syntax errors or ambiguity, particularly if the SQL parser interprets these names differently.

Avoid using SQL keywords as field names to prevent confusion and potential errors in your queries. Opt for descriptive, non-conflicting names that clearly convey the purpose of each field. This practice helps maintain clarity and ensures that your SQL code is robust and error-free.

Use exact matches instead of LIKE when appropriate

When checking for exact values, use = instead of LIKE for better performance and clarity. The LIKE operator is more suited for pattern matching with wildcard characters. If you are looking for an exact match, = is more efficient and makes your intent clearer.

Good practice example

Using exact matches for clarity and performance:

with cte_events as (
select
event_id,
event_type,
event_date
from {{ ref('events') }}
where event_date >= '2024–01–01'
)

select
event_id,
event_type,
event_date
from cte_events
where event_type = 'connection'

In this example:

  • The WHERE clause uses = to check for an exact match of event_type.
  • This approach is clear and efficient, as it directly compares the field to the specified value without additional pattern matching overhead.

Example to avoid

with cte_events as (
select
event_id,
event_type,
event_date
from {{ ref('events') }}
where event_date >= '2024–01–01'
)

select
event_id,
event_type,
event_date
from cte_events
where event_type like 'connection'

For exact value matches, use the = operator instead of LIKE to improve query performance and clarity. LIKE should be reserved for scenarios where pattern matching is required, such as searching with wildcards. Employing = for exact matches ensures your queries are more efficient and your intentions are clearly communicated.

Tools and Configurations

Linter

Use a linter with a common configuration to enforce coding standards and ensure consistency across the team such as SQLFluff

BigQuery Anti-Pattern Tool

Utilize BigQuery anti-pattern tools to identify and correct inefficient query patterns, further optimizing performance.

Conclusion

By following these best practices, we ensure that our BigQuery queries are efficient, maintainable, and scalable. Adhering to these guidelines helps us work collaboratively and effectively, maximizing the potential of BigQuery’s capabilities.

Follow me for more advice about dbt, data engineering, and explore my portfolio!

I’m Camille, currently an apprentice at an e-Retail company as an Analytics Engineer. I’m passionate about data engineering and always eager to share my knowledge and experiences.

Enjoy the deep dive? Clap on Medium to back the research and keep these technical notes visible for analytics engineers who need them.

Contact me