Running Redshift Vacuum + Analyze through Airflow

Erik
3 min read · Feb 4, 2025


Amazon Redshift is a powerful data warehouse, but it requires manual maintenance to keep performance high. Unlike some other managed databases, Redshift does not automatically reclaim disk space or update statistics fully for query optimization. To address this, regularly running VACUUM and ANALYZE is essential.

Why VACUUM and ANALYZE

  • VACUUM re-sorts table rows and reclaims unused disk space, improving performance and reducing storage costs
  • ANALYZE updates table statistics to help Redshift’s query planner make better execution decisions
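
For reference, the statements themselves are plain SQL; with no table name they apply to every table in the connected database. A minimal sketch (not code from our DAG), shown as the strings we ultimately run:

# The two maintenance statements; with no table name they cover
# every table in the connected database.
maintenance_sql = [
    "VACUUM FULL;",  # re-sort rows and reclaim unused disk space
    "ANALYZE;",      # refresh the query planner's statistics
]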

At Sileon, we wanted to run VACUUM FULL+ANALYZE at least once every 24 hours for each database in our Redshift cluster. Initially, this seemed like a simple task: just schedule the commands using Airflow, our job orchestrator. However, we quickly ran into a major issue.

The Problem: Redshift doesn’t allow VACUUM in transactions

If you try to run VACUUM using a PostgresOperator or RedshiftSQLOperator in Airflow, you'll encounter this error:

ERROR: VACUUM cannot run inside a transaction block
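
For illustration, here is a minimal sketch (not taken from our DAG) of the kind of task that triggers it. With the operator's default transaction handling, the statement is wrapped in a transaction block, which Redshift rejects:

from airflow.providers.postgres.operators.postgres import PostgresOperator

# Inside a DAG definition; 'redshift_connection' is the same Airflow
# connection id used later in this post.
vacuum_task = PostgresOperator(
    task_id="vacuum_full_analyze",
    postgres_conn_id="redshift_connection",
    sql="VACUUM FULL; ANALYZE;",  # fails: VACUUM cannot run inside a transaction block
)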

What about putting the commands in a stored procedure and calling it every 24 hours? That would be convenient, but Redshift stored procedures run inside transactions, so that approach fails for the same reason: you are still faced with the error “VACUUM cannot run inside a transaction block”.

The Solution: Running VACUUM via a Shell Script

This led us to look at Linux-based utilities that could be run from our Amazon Managed Workflows for Apache Airflow (MWAA) environment. There is an AWS utility called amazon-redshift-utils, but it has many features and dependencies we don’t need, so it felt excessive for simple recurring maintenance. Instead, we looked for something more lightweight.

The result was a Linux shell script that runs the commands using psql, inspired by TheDataGuy (2020). The script reads its parameters from environment variables that are set in the Airflow DAG. The core of the Airflow task that runs the script looks roughly like this:

import subprocess

from airflow.hooks.base import BaseHook

# Pull the Redshift connection details from the Airflow connection store
redshift_conn_id = 'redshift_connection'
redshift_conn = BaseHook.get_connection(redshift_conn_id)

host = redshift_conn.host
port = redshift_conn.port
user = redshift_conn.login
password = redshift_conn.password
database = redshift_conn.schema

# The shell script reads its arguments from the environment variables
# exported further down in `command`
linux_command = f"""./vacuum-full-analyze-script.sh -h $REDSHIFT_MAINTENANCE_SECRET_REDSHIFT_HOST -u $REDSHIFT_MAINTENANCE_SECRET_REDSHIFT_USER -d $REDSHIFT_MAINTENANCE_SECRET_REDSHIFT_DATABASE -P $REDSHIFT_MAINTENANCE_SECRET_REDSHIFT_PASSWORD -p $REDSHIFT_MAINTENANCE_SECRET_REDSHIFT_PORT"""

print('running the following command: ' + linux_command)

# Copy the script directory (vacuum_analyze_script_path, which points at the
# redshift_utils folder containing the shell script) to /tmp, export the
# connection details as environment variables, make the script executable
# and run it
command = f"""cd {vacuum_analyze_script_path}; \
cp -R {vacuum_analyze_script_path} /tmp; \
cd /tmp/redshift_utils; \
export REDSHIFT_MAINTENANCE_SECRET_REDSHIFT_HOST={host}; \
export REDSHIFT_MAINTENANCE_SECRET_REDSHIFT_PORT={port}; \
export REDSHIFT_MAINTENANCE_SECRET_REDSHIFT_DATABASE={database}; \
export REDSHIFT_MAINTENANCE_SECRET_REDSHIFT_USER={user}; \
export REDSHIFT_MAINTENANCE_SECRET_REDSHIFT_PASSWORD='{password}'; \
chmod +x ./vacuum-full-analyze-script.sh; \
{linux_command}
"""

# Run the commands and capture the script's output
process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()

The Linux shell script that performs the commands can be found here on GitHub.
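
For completeness, here is a minimal sketch of how this logic can be scheduled. The DAG id, the schedule, and the run_redshift_maintenance callable are illustrative assumptions, not our exact DAG:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="redshift_vacuum_analyze",   # assumed name, for illustration
    schedule_interval="@daily",         # "at least once every 24 hours"
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(
        task_id="vacuum_full_analyze",
        # run_redshift_maintenance is a hypothetical callable that wraps the
        # snippet above (connection lookup, command build, subprocess call)
        python_callable=run_redshift_maintenance,
    )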

Additional DAG improvements

In addition to the core logic above, we added the following to our Airflow DAG:
- Error handling based on stdout and stderr content (see the sketch after this list)
- Different Linux syntax in the command depending on whether the DAG is running in MWAA (Prod/QA environments) or in the MWAA Local Runner (Dev environment on local laptops)
- Looping through each Redshift database, so that our prod MWAA instance can perform periodic maintenance for the Dev, QA and Prod databases
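
As a sketch of the first and last points: assume a hypothetical helper, run_vacuum_analyze(database), that builds and runs the command shown earlier for one database and returns the script's stdout and stderr. The database names and the strings matched in stderr are placeholders, not our exact implementation:

from airflow.exceptions import AirflowException

# Placeholder list; in our setup the prod MWAA instance maintains Dev, QA and Prod
databases = ["dev", "qa", "prod"]

for database in databases:
    # run_vacuum_analyze is a hypothetical wrapper around the subprocess call
    # shown earlier, parameterised on the target database
    stdout, stderr = run_vacuum_analyze(database)

    # psql reports failures on stderr; fail the Airflow task if anything
    # that looks like an error shows up
    if b"ERROR" in stderr or b"FATAL" in stderr:
        raise AirflowException(
            f"VACUUM/ANALYZE failed for {database}: {stderr.decode(errors='replace')}"
        )
    print(f"VACUUM/ANALYZE finished for {database}")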

This approach is a fairly simple way to schedule recurring VACUUM and ANALYZE runs the same way we schedule all our other data jobs.

If you have any suggestions for more convenient ways to run Redshift VACUUM + ANALYZE from Airflow, please reach out in the comment section.
