ETL with AWS Glue

ruchi khanuja
2 min read · May 21, 2021

Are you trying to set up ETL using AWS Glue? AWS Glue is an ETL service provided by AWS. It is serverless, so you pay only for the resources used and not for the underlying infrastructure. Before you start with AWS Glue, here are some of the components you should be aware of in order to set up an ETL process with it:

  1. Data Store → The source of the data used in the Glue ETL process, e.g. Amazon S3.
  2. Crawler → A Glue crawler fetches schema information and other metadata from the various data sources.
  3. Classifiers → Recognize the format of the data, e.g. CSV, JSON, Parquet, Avro, Amazon Redshift, MySQL, etc.
  4. Data Catalog → Populated by the crawler with metadata about the data sources: database and table definitions, column names, and their data types. It also stores job definitions.
  5. Job → A job is defined in the ETL section of Glue to transform data from the source and load it into the target. Glue automatically generates the script (Python or Scala) to do this, or the developer can write it. The developer can also define the mapping between source and target columns.
  6. Scheduling → Jobs in Glue can be scheduled in various ways, e.g. by a time-based trigger or in response to an event (see the boto3 sketch after this list).
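
To make these components concrete, here is a minimal boto3 sketch that creates a classifier, a crawler, and a time-based trigger programmatically (the steps later in this article use the console instead). All names, the IAM role ARN, the S3 path, and the cron schedule are placeholder assumptions:

```python
import boto3

glue = boto3.client("glue")

# Classifier: tells the crawler how to interpret CSV files with a header row
glue.create_classifier(
    CsvClassifier={
        "Name": "my-csv-classifier",  # placeholder name
        "Delimiter": ",",
        "ContainsHeader": "PRESENT",
    }
)

# Crawler: scans the S3 data store and populates the Data Catalog
glue.create_crawler(
    Name="my-s3-crawler",                                     # placeholder
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",  # placeholder
    DatabaseName="my_catalog_db",
    Classifiers=["my-csv-classifier"],
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/"}]},
)
glue.start_crawler(Name="my-s3-crawler")

# Scheduling: a time-based trigger that runs a job daily at 02:00 UTC
glue.create_trigger(
    Name="nightly-etl-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "my-etl-job"}],  # placeholder job name
    StartOnCreation=True,
)
```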

Once you are familiar with the components of AWS Glue, here are the high-level steps for loading data from S3 to Redshift with basic transformations:

  1. Add a classifier if required for the data format, e.g. CSV in this case.
  2. Create a Glue crawler that fetches schema information from the source, which is S3 in this case.
  3. Run the Glue crawler from step 2 to create a database and table underneath that represent the source (S3).
  4. Create a connection pointing to Redshift: select the Redshift cluster and database that were configured beforehand. Redshift is the target in this case.
  5. Create another Glue crawler that fetches schema information from the target, which is Redshift in this case. While creating this crawler, choose the Redshift connection defined in step 4 and provide the table info/pattern from Redshift.
  6. Run the Glue crawler created in step 5, which represents the target (Redshift).
  7. Create a Glue job in the ETL section of Glue to transform data from the source and load it into the target. Choose the source and target tables created in steps 1-6. Glue automatically generates the ETL script (Python or Scala), or the developer can write or edit it. The developer can also define the mapping between source and target columns, change the data types of columns, or add additional columns.
  8. Click "Save job and edit script"; this takes you to a console where the developer can edit the script automatically generated by AWS Glue.
  9. Save and run the job to execute the ETL process between S3 and Redshift (a sketch of such a script follows these steps).
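
For reference, here is a minimal sketch of what a Glue-generated PySpark job script for steps 7-9 typically looks like. The database, table names, column mappings, and S3 staging path are placeholder assumptions, not values from this article:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: the catalog table created by the S3 crawler (steps 2-3)
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_catalog_db",  # placeholder database
    table_name="raw_orders",   # placeholder source table
)

# Transformation: rename columns and cast data types (step 7).
# Each mapping is (source column, source type, target column, target type).
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "int"),
        ("order_date", "string", "order_date", "timestamp"),
        ("amount", "string", "amount", "double"),
    ],
)

# Target: the Redshift table discovered by the second crawler (steps 4-6);
# redshift_tmp_dir is an S3 staging location Glue uses for the Redshift COPY
glueContext.write_dynamic_frame.from_catalog(
    frame=mapped,
    database="my_catalog_db",
    table_name="redshift_public_orders",  # placeholder target table
    redshift_tmp_dir="s3://my-bucket/glue-temp/",
)

job.commit()
```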

More sample Glue script code can be found here:

https://github.com/aws-samples/aws-glue-samples
