AWS Glue Studio Guide—How to Build Data Pipelines Without Writing Code

Written by Lidia Kurasińska | Oct 29, 2021

You’ve probably heard that creating ETL (extract, transform, load) pipelines, especially complex ones, is a complicated task. Various tools have been developed to make the process easier, but most of them still require some knowledge of a programming language (for example, Python or R), often combined with an understanding of tools such as Spark.

AWS Glue itself launched back in August 2017. In November 2020, AWS released Glue DataBrew, a tool perfect for data and business analysts, since it facilitates data preparation and profiling. Around the same time, the company released AWS Glue Studio, a visual tool to create, run, and monitor Glue ETL jobs.

AWS Glue Studio supports various types of data sources, such as S3, the Glue Data Catalog, Amazon Redshift, RDS, MySQL, PostgreSQL, and even streaming services, including Kinesis and Kafka. Out of the box, it offers many transformations, such as ApplyMapping, SelectFields, DropFields, Filter, FillMissingValues, and SparkSQL, to name just a few. We can save the results of our jobs to Amazon S3 or to tables defined in the AWS Glue Data Catalog.

Also, apparently, we can use it all without knowing Spark, as Glue Studio will generate Apache Spark code for us.
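To give a flavor of what that generated code looks like, here is a minimal sketch of the kind of PySpark script Glue Studio typically produces for a simple catalog-to-S3 job. The database name, table name, column mappings, and bucket path below are hypothetical placeholders, not taken from the article:

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job boilerplate: resolve job arguments, set up contexts
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog
# ("ecommerce_db" and "orders" are placeholder names)
source = glueContext.create_dynamic_frame.from_catalog(
    database="ecommerce_db",
    table_name="orders",
)

# Rename and cast columns with the ApplyMapping transform;
# each tuple is (source column, source type, target column, target type)
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("invoiceno", "string", "invoice_no", "string"),
        ("quantity", "long", "quantity", "int"),
    ],
)

# Write the result to S3 as Parquet (placeholder bucket path)
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-example-bucket/output/"},
    format="parquet",
)

job.commit()
```

In Glue Studio, each node you drag onto the visual canvas corresponds to one of these steps, and the tool stitches them together into a script like the one above.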

So, let’s see in practice what we can do with AWS Glue Studio. I promised myself that I would try not to write a single line of code while solving my example use case.

In this article, I used a slightly modified version of the E-Commerce Data dataset.