An AWS Glue job is a managed ETL (Extract, Transform, Load) job used to process data in AWS. AWS Glue makes it easy to discover, prepare, and integrate data from various sources for analytics, machine learning, and application development.
How AWS Glue Jobs Work
AWS Glue jobs let you process large datasets using Apache Spark or small tasks with Python Shell scripts. The main workflow includes:
- Data Extraction: Reading data from sources like Amazon S3, RDS, Redshift, etc.
- Data Transformation: Applying transformations to clean, enrich, or format the data.
- Data Loading: Writing the transformed data back to storage or analytical services.
Sample Glue Job Code
Below is an example of a Glue job script written in Python that reads JSON data from an Amazon S3 bucket, applies a simple filter, and writes the result to another S3 bucket in Parquet format. The script uses the glueContext object, which is part of Glue’s Python API for Spark.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.job import Job
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
# Initialize Glue context
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Step 1: Read data from S3
source_data = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'paths': ['s3://source-bucket/path/to/data']},
    format='json'
)
# Step 2: Apply transformation (Filter rows where 'age' > 30)
filtered_data = Filter.apply(frame=source_data, f=lambda row: row['age'] > 30)
# Step 3: Write transformed data back to S3
output = glueContext.write_dynamic_frame.from_options(
    frame=filtered_data,
    connection_type='s3',
    connection_options={'path': 's3://target-bucket/path/to/output'},
    format='parquet'
)
# Commit the job
job.commit()
Explanation of the Code
- Initialization: Sets up the Glue job context, which provides the Spark session and AWS Glue API.
- Data Extraction: Reads JSON data from the source S3 bucket into a DynamicFrame, which is a Glue-specific data structure for Spark.
- Transformation: Filters records to include only those where the age field is greater than 30.
- Data Loading: Writes the transformed data to the target S3 bucket in Parquet format, which is optimized for analytics.
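- Commit: Signals that the job run is complete. When job bookmarks are enabled, this call also saves the bookmark state so the next run can skip data that was already processed.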
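The filter step above assumes every record has a numeric age field. Real JSON data is often messier, so here is a minimal sketch of a more defensive predicate. The function name is_older_than and the sample rows are hypothetical, and plain dictionaries stand in for Glue records so the logic can be checked locally without a Glue environment.

```python
def is_older_than(row, min_age=30):
    """Return True only when the record has a numeric 'age' above min_age."""
    try:
        age = row["age"]
    except (KeyError, TypeError):
        # The field is absent, or the row cannot be indexed by key.
        return False
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(age, bool) or not isinstance(age, (int, float)):
        return False
    return age > min_age


# Hypothetical sample data: plain dicts standing in for Glue records.
sample_rows = [
    {"name": "Ana", "age": 42},
    {"name": "Bo", "age": 30},
    {"name": "Cy"},
    {"name": "Di", "age": None},
    {"name": "Ed", "age": "51"},
]

kept = [row["name"] for row in sample_rows if is_older_than(row)]
print(kept)  # ['Ana']
```

Inside the job, a named function like this could be passed in place of the lambda, for example Filter.apply(frame=source_data, f=is_older_than). Whether string ages such as "51" should be dropped or converted is a design choice; this sketch drops them.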
Features of AWS Glue Jobs
- Job Scheduling and Triggers: AWS Glue jobs can run on a schedule, on demand, or in response to events.
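- Serverless and Scalable: AWS Glue provisions and manages the underlying Spark infrastructure, so there are no clusters to maintain. You choose the worker type and number of workers for a job, and on Glue 3.0 or later you can enable auto scaling so that capacity follows the workload.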
- Data Catalog Integration: Glue jobs can leverage the Glue Data Catalog, a central repository for storing metadata about data sources.
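As an illustration of the scheduling feature, the sketch below builds the request for a scheduled trigger and passes it to the create_trigger call of the boto3 Glue client. The trigger name, job name, and cron expression are hypothetical placeholders, and the helper function names are my own. The request builder is kept separate from the AWS call so it can be inspected without credentials.

```python
def build_scheduled_trigger_request(trigger_name, job_name, cron_expression):
    """Build keyword arguments for the boto3 Glue client's create_trigger call."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        # Glue schedules use a six-field cron expression evaluated in UTC.
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def create_scheduled_trigger(request):
    """Send the request to AWS Glue. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    glue = boto3.client("glue")
    return glue.create_trigger(**request)


# Hypothetical names; replace them with your own trigger and job.
request = build_scheduled_trigger_request(
    trigger_name="nightly-filter-trigger",
    job_name="filter-age-job",
    cron_expression="0 2 * * ? *",  # every day at 02:00 UTC
)
print(request["Schedule"])  # cron(0 2 * * ? *)
```

Calling create_scheduled_trigger(request) would register the trigger in the account and region configured for boto3; it is not invoked here because it needs live AWS access.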
AWS Glue jobs streamline data engineering tasks and are widely used in AWS-based data pipelines for data analytics and machine learning projects.