Wednesday, October 30, 2024

What is a Glue Job in AWS?

An AWS Glue job is a managed ETL (Extract, Transform, Load) job used to process data in AWS. AWS Glue makes it easy to discover, prepare, and integrate data from various sources for analytics, machine learning, and application development.

How AWS Glue Jobs Work

AWS Glue jobs let you process large datasets with Apache Spark or run lightweight tasks with Python shell scripts. The main workflow includes:

  1. Data Extraction: Reading data from sources like Amazon S3, RDS, Redshift, etc.
  2. Data Transformation: Applying transformations to clean, enrich, or format the data.
  3. Data Loading: Writing the transformed data back to storage or analytical services.

Sample Glue Job Code

Below is an example of a Glue job script written in Python that reads data from an Amazon S3 bucket, applies a simple transformation, and writes the result back to another S3 bucket. This script uses the glueContext object, which is part of Glue’s Python API for Spark.


import sys

from awsglue.transforms import Filter
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Initialize the Glue and Spark contexts
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Step 1: Read JSON data from the source S3 bucket into a DynamicFrame
source_data = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'paths': ['s3://source-bucket/path/to/data']},
    format='json'
)

# Step 2: Apply a transformation (keep rows where 'age' > 30)
filtered_data = Filter.apply(frame=source_data, f=lambda row: row['age'] > 30)

# Step 3: Write the transformed data back to S3 in Parquet format
glueContext.write_dynamic_frame.from_options(
    frame=filtered_data,
    connection_type='s3',
    connection_options={'path': 's3://target-bucket/path/to/output'},
    format='parquet'
)

# Commit the job so Glue records a successful run
job.commit()

Explanation of the Code

  1. Initialization: Sets up the Glue job context, which provides the Spark session and AWS Glue API.
  2. Data Extraction: Reads JSON data from the source S3 bucket into a DynamicFrame, which is a Glue-specific data structure for Spark.
  3. Transformation: Filters records to include only those where the age field is greater than 30.
  4. Data Loading: Writes the transformed data back to an S3 bucket in Parquet format, which is optimized for analytics.
  5. Commit: Marks the job run as complete so that Glue records it as successful.
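The Filter step above is simply a row-level predicate: Glue keeps each record for which the function returns True. A minimal stand-alone illustration in plain Python (the sample records here are made up for demonstration):

```python
# Plain-Python illustration of what Filter.apply does to a DynamicFrame:
# keep only the records for which the predicate returns True.
records = [
    {'name': 'Ana', 'age': 34},
    {'name': 'Raj', 'age': 28},
    {'name': 'Mei', 'age': 41},
]

filtered = [row for row in records if row['age'] > 30]

print([row['name'] for row in filtered])  # ['Ana', 'Mei']
```

In the actual job, the same predicate runs in parallel across Spark partitions rather than over a single in-memory list.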

Features of AWS Glue Jobs

  • Job Scheduling and Triggers: AWS Glue jobs can run on a schedule, on-demand, or based on events.
  • Serverless and Scalable: Glue jobs scale automatically with the volume of data and remove the need to manage infrastructure.
  • Data Catalog Integration: Glue jobs can leverage the Glue Data Catalog, a central repository for storing metadata about data sources.
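For on-demand runs, a job defined in Glue can be started from any script via boto3. A minimal sketch, assuming a job named my-glue-job already exists and AWS credentials are configured (the helper function and job name are illustrative, not part of the Glue API):

```python
def start_glue_job(job_name, arguments=None, client=None):
    """Start an AWS Glue job run and return its run ID.

    `client` is a boto3 Glue client; if omitted, one is created,
    which requires AWS credentials to be configured.
    """
    if client is None:
        import boto3  # imported lazily so the helper is easy to test
        client = boto3.client('glue')
    response = client.start_job_run(
        JobName=job_name,
        Arguments=arguments or {},  # e.g. {'--JOB_NAME': 'my-glue-job'}
    )
    return response['JobRunId']


# Example (requires AWS access):
# run_id = start_glue_job('my-glue-job')
```

Scheduled and event-based runs are configured through Glue triggers (for example, a cron expression) rather than in the job script itself.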

AWS Glue jobs streamline data engineering tasks and are widely used in AWS-based data pipelines for data analytics and machine learning projects.




