Welcome to the Spar Nord Bank ATM Transactions Analysis Project! This project demonstrates a batch ETL pipeline built with widely adopted tools and technologies: Apache Sqoop, Apache Spark (PySpark), Amazon S3, and Amazon Redshift. The objective is to extract transactional data from a MySQL RDS database, transform it, load it into Amazon Redshift, and then run analytical queries to derive valuable insights.
Spar Nord Bank needs to optimize its ATM refilling strategy by analyzing ATM usage patterns, particularly withdrawals, and their influencing factors such as weather, time, and location. Additionally, the bank seeks to gain insights into ATM failures and transaction behaviors to enhance overall ATM management and customer service.
The project involves the following steps:
- Data Extraction: Extract transactional data from MySQL RDS using Apache Sqoop and load it into HDFS.
- Data Transformation: Use Apache Spark (PySpark) to transform the extracted data into the target schema, creating the dimension and fact tables.
- Data Loading: Load the transformed data from HDFS to Amazon S3.
- Redshift Setup: Create a Redshift cluster, define the schema, and load the data from S3 to Redshift tables.
- Data Analysis: Perform analytical queries on the loaded data to answer business questions and derive insights.
The key technologies used are:
- Apache Sqoop: For data ingestion from MySQL RDS into HDFS.
- Apache Spark (PySpark): For data transformation and for creating the dimension and fact tables.
- Amazon S3: For staging the transformed data before loading it into Redshift.
- Amazon Redshift: For the data warehouse and the analytical queries.
The dataset used in this project was sourced from Kaggle and contains detailed ATM transactional data along with weather information at the time of the transactions from around 113 ATMs across Denmark for the year 2017. The dataset comprises approximately 2.5 million records and includes various fields such as transaction date and time, ATM status, ATM details, weather conditions, transaction details, and more.
The data model consists of four dimension tables and one fact table:
- ATM Dimension: Contains ATM-related data including ATM ID, manufacturer, and location reference.
- Location Dimension: Contains location data such as city, street name, street number, zip code, latitude, and longitude.
- Date Dimension: Contains time-related data including timestamp, year, month, day, hour, and weekday.
- Card Type Dimension: Contains information about different card types used in transactions.
- Transaction Fact: Contains numerical and transactional data such as currency, service type, transaction amount, message codes, and weather information.
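The star schema above can be sketched with hypothetical Python record types. Field names follow the descriptions in this section; the exact types are assumptions, and some fact-table fields are omitted for brevity:

```python
from dataclasses import dataclass

@dataclass
class LocationDim:
    location_id: int      # primary key
    location: str
    streetname: str
    street_number: int
    zipcode: int
    lat: float
    lon: float

@dataclass
class AtmDim:
    atm_id: int           # primary key
    atm_number: str
    atm_manufacturer: str
    atm_location_id: int  # foreign key -> LocationDim.location_id

@dataclass
class DateDim:
    date_id: int          # primary key
    full_date_time: str   # timestamp, kept as a string in this sketch
    year: int
    month: str
    day: int
    hour: int
    weekday: str

@dataclass
class CardTypeDim:
    card_type_id: int     # primary key
    card_type: str

@dataclass
class TransactionFact:
    trans_id: int
    atm_id: int           # -> AtmDim.atm_id
    weather_loc_id: int   # -> LocationDim.location_id
    date_id: int          # -> DateDim.date_id
    card_type_id: int     # -> CardTypeDim.card_type_id
    atm_status: str
    transaction_amount: int  # weather and message fields omitted here
```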
The approach for this project is divided into the following stages:
Data is first extracted from MySQL RDS into HDFS using Sqoop:

sqoop import \
--connect jdbc:mysql://upgraddetest.cyaielc9bmnf.us-east-1.rds.amazonaws.com/testdatabase \
--table SRC_ATM_TRANS \
--username student --password STUDENT123 \
--target-dir /user/ec2-user/ETL_Project/data \
-m 1

The import is then verified by listing the target directory:

hadoop fs -ls /user/ec2-user/ETL_Project/data

The transformation logic lives in PySpark-ETL.ipynb and is summarized below:
- Define the input schema using StructType to ensure correct data types.
- Read and verify the data from HDFS.
- Create and clean dimension tables by removing duplicates and ensuring proper primary keys.
- Create the transaction fact table by joining with the dimension tables and cleaning the data.
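The dedupe-and-assign-key step for a dimension table can be illustrated in plain Python (the project does this in PySpark; the function, field names, and sample data here are illustrative only):

```python
def build_dimension(rows, natural_keys):
    """Deduplicate rows on their natural attributes and assign a surrogate
    primary key. Returns (dimension, lookup): dimension maps surrogate id
    -> attributes, lookup maps the natural-key tuple -> surrogate id."""
    lookup = {}
    dimension = {}
    next_id = 1
    for row in rows:
        key = tuple(row[k] for k in natural_keys)
        if key not in lookup:          # drop duplicates
            lookup[key] = next_id      # surrogate primary key
            dimension[next_id] = {k: row[k] for k in natural_keys}
            next_id += 1
    return dimension, lookup

# Made-up card-type rows; the duplicate is removed and each distinct
# value gets a stable surrogate id, as in the card type dimension.
raw = [
    {"card_type": "MasterCard"},
    {"card_type": "VISA"},
    {"card_type": "MasterCard"},   # duplicate, dropped
]
dim, lookup = build_dimension(raw, ["card_type"])
```

The fact table is then built by replacing each natural value with its surrogate id from `lookup`, which is what the joins against the dimension tables accomplish in the PySpark notebook.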
A Redshift cluster is created, and tables are defined according to the target schema. Data is then copied from the S3 bucket into the respective Redshift tables.
A database is set up in the Redshift cluster, and the following queries create the schema, the dimension and fact tables, and load the data from S3.
- Creating Schema:

create schema etlschema;

- Creating Location Dimension Table:

create table etlschema.dim_location(
    location_id int,
    location varchar(50),
    streetname varchar(255),
    street_number int,
    zipcode int,
    lat decimal,
    lon decimal,
    primary key (location_id)
);

- Creating Card-Type Dimension Table:

create table etlschema.dim_card_type(
    card_type_id int,
    card_type varchar(50),
    primary key (card_type_id)
);

- Creating Date Dimension Table:

create table etlschema.dim_date(
    date_id int,
    full_date_time timestamp,
    year int,
    month varchar(50),
    day int,
    hour int,
    weekday varchar(50),
    primary key (date_id)
);

- Creating ATM Dimension Table:

create table etlschema.dim_atm(
    atm_id int,
    atm_number varchar(20),
    atm_manufacturer varchar(30),
    atm_location_id int,
    primary key (atm_id),
    foreign key (atm_location_id) references etlschema.dim_location(location_id)
);

- Creating the Fact ATM Transaction Table:

create table etlschema.fact_atm_trans(
    trans_id bigint,
    atm_id int,
    weather_loc_id int,
    date_id int,
    card_type_id int,
    atm_status varchar(50),
    currency varchar(20),
    service varchar(50),
    transaction_amount int,
    message_code varchar(50),
    message_text varchar(100),
    rain_3h decimal(10,3),
    clouds_all int,
    weather_id int,
    weather_main varchar(50),
    weather_description varchar(255),
    primary key (trans_id),
    foreign key (weather_loc_id) references etlschema.dim_location(location_id),
    foreign key (atm_id) references etlschema.dim_atm(atm_id),
    foreign key (date_id) references etlschema.dim_date(date_id),
    foreign key (card_type_id) references etlschema.dim_card_type(card_type_id)
);
- Location Dimension Table:

copy etlschema.dim_location(
    location_id, location, streetname, street_number, zipcode, lat, lon
)
from 's3://shnkreddy/dim-location/part-00000-eb7cc902-17c3-4895-9b11-9c6a5b0390ce-c000.csv'
iam_role 'arn:aws:iam::595818034157:role/redshift_s3_fullaccess'
delimiter ',' region 'us-east-1' CSV;

- ATM Dimension Table:

copy etlschema.dim_atm
from 's3://shnkreddy/dim-atm/part-00000-2c7564c6-b985-499c-97d0-936c1bd81da9-c000.csv'
iam_role 'arn:aws:iam::595818034157:role/redshift_s3_fullaccess'
delimiter ',' region 'us-east-1' CSV;

- Card Type Dimension Table:

copy etlschema.dim_card_type
from 's3://shnkreddy/dim-card-type/part-00000-81965790-3bf7-4eb6-bdc1-37002762c0d0-c000.csv'
iam_role 'arn:aws:iam::595818034157:role/redshift_s3_fullaccess'
delimiter ',' region 'us-east-1' CSV;

- Date Dimension Table:

copy etlschema.dim_date
from 's3://shnkreddy/dim-date/part-00000-e64864de-8000-4129-8a59-2059bb5f8b01-c000.csv'
iam_role 'arn:aws:iam::595818034157:role/redshift_s3_fullaccess'
delimiter ',' region 'us-east-1' TIMEFORMAT AS 'YYYYMMDDHHMISS' CSV;

- Fact Table:

copy etlschema.fact_atm_trans
from 's3://shnkreddy/fact-table/part-00000-954a52c1-346a-4f62-adf2-1645d650c796-c000.csv'
iam_role 'arn:aws:iam::595818034157:role/redshift_s3_fullaccess'
delimiter ',' region 'us-east-1' CSV;
- Top 10 ATMs with the most inactive transactions.
Query:
select a.atm_number, a.atm_manufacturer, l.location, count(a.atm_number) as total_transaction_count,
sum(case when atm_status = 'Inactive' then 1 else 0 end) as inactive_count,
round((inactive_count * 100.00 / total_transaction_count), 4) as inactive_count_percent
from etlschema.dim_atm a, etlschema.dim_location l, etlschema.fact_atm_trans f
where f.atm_id = a.atm_id and f.weather_loc_id = l.location_id
group by a.atm_number, a.atm_manufacturer, l.location
order by inactive_count DESC limit 10;

Note that the query must not filter on atm_status = 'Inactive' in the WHERE clause: doing so would make total_transaction_count count only inactive rows, so inactive_count_percent would always be 100.
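The percentage arithmetic these queries share (inactive rows as a share of all rows, rounded to four decimals) can be sketched in plain Python; the field name and sample data are illustrative:

```python
def inactive_stats(transactions):
    """Mirror the SQL aggregation: total count, inactive count via the
    CASE/SUM pattern, and the rounded inactive percentage."""
    total = len(transactions)
    inactive = sum(1 for t in transactions if t["atm_status"] == "Inactive")
    percent = round(inactive * 100.0 / total, 4) if total else 0.0
    return total, inactive, percent

# Made-up sample: 2 of 4 transactions hit an inactive ATM.
sample = [
    {"atm_status": "Active"},
    {"atm_status": "Inactive"},
    {"atm_status": "Active"},
    {"atm_status": "Inactive"},
]
# inactive_stats(sample) -> (4, 2, 50.0)
```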
- Number of ATM failures corresponding to the different weather conditions recorded at the time of the transactions

Query:

select weather_main, count(trans_id) as total_transaction_count,
sum(case when atm_status = 'Inactive' then 1 else 0 end) as inactive_count,
round((inactive_count * 100.00 / total_transaction_count), 4) as inactive_count_percent
from etlschema.fact_atm_trans
where ascii(weather_main) != 0
group by weather_main
order by inactive_count_percent DESC;
- Top 10 ATMs with the most number of transactions throughout the year

Query:

select a.atm_number, a.atm_manufacturer, l.location, count(a.atm_number) as total_transaction_count
from etlschema.dim_atm a, etlschema.dim_location l, etlschema.fact_atm_trans f
where f.atm_id = a.atm_id and f.weather_loc_id = l.location_id
group by a.atm_number, a.atm_manufacturer, l.location
order by total_transaction_count DESC limit 10;
- Number of ATM transactions going inactive per month

Query:

select d.year, d.month, count(f.trans_id) as total_transaction_count,
sum(case when f.atm_status = 'Inactive' then 1 else 0 end) as inactive_count,
round((inactive_count * 100.00 / total_transaction_count), 4) as inactive_count_percent
from etlschema.fact_atm_trans f, etlschema.dim_date d
where f.date_id = d.date_id
group by d.month, d.year
order by d.month;
- Top 10 ATMs with the highest total withdrawn amount throughout the year

Query:

select a.atm_number, a.atm_manufacturer, l.location, sum(f.transaction_amount) as total_transaction_amount
from etlschema.dim_atm a, etlschema.dim_location l, etlschema.fact_atm_trans f
where f.atm_id = a.atm_id and f.weather_loc_id = l.location_id
group by a.atm_number, a.atm_manufacturer, l.location
order by total_transaction_amount DESC limit 10;
- Number of failed ATM transactions across various card types

Query:

select c.card_type, count(f.trans_id) as total_transaction_count,
sum(case when f.atm_status = 'Inactive' then 1 else 0 end) as inactive_count,
round((inactive_count * 100.00 / total_transaction_count), 4) as inactive_count_percent
from etlschema.fact_atm_trans f, etlschema.dim_card_type c
where f.card_type_id = c.card_type_id
group by c.card_type
order by inactive_count_percent desc;
- Number of transactions happening on an ATM on weekdays and on weekends throughout the year, ordered by atm_number, atm_manufacturer, location, weekend_flag, and total_transaction_count

Query:

select a.atm_number, a.atm_manufacturer, l.location,
case when d.weekday in ('Saturday', 'Sunday') then 1 else 0 end as weekend_flag,
count(a.atm_number) as total_transaction_count
from etlschema.dim_atm a, etlschema.dim_location l, etlschema.fact_atm_trans f, etlschema.dim_date d
where f.atm_id = a.atm_id and f.weather_loc_id = l.location_id and f.date_id = d.date_id
group by a.atm_number, a.atm_manufacturer, l.location, weekend_flag
order by a.atm_number, a.atm_manufacturer, l.location, weekend_flag, total_transaction_count
limit 10;
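The weekend_flag logic in the CASE expression above can be checked with a small Python sketch (function name and sample days are illustrative):

```python
from collections import Counter

def weekend_flag(weekday: str) -> int:
    # Mirrors the SQL CASE: Saturday/Sunday -> 1, any other weekday -> 0.
    return 1 if weekday in ("Saturday", "Sunday") else 0

# Made-up sample: counting transactions per flag, as the GROUP BY does.
sample_days = ["Monday", "Saturday", "Sunday", "Friday", "Saturday"]
by_flag = Counter(weekend_flag(d) for d in sample_days)
# by_flag -> Counter({1: 3, 0: 2})
```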
- Most active day for each ATM in the location "Vejgaard"

Query:

select a.atm_number, a.atm_manufacturer, l.location, d.weekday, count(a.atm_number) as total_transaction_count
from etlschema.dim_atm a, etlschema.dim_location l, etlschema.fact_atm_trans f, etlschema.dim_date d
where f.atm_id = a.atm_id and f.weather_loc_id = l.location_id and f.date_id = d.date_id and l.location = 'Vejgaard'
group by a.atm_number, a.atm_manufacturer, l.location, d.weekday
order by total_transaction_count desc limit 2;
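The "busiest weekday per ATM" idea behind this query can be sketched in plain Python: count transactions per (ATM, weekday) pair, then keep each ATM's maximum. The function and sample data are illustrative, not the project's actual results:

```python
from collections import Counter

def most_active_day(trans):
    """Return {atm_number: (weekday, count)} for each ATM's busiest day."""
    counts = Counter((t["atm_number"], t["weekday"]) for t in trans)
    best = {}
    for (atm, weekday), n in counts.items():
        if atm not in best or n > best[atm][1]:
            best[atm] = (weekday, n)
    return best

# Made-up transactions for one ATM.
sample = [
    {"atm_number": "85", "weekday": "Friday"},
    {"atm_number": "85", "weekday": "Friday"},
    {"atm_number": "85", "weekday": "Monday"},
]
# most_active_day(sample) -> {"85": ("Friday", 2)}
```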
This project provides a comprehensive ETL pipeline to process and analyze ATM transactional data, helping Spar Nord Bank optimize ATM management and gain valuable insights into ATM usage patterns. The approach demonstrates effective use of modern data engineering tools and cloud services to solve real-world business problems.