What Do I Do As A Data Engineer?
Article • 1,761 Words • Software Engineering/Career • 01/12/2024
What is a Data Engineer?
From The Pragmatic Engineer: “Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning.”
Data pipelines are at the heart of any modern data infrastructure, and they are what most Data Engineers spend their time working on in some capacity. A data pipeline is an orchestrated set of processes that takes data from ingestion through transformation to storage. One of the most important parts of a data pipeline is the ETL process. ETL stands for Extract, Transform, Load, and it is a way of combining data from multiple sources and preparing it from its raw state into something more usable for business analytics or machine learning.
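To make the ETL idea concrete, here is a minimal sketch in Python using pandas; the file names and columns are made up purely for illustration.

```python
import pandas as pd

# Extract: pull raw data from two hypothetical sources (a CSV export and a JSON feed).
orders = pd.read_csv("raw/orders.csv")          # e.g. order_id, customer_id, amount_cents
customers = pd.read_json("raw/customers.json")  # e.g. customer_id, name, region

# Transform: join the sources, derive a business-friendly column, and drop the raw one.
combined = orders.merge(customers, on="customer_id", how="left")
combined["amount_dollars"] = combined["amount_cents"] / 100
combined = combined.drop(columns=["amount_cents"])

# Load: write the prepared data somewhere analysts or models can consume it.
combined.to_parquet("curated/orders_enriched.parquet", index=False)
```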
I did not go into the workforce expecting to be a Data Engineer; instead, I kind of fell into it. I went to school for Computer Science, and all of my internships were related to Software Engineering. To be fair, Data Engineering really is just a specialization of Software Engineering, but it comes with a lot of new terms and processes that I wasn’t able to gain exposure to through prior work or school. I think Data Engineering is still a burgeoning field, so I wanted to write a little bit about my experiences to demystify the line of work. I do want to preface this by saying that every Data Engineer will be doing something different, as the business needs and infrastructure of each company will be very different; this is all just my personal experience.
Project #1: Lambda Data Pipelines
This project was for updating two AWS Lambda data pipelines that were used for data reporting. As part of an enterprise mandate, we had to update these data pipelines so that they would load data into a more permanent location in our data warehouse/data lake. I was in charge of implementing this migration, so I created a library to pull the historical data from our temporary SQL database tables and upload it to our data warehouse/lake using our internal data platform. Then I also updated the Python code for the Lambdas to use the same data platform so that they would upload to the same locations in the data warehouse and data lake.
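To give a sense of what a historical backfill like this can look like, here is a simplified, hypothetical sketch using SQLAlchemy, pandas, and boto3. The connection string, table name, and bucket are placeholders, and the real pipeline went through our internal data platform rather than writing to S3 directly.

```python
import io

import boto3
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and locations; the real migration used internal components.
engine = create_engine("postgresql://user:pass@reporting-db:5432/reports")
s3 = boto3.client("s3")
TARGET_BUCKET = "example-data-lake"

def backfill_table(table_name: str, chunk_size: int = 50_000) -> None:
    """Pull historical rows from a temporary SQL table and land them in the data lake as Parquet."""
    for i, chunk in enumerate(
        pd.read_sql(f"SELECT * FROM {table_name}", engine, chunksize=chunk_size)
    ):
        buffer = io.BytesIO()
        chunk.to_parquet(buffer, index=False)
        s3.put_object(
            Bucket=TARGET_BUCKET,
            Key=f"historical/{table_name}/part-{i:05d}.parquet",
            Body=buffer.getvalue(),
        )

backfill_table("temp_reporting_orders")
```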
As an aside, I don’t think that Lambdas are the best way to develop a data pipeline, but with that said, they do have a place in effective data pipelines.
Project #2: Internal Data Platform
This project was a data platform that was meant as a place for downstream consumers to come for high-quality, standardized, centralized data. We were drawing from various internal data sources, standardizing the data, and allowing people to consume it from a single location. I really enjoyed working on this project because we were building something that did not exist before. Additionally, since it was a platform, there wasn’t just technical work but also work in creating the people processes for how to contribute to and use the platform. I felt like I had a lot of freedom in what to work on within the platform and that my opinions were taken seriously even though I was relatively new to the company.
Airflow Data Pipelining
This data platform was made up of various data pipelines orchestrated by Airflow, and the data would all end up in both our data warehouse and data lake for consumption. I worked a bit directly on the data pipelines by writing DAGs, Operators, and Sensors, as well as deploying some AWS infrastructure for the Airflow instance, but my main work was around managing our data models and dataset registrations for our internal data catalog.
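For anyone unfamiliar with Airflow, a DAG along these lines is roughly what that pipeline work involves; the task names, file path, and schedule below are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor

def extract():
    # Pull data from an upstream source (placeholder).
    ...

def transform_and_load():
    # Standardize the data and write it to the warehouse and data lake (placeholder).
    ...

with DAG(
    dag_id="example_ingest_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Wait for the upstream file to arrive, then extract and load it.
    wait_for_file = FileSensor(task_id="wait_for_source_file", filepath="/data/incoming/source.csv")
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="transform_and_load", python_callable=transform_and_load)

    wait_for_file >> extract_task >> load_task
```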
Data Modeling and Dataset Registration
Our data was modeled to be stored across various data zones, which is a common architecture for data lakes. For each dataset, we would make certain modifications so that the dataset schema would be different in each of the zones, as some fields would be dropped or added. I initially had to register those datasets manually, but it quickly became overwhelming, especially as the number of data pipelines we were managing increased. To solve this, I wrote a few libraries, each tackling a different problem.
My first library was written in Python for dataset schema generation. For our data catalog, you could register a dataset’s schema either by using the UI or by uploading an AVRO schema file, a format created by Apache. We were modeling our data in Excel to keep track of the various datasets and how the field names changed across the data zones. This meant that my library had to be able to parse the Excel file and generate an AVRO schema from it. One of our Distinguished Engineers wanted us to be able to create ER Diagrams to show consumers how our data connects with each other, so I added ER Diagram table entity generation to my schema generation script, making it a more holistic data modeling library.
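A heavily simplified sketch of that approach is below. The spreadsheet layout (one row per field, with a column holding each zone’s field name plus the AVRO type and nullability) is an assumption for illustration, not our exact data model format.

```python
import json

import pandas as pd

# Assumed Excel layout: one row per field, with a column for the field name in each
# zone, the AVRO type, and a boolean "nullable" column.
model = pd.read_excel("data_model.xlsx", sheet_name="orders")

def build_avro_schema(df: pd.DataFrame, dataset_name: str, zone_column: str) -> dict:
    """Generate an AVRO record schema for one data zone from the modeling spreadsheet."""
    fields = []
    for _, row in df.iterrows():
        field_name = row[zone_column]
        if pd.isna(field_name):
            continue  # this field is dropped in this zone
        avro_type = row["avro_type"]
        if row["nullable"]:
            avro_type = ["null", avro_type]
        fields.append({"name": field_name, "type": avro_type})
    return {"type": "record", "name": dataset_name, "fields": fields}

schema = build_avro_schema(model, "orders", zone_column="curated_zone_name")
with open("orders_curated.avsc", "w") as f:
    json.dump(schema, f, indent=2)
```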
From there, the team that managed the data catalog released a REST API to register datasets. So my next library was about taking that generated schema file and combining it with dataset metadata so that I could automatically register datasets. This was a bit of a pain because it was not as simple as uploading an AVRO file and dataset metadata with the API; the API expected the schema in a modified AVRO format, so I had to make a few tweaks, but nothing major.
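The registration step amounted to something like the sketch below. The endpoint URL, payload shape, and authentication are hypothetical, since the catalog API was internal.

```python
import json

import requests

# Hypothetical internal catalog endpoint; the real payload shape was catalog-specific.
CATALOG_URL = "https://data-catalog.internal.example.com/api/v1/datasets"

def register_dataset(schema_path: str, metadata: dict, token: str) -> None:
    """Combine a generated AVRO schema with dataset metadata and register it with the catalog."""
    with open(schema_path) as f:
        avro_schema = json.load(f)

    payload = {
        "name": metadata["name"],
        "zone": metadata["zone"],
        "owner": metadata["owner"],
        # The catalog expected a lightly modified AVRO structure rather than the raw .avsc file.
        "schema": {"recordName": avro_schema["name"], "fields": avro_schema["fields"]},
    }
    response = requests.post(
        CATALOG_URL,
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()

register_dataset(
    "orders_curated.avsc",
    {"name": "orders", "zone": "curated", "owner": "data-platform-team"},
    token="...",
)
```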
Dataset Schema Validation
When I was registering datasets, both manually and automatically, I would occasionally get an error saying that my dataset schema was invalid. The problem was that it wouldn’t give me any specific details; it would just say “Invalid Schema” and leave me to my own devices to figure out why it wasn’t being accepted. This led me to create a web application for AVRO schema validation, which would check against the AVRO schema specification as well as the specific rules of our data catalog. By recursively checking the AVRO schema and keeping track of the JSON pointers of the values I was checking, I was able to perform validations with custom error messages and error line numbers without writing an overly complicated parser. I would take this error data and combine it with an in-browser code editor to show the error messages inline, suggest ways to fix the error, and even provide a button to automatically remediate the error if it was an easy fix.
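The core of the validator was a recursive walk over the schema that carries a JSON pointer along with it. The sketch below shows the idea with just a couple of illustrative rules (primitive type names and record names); the real checks cover the full AVRO specification plus our catalog’s rules.

```python
import re

PRIMITIVE_TYPES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
COMPLEX_TYPES = {"record", "enum", "array", "map", "fixed"}
NAME_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def validate(schema, pointer="", errors=None):
    """Return a list of (json_pointer, message) tuples for problems found in the schema."""
    if errors is None:
        errors = []

    if isinstance(schema, str):
        if schema not in PRIMITIVE_TYPES:
            errors.append((pointer, f"unknown type '{schema}'"))
    elif isinstance(schema, list):  # union type: validate each branch
        for i, branch in enumerate(schema):
            validate(branch, f"{pointer}/{i}", errors)
    elif isinstance(schema, dict):
        schema_type = schema.get("type")
        if schema_type == "record":
            name = schema.get("name", "")
            if not NAME_PATTERN.match(name):
                errors.append((f"{pointer}/name", f"invalid record name '{name}'"))
            for i, field in enumerate(schema.get("fields", [])):
                validate(field.get("type"), f"{pointer}/fields/{i}/type", errors)
        elif schema_type not in COMPLEX_TYPES:
            # Only primitives wrapped in a dict are checked here; other complex types are skipped
            # in this simplified sketch.
            validate(schema_type, f"{pointer}/type", errors)
    return errors

errors = validate({"type": "record", "name": "orders", "fields": [{"name": "id", "type": "stirng"}]})
# [('/fields/0/type', "unknown type 'stirng'")]
```

Because each error carries a JSON pointer, the front end can map it back to a position in the schema text and display it inline in the editor.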
Project #3: ETL and Data Quality Testing
This project was a partnership account migration/data conversion. I was involved with the ETL process that would take the source-to-target mapping and implement it in Scala/Spark. Since I joined this project a bit late in the process, my main focus was testing the implementation and output of the ETL process, rather than implementing the ETL itself.
ETL Testing
In ETL testing there are a number of things that you can test, but I was mainly focused on data transformation implementation fidelity, or in other words, whether the data is being transformed according to the mapping logic. There are two main ways to do this: white box and black box testing. I agree with this article that white box testing doesn’t make much sense and seems largely like a waste of time and resources. Additionally, with production data you can only ever cover the data transformation edge cases that happen to be present in it, which may not be all of them. This is in contrast with black box testing, where you can synthesize your input test data, or extract and modify production data, to cover all of the possible transformation cases and ensure that your ETL logic implementation is correct. I suppose you could also create test data to use with the white box methodology, but it doesn’t make sense to me to do all of that work, especially after you’ve already had to recreate the ETL process.
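As a simple illustration of the black box approach, a test can feed handcrafted input rows that cover every transformation case through the real transformation and compare the output against an expected result. The example below uses pandas for brevity; our actual implementation and tests were in Scala/Spark with ScalaTest, and the mapping shown is made up.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Stand-in for the ETL step under test: map a status code to a business-friendly value.
def transform(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["status"] = out["status_code"].map({1: "ACTIVE", 2: "CLOSED"}).fillna("UNKNOWN")
    return out[["account_id", "status"]]

def test_status_mapping_covers_edge_cases():
    # Input rows are chosen to cover every mapping case, including an unexpected code.
    source = pd.DataFrame({"account_id": [1, 2, 3], "status_code": [1, 2, 99]})
    expected = pd.DataFrame({"account_id": [1, 2, 3], "status": ["ACTIVE", "CLOSED", "UNKNOWN"]})
    assert_frame_equal(transform(source), expected)
```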
So, in order to do black box ETL testing, I developed a framework for test data extraction (as opposed to test data generation) to build the test data for the ETL process. I took this route because I could easily drop the extracted data into our ScalaTest environment and quickly validate the ETL implementation locally. For now, it is a partially manual process, but I hope to fully automate it one day. I start by selecting a random entry in the primary table and then use that PK to find its related entries in other tables. Then I download that data as a CSV file and run it through a Python program to mask any data that I wouldn’t want in a test data set, like real names, addresses, or maybe even some numerical amounts. This data then gets packaged up into individual table test data parquets for the specific ETL file (e.g. test_etl_name_table_name.parquet).
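A rough outline of the masking and packaging steps is sketched below; the table names, sensitive columns, and hash-based masking are illustrative stand-ins for the actual rules.

```python
import hashlib

import pandas as pd

def mask_value(value) -> str:
    """Replace a sensitive value with a stable, anonymized placeholder."""
    return "masked_" + hashlib.sha256(str(value).encode()).hexdigest()[:8]

def build_test_data(etl_name: str, table_csvs: dict, sensitive_columns: dict) -> None:
    """Mask sensitive columns in each extracted CSV and package it as a per-table test Parquet file."""
    for table_name, csv_path in table_csvs.items():
        df = pd.read_csv(csv_path)
        for column in sensitive_columns.get(table_name, []):
            df[column] = df[column].map(mask_value)
        df.to_parquet(f"test_{etl_name}_{table_name}.parquet", index=False)

# Example: one primary-table record and its related rows, exported earlier as CSVs.
build_test_data(
    "accounts",
    {"account": "extracts/account.csv", "address": "extracts/address.csv"},
    {"account": ["holder_name"], "address": ["street", "city"]},
)
```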
Data Quality Reporting and Filtering
Something else I was in charge of on this project was data quality checking, both for reporting and filtering. For data quality reporting, I developed my own Python library to generate SQL data quality checking queries from a JSON configuration file. This was so that we could query the data in our data warehouse, or load Parquet/CSV/JSON files into local in-memory SQLite tables and query those. For filtering, we pass our data to downstream teams to load. They call APIs that have their own validations, which sometimes cause the data to fail to load. We decided to incorporate some of those validation checks into our own data processing steps so that we wouldn’t pass bad data downstream. For this, we would run all the data quality checks on the data and keep track of the primary keys (PKs) of the failing records so that we could filter out everything with that PK before passing the data down. I did it this way because of how relational the data was: if data from one table was bad and we withheld only that table’s records, it could compromise the successful load of the other tables.
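A simplified sketch of the query generation idea is below; the configuration shape and the specific check types are assumptions for illustration, not the library’s actual format.

```python
import json

# Hypothetical config shape: one entry per check, naming the table, column, and check type.
CONFIG = json.loads("""
[
  {"table": "accounts", "column": "account_id", "check": "not_null"},
  {"table": "accounts", "column": "account_id", "check": "unique"},
  {"table": "balances", "column": "amount", "check": "non_negative"}
]
""")

# One SQL template per supported check type.
TEMPLATES = {
    "not_null": "SELECT COUNT(*) AS failures FROM {table} WHERE {column} IS NULL",
    "unique": (
        "SELECT COUNT(*) AS failures FROM "
        "(SELECT {column} FROM {table} GROUP BY {column} HAVING COUNT(*) > 1) AS dupes"
    ),
    "non_negative": "SELECT COUNT(*) AS failures FROM {table} WHERE {column} < 0",
}

def generate_queries(config: list) -> list:
    """Render one data quality SQL query per configured check."""
    return [
        TEMPLATES[item["check"]].format(table=item["table"], column=item["column"])
        for item in config
    ]

for query in generate_queries(CONFIG):
    print(query)
```

The generated queries can then be run against the warehouse for reporting, or against local tables when checking files before they are passed downstream.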