Extract, Transform and Load (ETL) Data Phase
After data has been gathered successfully from the different sources, the data team (data scientists and data engineers) must extract it from those sources (SQL or NoSQL servers, flat files, email, web pages, online forms). This step takes place immediately after data collection. The collected data is usually unstructured. Data extraction is always organization-specific, since the data team must select the exact columns and rows that can be used to answer the business questions.
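The idea of extracting only the relevant columns from heterogeneous sources can be sketched as follows. This is a minimal illustration, assuming an in-memory SQLite database standing in for a SQL server and a CSV string standing in for a flat file; the table and column names (`orders`, `id`, `customer`, `amount`) are hypothetical.

```python
import sqlite3
import csv
import io

# Hypothetical SQL source: an in-memory SQLite database with one table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL, internal_note TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "Acme", 120.0, "x"), (2, "Globex", 75.5, "y")],
)

# Hypothetical flat-file source: a CSV with the same business columns.
flat_file = io.StringIO("id,customer,amount\n3,Initech,40.0\n")

# Extract only the columns relevant to the business question and
# normalise both sources into one common list of records.
rows = [
    {"id": r[0], "customer": r[1], "amount": r[2]}
    for r in conn.execute("SELECT id, customer, amount FROM orders")
]
rows += [
    {"id": int(r["id"]), "customer": r["customer"], "amount": float(r["amount"])}
    for r in csv.DictReader(flat_file)
]

print(len(rows))  # 3 records extracted across both sources
```

Note that the irrelevant `internal_note` column is deliberately left behind at the source; extraction is where the team scopes the data down to what the business questions actually need.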
The next step in the ETL data phase is data transformation. The extracted data is usually raw and unusable in its original state; it must undergo cleansing, formatting, and validation to be fit for data analysis and/or data visualization. This step usually involves the tasks below:
- Removing duplicates and/or irrelevant data points.
- Handling missing/null/NA values to ensure the consistency of the dataset.
- Performing joins, merges, or appends on relevant data columns to create new fields and subsets that will improve the analytics phase.
- Checking the validity and accuracy of the dataset to ensure standardization, using lookup tables, the organization's checklist (is this dataset fit for use?), and other procedures.
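The transformation tasks above can be sketched in plain Python. This is a minimal, illustrative pipeline: the record fields (`id`, `country`, `amount`), the imputation policy (filling a missing amount with 0.0), and the country lookup table are all assumptions for the sake of the example.

```python
# Hypothetical extracted records; values are illustrative only.
raw = [
    {"id": 1, "country": "NG", "amount": 120.0},
    {"id": 1, "country": "NG", "amount": 120.0},  # exact duplicate
    {"id": 2, "country": "gh", "amount": None},   # missing amount
    {"id": 3, "country": "XX", "amount": 40.0},   # invalid country code
]

# 1. Remove duplicates, keyed on the full record.
seen, deduped = set(), []
for rec in raw:
    key = (rec["id"], rec["country"], rec["amount"])
    if key not in seen:
        seen.add(key)
        deduped.append(rec)

# 2. Handle missing/null values; imputing 0.0 is one possible policy.
for rec in deduped:
    if rec["amount"] is None:
        rec["amount"] = 0.0

# 3. Validate and standardise against a lookup table: records whose
#    country code is not in the table fail the "fit for use" check.
country_lookup = {"NG": "Nigeria", "GH": "Ghana"}
clean = []
for rec in deduped:
    code = rec["country"].upper()
    if code in country_lookup:
        rec["country"] = country_lookup[code]  # derived, standardised field
        clean.append(rec)

print(clean)
```

Here the duplicate is dropped, the missing amount is imputed, and the record with the unknown code `XX` is rejected by the lookup-table check, leaving two clean, standardised records.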
The final step in the ETL data phase involves data encryption and data storage. The data volume associated with a mid-size to large organization is usually huge, ranging from gigabytes to exabytes. Such volumes call for a process in which the data is stored on an appropriate medium with an appropriate level of security.
The appropriate storage medium could be a SQL or NoSQL server, a data lake, or a data warehouse. The data team must recommend the medium that best fits the organization's data governance. They must also consider the data loading technique (initial, incremental, or full refresh) in order to optimize performance.
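The three loading techniques can be contrasted with a small sketch, assuming a SQLite target table (`sales`) as a stand-in for the warehouse; the table, columns, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

# Initial load: the full extracted dataset is written once, up front.
initial = [(1, 100.0), (2, 200.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", initial)

# Incremental load: later runs deliver only new or changed rows; an
# upsert applies the delta without reprocessing the whole table.
delta = [(2, 250.0), (3, 50.0)]  # id 2 changed, id 3 is new
conn.executemany(
    "INSERT INTO sales VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
    delta,
)

# Full refresh would instead truncate and reload everything:
# conn.execute("DELETE FROM sales"); conn.executemany(..., full_dataset)

rows = conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()
print(rows)  # [(1, 100.0), (2, 250.0), (3, 50.0)]
```

Incremental loading is usually the performance optimization referred to above: it moves only the delta, at the cost of needing a reliable key (here `id`) to detect new and changed rows.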