Create a pipeline with CDAP
The Data Engineer defines, develops, implements, and maintains the tools and infrastructure that Data Science teams need for data analysis. They build solutions that can process large volumes of data while guaranteeing its security, and they represent the first link in the data processing chain: in short, their job is to make data accessible to everyone. To do that, a data engineer must master programming languages such as Java, Python, or Scala, as well as the Hadoop ecosystem and databases. With those skills, they can set up complex pipelines for the extraction, transformation, and storage of data. Beyond that, the data engineer also has to deal with infrastructure and integration to get a functional pipeline. Fortunately, there is a solution that lets them focus solely on building the pipeline rather than worrying about integration and infrastructure.
So what is this solution that will simplify the creation of pipelines for us data engineers?
It is CDAP, the Cask Data Application Platform.
CDAP is an open-source application development platform integrated into the Hadoop ecosystem which provides developers with data abstractions and applications to simplify and accelerate application development, process a greater number of use cases in real-time and in batch mode, and deploy applications in production while satisfying the needs of the company.
Indeed, it is a software layer that runs on Hadoop distributions such as Cloudera Data Hub and Hortonworks Data Platform. It accelerates big data processing thanks to standardized APIs, models, and visual interfaces, and it allows people who are not developers (marketing agents, sales agents, BI analysts, etc.) to easily manipulate and process data without writing a line of code.
It provides essential functionality for any Big Data application:
- Data collection: ETL
- Data exploration: running queries to discover the structure of the data
- Data processing: MapReduce, Spark
- Data storage: datasets
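To make these four roles concrete, here is what the same collect-explore-process-store flow looks like in a few lines of plain Python. This is only an illustrative sketch: the column names and the sample records are invented, and in the real pipeline each step is handled by a CDAP plugin rather than hand-written code.

```python
import csv
import io
from datetime import datetime

# Collection: in CDAP this would be an S3 source plugin; here we fake a CSV payload.
raw = "caller,duration,call_date\nalice,120,2019-05-01\nbob,45,2019-05-02\n"

# Exploration: inspect the structure (column names) before processing.
reader = csv.DictReader(io.StringIO(raw))
print(reader.fieldnames)  # ['caller', 'duration', 'call_date']

# Processing: cast types, the kind of job Wrangler and DateTransform do in the pipeline.
records = [
    {"caller": row["caller"],
     "duration": int(row["duration"]),
     "call_date": datetime.strptime(row["call_date"], "%Y-%m-%d").date()}
    for row in reader
]

# Storage: in CDAP this would be an S3 sink or a dataset; here we just keep a list.
print(records[0]["duration"])  # 120
```

The point of CDAP is precisely that you do not write this glue code yourself: each step above maps to a plugin you configure visually.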
A CDAP application is nothing more than a combination of these components:
- Plugins for data collection;
- MapReduce or Spark programs for real-time or batch data processing;
- Datasets for data storage.
To illustrate the simplicity and efficiency of CDAP, we will create a simple pipeline that retrieves data from S3, processes it, and stores it back on S3.
To set up this pipeline, we will use the S3 source, Wrangler, DateTransform, and S3 sink plugins.
S3 Source, S3 Sink: These plugins allow you to connect to S3 to retrieve or store data.
Wrangler: It is one of the most used plugins on CDAP because it allows data processing in several formats (text, JSON, CSV, Avro, …).
DateTransform: It is used to parse a field (String, timestamp) into a Date type.
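Under the hood, a CDAP pipeline is stored as a JSON document describing its stages and the connections between them. The sketch below shows roughly what our four-plugin topology looks like once assembled; the stage names and plugin properties are illustrative, not the exact configuration you will see in your instance.

```json
{
  "name": "s3-cdr-pipeline",
  "config": {
    "stages": [
      { "name": "S3Source",      "plugin": { "name": "S3",            "type": "batchsource" } },
      { "name": "Wrangler",      "plugin": { "name": "Wrangler",      "type": "transform" } },
      { "name": "DateTransform", "plugin": { "name": "DateTransform", "type": "transform" } },
      { "name": "S3Sink",        "plugin": { "name": "S3",            "type": "batchsink" } }
    ],
    "connections": [
      { "from": "S3Source",      "to": "Wrangler" },
      { "from": "Wrangler",      "to": "DateTransform" },
      { "from": "DateTransform", "to": "S3Sink" }
    ]
  }
}
```

You never have to write this JSON by hand: the Pipeline Studio generates it as you drag and connect the plugins.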
Configuration of the pipeline
Click on properties on each plugin and make the following configurations:
All the processing can be done graphically, but to avoid overloading the article, I will provide the code to paste into the Wrangler terminal.
Click on Properties, then on Wrangler, and paste the code found at the link below into the terminal:
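To give you an idea of what that code looks like, a Wrangler recipe is simply a list of directives, one per line. The directives below are only an example of the style (the column names here are invented); the actual recipe for our dataset is the one in the linked file.

```
parse-as-csv :body ',' true
drop :body
rename :caller_id :caller
set-type :duration integer
```

Each directive transforms the records in place, so you can preview the effect of every line as you build the recipe.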
Click on Apply to finish.
Then click on Preview, then Run. If everything goes well, we can move on to deployment by clicking on Preview again and then Deploy.
NB: If you don’t have an Amazon account, you can use the CDAP Table or File plugin as a sink instead.
NB: The data are in the following public bucket: s3://cdaptest2019/cdr_data/