Combining Apache Spark, XGBoost, MLeap and Play! framework to predict customer churn in telecommunications companies

Cancellation of services by existing customers is a phenomenon that is omnipresent in the business world and particularly common in highly competitive economic environments.

The telecommunications sector is in constant turmoil because it is so attractive. Users are less and less willing to put up with a non-functional service; they easily walk away and subscribe to a competitor.

From there, we understand the extent of customer churn in telecommunications companies.

A team of data enthusiasts from BAAMTU has implemented a solution that can reduce churn by combining Big Data and Machine Learning techniques.

But what exactly is churn?

Customer churn can be defined as the process by which a customer permanently suspends the use of services to which he or she has previously subscribed. This suspension may also be synonymous with non-use for a given period of time.

The marketing strategies traditionally used by telecommunications companies to manage their customers and prevent churn usually fall into three categories:

  1. acquiring new customers
  2. selling incentives to existing customers
  3. retaining customers

Each of these techniques comes at a certain cost for the company. Acquisition plays a very important role in the marketing strategies of telecommunications companies, and most of them focus solely on it.

This is far from optimal: as long as acquisition continues to monopolize companies’ marketing efforts, customer retention will continue to suffer and the churn rate will continue to increase.


What about churn rate?

Churn rate measures the share of a company’s customers who stop using its services over a given period. Measuring it is very important because its fluctuations tell us about the “state of health”, so to speak, of the business.

When it is not controlled, the churn rate significantly undermines the acquisition efforts made by the marketing unit. Hence the need to put in place an effective retention policy.
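To make the definition concrete, here is a minimal sketch in plain Scala. It assumes one common convention, namely customers lost during the period divided by customers present at the start of the period. The names `ChurnRate`, `customersAtStart` and `customersLost` are illustrative and are not taken from our application.

```scala
object ChurnRate {
  /** Share of the customers present at the start of a period who left during it. */
  def rate(customersAtStart: Int, customersLost: Int): Double = {
    require(customersAtStart > 0, "customersAtStart must be positive")
    require(customersLost >= 0 && customersLost <= customersAtStart,
      "customersLost must be between 0 and customersAtStart")
    customersLost.toDouble / customersAtStart
  }
}
```

With 2000 customers at the start of a month and 50 of them lost, `ChurnRate.rate(2000, 50)` corresponds to a monthly churn rate of 2.5%.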


But what to do when you have an overwhelming number of customers?

Telecommunications companies often have a very large number of customers to manage and therefore cannot afford to focus on each of them in an isolated manner.

But imagine if they knew that a given customer is very likely to leave. This would reduce the marketing effort spent on retention by redirecting it to the customers who are actually at risk.



The idea is to set up a model that predicts churn, in other words a tool that can tell us, with a certain appetence score, that a given customer will churn. The objective is to allow a manager to make these predictions interactively with relatively low latency and to have a visual stratification of customers according to their appetence scores.

The predictive model will be based on ‘historical’ data. By historical data, we mean data from customers who have stopped using the services as well as from customers who have stayed, since supervised learning needs examples of both outcomes.

In the following part we will discuss the tools necessary to achieve our goal.



Apache Spark

In terms of customer knowledge, the more heterogeneous the data describing them is, the better it is for us!

Thanks to the distributed paradigm, storing and processing large and varied datasets is no longer an obstacle.


Apache Spark is a large-scale data processing engine that offers several processing options, including batch, real-time, graph and machine learning processing.

Spark’s machine learning library, MLlib, supports several classes of algorithms (classification, regression, dimensionality reduction) and a large number of tools for data pre-processing.

For the creation of our model, we use the XGBoost library through its JVM packages, combined with Apache Spark. This speeds up training, and the choice of parameters through cross-validation is also done in a distributed way.

We use Spark through its Scala API.




XGBoost

XGBoost, or eXtreme Gradient Boosting, is an algorithm based on gradient boosting of decision trees. Boosting, in contrast to bagging, builds a given number of decision trees sequentially, each of which corrects the errors made by the trees built before it.

Models are added sequentially until no improvements can be made. The final prediction is obtained by summing the predictions made by the different decision trees.

The term gradient boosting refers to the use of the gradient descent algorithm to minimize loss or error when adding new models.
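The sequential idea can be illustrated with a small sketch in plain Scala. It is not how XGBoost is implemented (XGBoost adds regularization, second-order information and many optimizations); it only shows boosting with depth-1 trees ("stumps") on a single feature and a squared-error loss, for which the negative gradient is simply the residual. All names here (`BoostingSketch`, `Stump`, `fit`, `predict`) are made up for the illustration.

```scala
object BoostingSketch {
  // A depth-1 regression tree ("stump") on a single numeric feature
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(v: Seq[Double]): Double = if (v.isEmpty) 0.0 else v.sum / v.size

  // Fit the stump that minimises the squared error against the current residuals
  def fitStump(xs: Seq[Double], residuals: Seq[Double]): Stump = {
    require(xs.nonEmpty && xs.size == residuals.size, "need matching, non-empty inputs")
    val pairs = xs.zip(residuals)
    val candidates = xs.distinct.map { t =>
      val (l, r) = pairs.partition(_._1 <= t)
      val stump = Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
      val error = pairs.map { case (x, res) =>
        val d = res - stump.predict(x)
        d * d
      }.sum
      (stump, error)
    }
    candidates.minBy(_._2)._1
  }

  // Add stumps one after another; each one is fitted to what the ensemble still gets wrong
  def fit(xs: Seq[Double], ys: Seq[Double], rounds: Int, eta: Double): Seq[Stump] =
    (1 to rounds).foldLeft(Vector.empty[Stump]) { (ensemble, _) =>
      val residuals = xs.zip(ys).map { case (x, y) => y - predict(ensemble, x, eta) }
      ensemble :+ fitStump(xs, residuals)
    }

  // Final prediction: the sum of the individual tree predictions, shrunk by the learning rate
  def predict(ensemble: Seq[Stump], x: Double, eta: Double): Double =
    ensemble.map(_.predict(x)).sum * eta
}
```

Here `eta` plays the role of the learning rate: each new tree only corrects a fraction of the remaining error, which is why several rounds are needed even on a trivially separable dataset.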




MLeap

We often need to provide the end user with the ability to interact with our model, whether through a web application, a mobile application or APIs.

To do this, the trained model needs to be usable on all types of platforms. This is why the MLeap project was created. MLeap is a probable successor to PMML, the XML-based export format for machine learning models. MLeap serializes models built with Spark MLlib, Scikit-learn or TensorFlow to JSON or protobuf and exports them in a single format called a Bundle, which can be used in particular on platforms running the JVM (Java, Scala, …).

The models can then be used independently of the platform they were trained on (Spark, TensorFlow, Scikit-learn, …), with the MLeap Runtime as the only dependency.

We use MLeap through its Scala API.



Play Framework

Play is an MVC framework for creating web applications.

Play supports the Java and Scala programming languages.

Better yet, Play! comes packaged with its own server: Akka HTTP by default since version 2.6.x, and Netty in earlier versions. There is therefore no need to configure a web server in the development or production environment.

In addition, it offers an asynchronous engine based on Akka.

In our case, we use Play through its Scala API to set up a web application through which a manager can send prediction requests for their customers and view the results.



Scala

As you may have noticed, we love Scala. Why? Well, it’s quite simple.

I think the following phrase best describes my feelings towards the language:


It’s functional, it’s object-oriented, it’s…, it’s everything you needed and more!


As you will have understood, Scala is object-oriented in the sense that every value is an object, and functional in the sense that every function is a value. Scala has a strong static type system and also offers type inference.


Description of the Data used

As mentioned above, we are implementing a predictive model of customer churn, i.e., a model that predicts how likely a given customer is to stop using the company’s services.

The model is the result of supervised learning, more precisely binary classification, and therefore requires labelled data (data in which the variable to be predicted is known).

The following table describes the information present in the data:



Variable                Description
Account Length          Customer’s seniority with the company
VMail Message           Total number of the customer’s voice messages
Day Minutes             Total minutes of the customer’s usage during the day
Eve Minutes             Total minutes of the customer’s usage during the evening
Night Minutes           Total minutes of the customer’s usage during the night
International Minutes   Total minutes of the customer’s international usage
Customer Service Calls  Total number of calls from the customer to customer service
Churn                   Target variable to predict: whether or not the customer churns
International Plan      Whether the customer has the international plan
Vmail Plan              Whether the customer has a voicemail plan
Day Calls               Total number of the customer’s calls during the day
Day Charge              Total amount charged for the customer’s usage during the day
Eve Calls               Total number of the customer’s calls during the evening
Eve Charge              Total amount charged for the customer’s usage during the evening
Night Calls             Total number of the customer’s calls during the night
Night Charge            Total amount charged for the customer’s usage during the night
International Calls     Total number of the customer’s international calls
International Charge    Total amount charged for the customer’s international usage
State                   Customer’s state of origin
Area Code               Area code
Phone                   Customer’s telephone number


These data include a few categories of customer data commonly used in telecommunications companies, namely:


  1. Usage data: any data related to the customer referring to his use of services, calls, voice messages, internet services…
  2. Interaction data: informing about the customer’s interactions with services such as customer service, call centers, etc.
  3. Context data: any other data that may characterize the customer, such as personal data


Functional architecture

It can be noted that contextual data such as state, area code and phone are not very significant in the problem we are trying to solve. These variables were therefore excluded from the process.

So we will no longer have to deal with categorical features in the process of building our model.

Tools needed to build a model with Spark for Machine learning and artificial intelligence

As said earlier, Spark provides a set of tools to deal with the different tasks involved in a model building process:

  1. Transformers: scale, convert or modify features;
  2. Extractors: extract features from raw data;
  3. Selectors: select a subset of features from a larger set of features; and
  4. Estimators: abstract the concept of a learning algorithm that is fitted on data.


Our first step consists of some exploratory tasks on the data: descriptive statistics on the numerical columns, a study of correlation and intercorrelation, and the predictive power of each variable. These tasks aim to provide a good understanding of the data we are given.

While preprocessing, we bucketize our continuous features. Spark ML’s Bucketizer transformer transforms a column of continuous features into a column of feature buckets, where the buckets are specified by the user.
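To show what bucketizing does to a single value, here is a sketch in plain Scala that follows the convention documented for Spark’s Bucketizer: n+1 strictly increasing split points define n buckets, each bucket covers [lower, upper), and the last bucket also includes its upper bound. It is not Spark’s implementation, and `BucketSketch` as well as the split points used in the example are hypothetical.

```scala
object BucketSketch {
  /** Index of the bucket containing `value`, for strictly increasing `splits`. */
  def bucketize(splits: Array[Double], value: Double): Int = {
    require(splits.length >= 2, "need at least two split points")
    require(splits.sliding(2).forall(p => p(0) < p(1)), "splits must be strictly increasing")
    require(value >= splits.head && value <= splits.last, "value is outside the splits")
    if (value == splits.last) splits.length - 2 // the last bucket includes its upper bound
    else splits.lastIndexWhere(_ <= value)      // bucket i covers [splits(i), splits(i + 1))
  }
}
```

With the made-up splits `Array(Double.NegativeInfinity, 100.0, 200.0, Double.PositiveInfinity)` for Day Minutes, a customer with 150 minutes falls into bucket 1, and one with exactly 200 minutes into bucket 2.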

Finally, we combine all our features into a single vector using Spark ML’s VectorAssembler transformer. This allows us to perform dimensionality reduction using Spark ML’s PCA transformer.

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
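For intuition, the following plain-Scala sketch computes the first principal component of two-dimensional data, using the closed-form eigen-decomposition of the 2x2 covariance matrix. It is only an illustration under that two-feature assumption; in our pipeline the reduction is done by Spark ML’s PCA on the assembled feature vector, and the name `PcaSketch` is hypothetical.

```scala
object PcaSketch {
  /** First principal axis (unit vector) of 2-D points and the share of variance it explains. */
  def firstComponent(points: Seq[(Double, Double)]): ((Double, Double), Double) = {
    require(points.size >= 2, "need at least two points")
    val n = points.size
    val mx = points.map(_._1).sum / n
    val my = points.map(_._2).sum / n
    // Entries of the 2x2 sample covariance matrix [[a, b], [b, c]]
    val a = points.map(p => (p._1 - mx) * (p._1 - mx)).sum / (n - 1)
    val c = points.map(p => (p._2 - my) * (p._2 - my)).sum / (n - 1)
    val b = points.map(p => (p._1 - mx) * (p._2 - my)).sum / (n - 1)
    require(a + c > 0.0, "the points must not all be identical")
    // Largest eigenvalue of a symmetric 2x2 matrix, in closed form
    val halfDiff = (a - c) / 2
    val lambda1 = (a + c) / 2 + math.sqrt(halfDiff * halfDiff + b * b)
    // Matching eigenvector; when b == 0 the two features are already uncorrelated
    val (vx, vy) =
      if (b != 0.0) (b, lambda1 - a)
      else if (a >= c) (1.0, 0.0)
      else (0.0, 1.0)
    val norm = math.hypot(vx, vy)
    ((vx / norm, vy / norm), lambda1 / (a + c))
  }
}
```

When two features are perfectly correlated, for instance points lying on the line y = 2x, a single component carries all the variance, so one of the two columns is redundant.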


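import org.apache.spark.ml.feature.{PCA, VectorAssembler}

// Merge all feature columns into a single vector column (column settings omitted here)
val assembler = new VectorAssembler("Churn_Assembler")
stages :+= assembler

// Project the assembled vector onto its principal components (settings omitted here)
val reducer = new PCA("Churn_PCA")
stages :+= reducer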


The estimator used here is the XGBoostClassifier class of the XGBoost Spark package. It is given the parameters selected by cross-validation. For an explanation of the meaning of these parameters and more, see the XGBoost website.


import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// The booster_* values are the hyperparameters selected by cross-validation
val xgbParam = Map(
  "eta" -> booster_eta,
  "max_depth" -> booster_max_depth,
  "objective" -> booster_objective,
  "num_round" -> booster_num_round,
  "num_workers" -> booster_num_workers,
  "min_child_weight" -> booster_min_child_weight,
  "gamma" -> booster_gamma,
  "alpha" -> booster_alpha,
  "lambda" -> booster_lambda,
  "subsample" -> booster_subsample,
  "colsample_bytree" -> booster_colsample_bytree,
  "scale_pos_weight" -> booster_scale_pos_weight,
  "base_score" -> booster_base_score
)

val xgbClassifier = new XGBoostClassifier(xgbParam)



The classifier is then added to our stages to form the final pipeline:

stages :+= xgbClassifier

val pipeline = new Pipeline("Churn_Pipeline").setStages(stages)

The model is obtained by fitting this pipeline on the training data. The resulting model is exported as an MLeap Bundle in protobuf format:

implicit val sbc: SparkBundleContext = SparkBundleContext()

(for (bundle <- managed(BundleFile("jar:file:/tmp/"))) yield {

The exported model will be used in our web application to perform churn prediction for new customers online.

val dirBundle: Transformer = (for (bundle <- managed(BundleFile("jar:" + cwd + "conf/resources/"))) yield {

val transformedLeapFrame = dirBundle.transform(leapFrame).get



Eventually, we will need to add customer information for scoring. Customers can be added individually, or several at a time by uploading a file. As a reminder, since the model is the result of supervised learning, the data given to it must contain the same variables, with the same names, as those used during training, as shown in the screenshot below.

[Screenshot: information on the customers to score]
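The requirement that scoring data carry the same variables as the training data can be checked before a record is sent to the model. The sketch below is one possible way to do it in plain Scala; `InputCheck`, `validate` and the three feature names listed are assumptions made for the example (a hypothetical subset of the variables described earlier), not the code of our web application.

```scala
object InputCheck {
  // Hypothetical subset of the variables used during training
  val trainingFeatures: Seq[String] =
    Seq("Account Length", "Day Minutes", "Customer Service Calls")

  /** Accepts a record only if it has exactly the variables seen during training. */
  def validate(record: Map[String, Double]): Either[String, Map[String, Double]] = {
    val missing = trainingFeatures.filterNot(name => record.contains(name))
    val unexpected = record.keys.filterNot(name => trainingFeatures.contains(name)).toSeq.sorted
    if (missing.isEmpty && unexpected.isEmpty) Right(record)
    else {
      val missingText = missing.mkString(", ")
      val unexpectedText = unexpected.mkString(", ")
      Left("missing: [" + missingText + "]; unexpected: [" + unexpectedText + "]")
    }
  }
}
```

Returning an `Either` lets the caller show the manager which columns of an uploaded file are wrong instead of failing at prediction time.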


The next step after adding clients is the actual prediction using the previously exported model.

[Screenshot: output of the model predicting customer churn]

The data shown below are output by the model. They give the probability that the customer belongs to each of the two classes, in other words, the customer’s appetence score for being a “churner” or not.

Depending on these values, the customer is predicted to be a potential “churner” or not with respect to the chosen threshold. The threshold is a limit value defined by the company.

[Screenshot: score your customers to carry out surgical marketing actions]
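The decision rule described above can be sketched in a few lines of plain Scala. The names `ChurnDecision`, `Prediction` and `isPotentialChurner` are illustrative, and treating a probability equal to the threshold as a churner is an assumption of this sketch.

```scala
object ChurnDecision {
  // Probabilities output by the model for the two classes
  final case class Prediction(stayProbability: Double, churnProbability: Double)

  /** Flags a customer as a potential churner when the churn probability reaches the threshold. */
  def isPotentialChurner(p: Prediction, threshold: Double): Boolean = {
    require(threshold >= 0.0 && threshold <= 1.0, "threshold must be in [0, 1]")
    p.churnProbability >= threshold
  }
}
```

Lowering the threshold flags more customers, which catches more real churners at the price of more retention offers sent to customers who would have stayed anyway.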

These predictions can eventually be stored in a database for future use.

Depending on their churning appetence scores, customers can be classified as “high risk”, “medium risk” or “low risk”.

[Figure: what is the probability that a customer will churn?]
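The three risk levels can be derived from the score with two cut-off values. The sketch below is in plain Scala; the cut-offs 0.4 and 0.7 are hypothetical defaults chosen for the example, since in practice they are a business decision, and the names `RiskSegmentation` and `segment` are illustrative.

```scala
object RiskSegmentation {
  sealed trait Risk
  case object High extends Risk
  case object Medium extends Risk
  case object Low extends Risk

  /** Maps a churn score in [0, 1] to a risk level, using two hypothetical cut-offs. */
  def segment(churnScore: Double, mediumFrom: Double = 0.4, highFrom: Double = 0.7): Risk = {
    require(churnScore >= 0.0 && churnScore <= 1.0, "score must be in [0, 1]")
    require(mediumFrom <= highFrom, "mediumFrom must not exceed highFrom")
    if (churnScore >= highFrom) High
    else if (churnScore >= mediumFrom) Medium
    else Low
  }
}
```

Grouping a list of scored customers with `customers.groupBy(c => RiskSegmentation.segment(c.score))` (assuming each customer record has a `score` field) would then give the marketing team one list per risk level.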

This segmentation can help redirect retention marketing efforts more accurately to targeted customers, such as sending targeted messages:

[Figure: adjust your marketing investment by scoring your customers]



The phenomenon of churn is omnipresent in telecommunications, due in particular to the high level of competitiveness that prevails there. Thinking about this phenomenon in a different way by advocating retention could be very beneficial. Through prediction, the use of machine learning techniques coupled with Big Data can help redirect marketing strategies aimed at retention to the right customers.

On another level, considering that a customer is influenced by his peers (other customers with whom he often communicates), if many peers have churned, it is very likely that he will follow.

Considering information that is no longer limited to the customer’s interaction with the company’s services, but also includes their interaction with their ‘network’, can help increase the performance and accuracy of a model that predicts customer churn.
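As a simple illustration of such a network-based variable, one could compute, for each customer, the fraction of their frequent contacts who have already churned. This is only a sketch of the idea in plain Scala; it is not part of the model presented here, and `PeerChurn` and `churnedPeerRatio` are hypothetical names.

```scala
object PeerChurn {
  /** Fraction of a customer's contacts who have already churned (0.0 when there are no contacts). */
  def churnedPeerRatio(contacts: Set[String], churned: Set[String]): Double =
    if (contacts.isEmpty) 0.0
    else contacts.count(c => churned.contains(c)).toDouble / contacts.size
}
```

This ratio could then be added as one more numeric feature alongside the usage and interaction data.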


Awa Thiam
