In data science, algorithms are used to solve problems. The biggest challenge is breaking a problem down into the set of rules or procedures that best describes it; once this is achieved, the solution is attainable. For example, suppose there is a need to sort a list of students’ matriculation numbers. With knowledge of the different sorting algorithms (alphanumeric comparison, binary transformation, etc.), the matriculation numbers become the input to the procedure and the ordered list the output, as shown in the figure below:
(u2022/005, u2022/004, u2022/090) → Sorting Model → (u2022/004, u2022/005, u2022/090)
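The sorting step above can be sketched in a few lines of Python. The matriculation numbers are the sample values from the figure, and Python’s built-in alphanumeric ordering stands in for whichever sorting algorithm the task manager chooses:

```python
# Sort matriculation numbers alphanumerically: the unordered list is the
# input to the procedure, the ordered list is the output.
matric_numbers = ["u2022/005", "u2022/004", "u2022/090"]
ordered = sorted(matric_numbers)
print(ordered)  # ['u2022/004', 'u2022/005', 'u2022/090']
```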
Given that different methods/algorithms are available to carry out this task, the task manager must choose the algorithm that offers the best performance in terms of processing speed, accuracy, deviation rate and other key performance indicators (KPIs).
While this task can be done manually with the chosen algorithm, we can use machine learning algorithms instead to save time, maintain high performance and gain other benefits. In a learning algorithm, we take sample data x1, x2, x3, x4, x5, feed it to the system and train the system to produce a template usually called a model (Y). The model is not a perfect function, but it should be a good approximation of the perfect function.
(x1, x2, x3, x4, x5) → Training Process → Model Y
This model Y is expected to capture the transformation process or prediction sequence. In the future, the task manager will feed a new set of data xa, xb, xc, xd, xe (also called test data or production data) into the model Y to produce an approximation of the expected result (e.g. a price forecast, or a grouping of items based on a defined feature), as shown in the figure below:
(xa, xb, xc, xd, xe) → Testing/Production Process → Model Y → Output
This process is called “Machine Learning”. As the model undergoes tuning, its performance usually improves, which in turn improves the task manager’s knowledge.
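The train-then-predict workflow described above can be sketched as follows. The sample data and the one-parameter model y = k·x are invented for illustration; a real pipeline would typically use a library such as scikit-learn:

```python
# A minimal sketch of the machine-learning workflow: train on sample data
# x1..x5, obtain a model Y, then feed it new (test/production) inputs.
# Data values here are invented, noisy observations of y = 2x.

def train(xs, ys):
    """Fit a one-parameter model y = k*x by least squares."""
    k = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: k * x  # the learned model Y

# Training phase: sample data x1..x5 with known outputs
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]
model = train(xs, ys)

# Testing/production phase: new, unseen inputs
for x_new in [6, 7, 8]:
    print(round(model(x_new), 1))
```

The model is only an approximation of the perfect function y = 2x, which is exactly the point made above: a good approximation, not a perfect one.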
Machine Learning is simply giving computers the ability to learn without being explicitly programmed (Arthur Samuel, 1959).
Machine Learning Algorithms
Machine learning algorithms can be classified into three categories: Supervised Learning, Unsupervised Learning, and Semi-supervised Learning.
In supervised learning, the model is trained on data with known classifications/labels and, through this supervision, attempts to learn the patterns and trends hidden within the training dataset x1, x2, x3, x4, x5. This approach is used for prediction and decision-making.
In unsupervised learning, the model attempts to find information, trends or patterns in the training dataset without any clues. Unsupervised learning usually produces clusters (collections of data points grouped by a common feature or characteristic). This approach is used for cluster analysis.
Semi-supervised learning is a mix of supervised and unsupervised learning. In this scenario, the system trains on a dataset in which one portion x1, x2, x3, x4, x5 has sufficient clues (labels) and the other portion is unlabelled. The system is expected to develop the model from the labelled portion and then classify the unlabelled portion. This kind of learning is used in speech recognition and natural language processing applications.
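A toy sketch of the semi-supervised scenario, with invented one-dimensional data and class names: the labelled portion trains a simple nearest-centroid rule, which then classifies the unlabelled portion:

```python
# Semi-supervised sketch: labelled points x1..x5 train a nearest-centroid
# rule that classifies the unlabelled points. All values are invented.

labeled = [(1.0, "low"), (1.2, "low"), (0.9, "low"), (5.1, "high"), (4.8, "high")]
unlabeled = [1.1, 5.0, 4.9]

# Compute one centroid per class from the labelled portion
groups = {}
for value, label in labeled:
    groups.setdefault(label, []).append(value)
centroids = {label: sum(vs) / len(vs) for label, vs in groups.items()}

# Assign each unlabelled point to the class with the nearest centroid
predicted = [min(centroids, key=lambda c: abs(centroids[c] - x)) for x in unlabeled]
print(predicted)  # ['low', 'high', 'high']
```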
In the next section, I will discuss different aspects of machine learning and their industrial applications.
Regression

Regression is a supervised machine learning method used to predict values (continuous or discrete). In this approach, the model attempts to build a relationship between the target, usually denoted y, and the system inputs x1, x2, x3, x4, x5. The main aim of this supervised learning method is to generate a constant K that best defines the relationship between the input and target variables. This constant K can be expressed as a straight line, curve or geometric shape. The basic regression technique used in most business entities is linear regression. Applying linear regression to a set of input data (x1) to predict y, the algorithm constructs a model that is expressed in equation 1.1:

y = Kx1 + b + e ———– 1.1
where K is the constant that defines the relationship between the input x1 and the output y,
b is the intercept, and
e is the error that occurs while predicting y with its corresponding value of x1.
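Estimating K and b from data can be sketched with ordinary least squares. The data points below are invented for illustration and roughly follow y = 2x + 1:

```python
# Simple linear regression: estimate the slope K and intercept b of
# y = K*x + b by ordinary least squares. Data values are invented.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.2, 5.1, 6.8, 9.1, 10.9]   # roughly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# K = covariance(x, y) / variance(x)
K = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - K * mean_x

print(round(K, 2), round(b, 2))  # 1.94 1.2
```

The residuals e = y − (Kx + b) are exactly the errors that the assumptions below require to be normally distributed with mean 0 and constant variance.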
Some assumptions must be observed while running this algorithm on a set of data. They are:
- There must be a linear relationship between the input data and the target
- There should be zero outliers (anomalies) in the input data
- There should be low or zero multicollinearity as well as autocorrelation
- The input data points should be independent.
- For a perfect prediction to be attained, the errors at all instances should be normally distributed with mean 0 and constant variance.
In a case where the input dataset has more than one parameter x1, x2, x3, ..., xn to predict y, the algorithm develops a model that can be expressed by equation 1.2:
y = Kx1 + Lx2 + ⋯ + Zxn ———– 1.2
K, L and Z are the constants that describe the relationship between the inputs x1, x2, ..., xn and the target variable y.
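Once the constants K, L, ..., Z are learned, prediction under equation 1.2 is just a weighted sum of the inputs. The constants and inputs below are invented for illustration:

```python
# Multiple-input prediction per equation 1.2: y = K*x1 + L*x2 + ... + Z*xn.
# The weights (K, L, Z) and inputs (x1, x2, x3) are invented values.

def predict(weights, inputs):
    """Return the weighted sum of the inputs."""
    return sum(w * x for w, x in zip(weights, inputs))

weights = [2.0, 0.5, 1.5]        # K, L, Z
inputs = [1.0, 4.0, 2.0]         # x1, x2, x3
print(predict(weights, inputs))  # 2*1 + 0.5*4 + 1.5*2 = 7.0
```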
Other types of regression algorithms that can be used for variable prediction are Polynomial Regression, Logistic Regression, Quantile Regression, Ridge Regression, Lasso Regression, Elastic Net Regression, Principal Components Regression, Partial Least Squares Regression, Support Vector Regression, Ordinal Regression, Poisson Regression, Negative Binomial Regression, Quasi-Poisson Regression, Cox Regression, Tobit Regression and others.
Below are some business applications of the regression algorithm:
- Predictive Analytics
With the use of regression algorithms, business owners can develop a model that predicts or forecasts their products in terms of quality and/or quantity. If a business can define its input data and has data archives from previous business processes, it can build a model that predicts its business output, whether that is a product or a service. In recent research conducted by S.N. John and his team, a smart fraud-detection system for the banking sector was developed using data mining algorithms. They considered different algorithms, including regression. The input data in this study were checking account, credit history, purpose, employment, gender, marital status, housing, job, and age, while the target variable was the credit score. Their major finding was that “it is safer and more advisable giving loans to individuals that are single”. To find more on this research, follow the link
- Operation Performance Enhancer (optimization)
With a data archive and a sound knowledge of regression models, business owners can develop a model to understand how each input parameter affects their output and make changes to optimize performance. In my research work on “predicting student performance in engineering drawing using supervised learning methods”, my team developed a predictive model to assess student performance using logistic regression. The input variables used in this analysis were the student’s department, sex, state, number of practice hours per week, whether the student owns a tee-square, a set-square, a French curve, a rotary set, an HB pencil and 2B pencils, whether the student offered technical drawing in secondary school, age, number of tutors and the student’s resumption date. The target variable was the final course score. One of our findings was that a student’s average number of personal practice hours had a 51% impact on performance in the course; hence students can improve their performance by committing a good number of hours to their preparation. Click here to read this research journal.
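The prediction step of such a logistic-regression model can be sketched as follows. The weights, the two example features and the pass-probability framing are all invented for illustration; they are not the study’s actual model:

```python
# Logistic-regression prediction sketch: a weighted sum of features passed
# through a sigmoid yields a probability. Weights and features are invented.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights: intercept, practice hours/week, owns set-square
weights = [-3.0, 0.6, 1.0]

def pass_probability(hours, owns_set_square):
    z = weights[0] + weights[1] * hours + weights[2] * owns_set_square
    return sigmoid(z)

print(round(pass_probability(8, 1), 2))   # frequent practice: high probability
print(round(pass_probability(1, 0), 2))   # little practice: low probability
```

The sigmoid squashes the weighted sum into the range (0, 1), which is why logistic regression is the natural choice when the target is a class or probability rather than a continuous score.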
- Improve business decisions with data
The era of the “trial and error” method has been phased out, since it is an expensive process that consumes significant resources and time while producing little or no result. The application of predictive models can greatly aid business owners’ decision-making, thereby eliminating guesswork and the “trial and error” method. Product managers can express their experience as hypotheses and test them using predictive models. This ensures they choose the best option while saving resources and time.