What is Bayesian Classification

Puzzle ITC - Changing IT for the better

November 21, 2016

Blog post series part 2: On the trail of logic

In the first blog post I introduced a practical use case for machine learning and outlined a rough solution. In this second blog post I would like to take a closer look at the algorithms used. Therefore, this is probably the driest blog post we have ever published here, but to understand how the described machine learning approach works, we have to dive a little into the math.

When looking for a solution for allocating the budget items, I received the tip that at best a Bayesian classifier is a possible approach. Through tests and various experiments, I have found over time for the problem of budget item allocation described in the first blog post that the use of a Bayesian network brings the best results. What was also exciting in the search for a solution was the fact that such a classification gives better results than much more complex approaches (such as neural networks or the algorithm of the nearest neighbor). It should already be clear at this point that there is no simple solution to the subject of machine learning. The solution shown here is also specific for this application.

To explain the mathematical approach behind budget allocation, let me first define and explain a few terms;

Definitions of terms


Attributes correspond to properties of an object. For example, a “Payment” object can have an “Amount” attribute. Every attribute has a set of values ​​that it can take on.

Attribute value

Attribute values ​​(also called characteristics) are defined units that belong to an attribute. These can be numerical values ​​or a fixed set of values, e.g.


The class, also known as the class attribute, is the attribute you are looking for. Each classification tries to search for a class or to calculate a class value on the basis of a set of attributes.

Class value

A class value corresponds to a value that the class can assume, e.g .:


An object has properties, e.g. a payment.


Once classified, objects are referred to as “examples”.

Training data

Training data is the set of possible examples that can be used as references for future classifications.

Training set

The training data that is used as a learning basis for a classification is called a training set or test data.

Since a set of attributes can also be viewed as the coordinate of a vector in n-dimensional space, we also speak of property vectors. Thus, an example or object of the training data corresponds to a vector or a feature vector in space.


Is referred to in space as a property vector or feature vector.

Training data

Now correspond to a set of classified property vectors.

In the case of budget item allocation, a payment would be an object for which we create a property vector. This is assigned to a class attribute by means of classification, in the case of budget item allocation to a budget item.

For the use of a classifier as an allocation basis for the selection of budget items, the use of a self-learning automatic classifier made sense to me. The aim of classifying learning is to predict the characteristics of a certain attribute, the values ​​of which are only known for part of the data. In other words: We do not know the class for all the properties of a payment. Examples for which we know the class are called classified examples. Accordingly, we do not know the class of unclassified examples.

Classifying learning consists of two steps. The first step consists of building or training a classifier, with which we search for and assign classes for unclassified property vectors in a second step. This assignment is called classification.

Bayesian classification

A Bayesian classifier assigns each object to the class to which it belongs with the greatest probability or for which the classification results in the least costs. It is a mathematical function that assigns each point of a feature space to a class. The Bayesian classifier uses a cost measure that assigns costs to each possible classification. The Bayesian classifier tries to minimize these costs and to find the optimal allocation. In the simplest case, a cost measure is assumed which only incurs costs if there is a wrong decision. The aim of the classifier is therefore to keep costs as low as possible. However, the prerequisite for the creation and assignment of a cost measure is that it is known with what probability a point in the feature space belongs to a certain class. However, since this probability is not known in a practical application, it is estimated. A probability distribution (usually a normal distribution) is assumed and an attempt is made to estimate its parameters.
The Bayesian classifier applies the mathematical principles of Bayes' theorem.

Bayes' theorem

Bayes 'theorem, or Bayes' theorem, describes how to calculate with conditional probabilities. Bayes' theorem for two random events A and B is as follows:

The conditional probabilities and let's call them a posteriori Probabilities. In contrast, we denote the probabilities and as a priori Probabilities.

Bayes' theorem can be illustrated with a simple urn example. In two urns A and B there are red and white balls. In A there are seven red balls and three white balls, in B one red and nine white balls. Now any ball is drawn from A or B. Whether it is drawn from A or B, be a priori equally likely. The ball drawn is red. Find the probability that this ball comes from urn A. Since the choice of the urn is equally likely, the result is:

Since the number of balls and their color in both urns is known, we get for urn A:

and for urn B:

this results in a probability of pulling a red ball:

Using Bayes' theorem, we now get the probability that a red ball from urn A comes from:

Naive Bayes classification

Bayes' theorem enables the probability of a class to be estimated for a given classification example. This probability can be used to classify the feature vector by predicting the class for which the highest probability has been estimated. We are at the a posteriori Probabilities for a class under the occurrence of the vector Interested. Let's sit and into Bayes' theorem, we get the following calculation approach:

In the case of the naive Bayesian classification, the classification problem can be expressed as the probability that a payment, given by the feature vector , the class is to be assigned to be described. So for all class attributes the conditional probability becomes calculated.

If all are known, a payment with feature vector in the class be classified for which assumes the greatest value. Since only nominal attributes are used in the budget item assignment, I will limit myself to the calculation with nominal values. Since rarely all are known, the following approach is followed:

in which the number of training sets and the number of training sets in the class have been assigned. The absolute frequency of the class is divided by the number of training sets. To determine this, all training sets are used go through and is increased accordingly. The probability that exactly the vector we are looking for is already in the training set, is relatively small or does not exist. The lack of vectors would be useful in estimating the probabilities lead to very small, insufficiently meaningful values. Therefore becomes a naive Assumption made by assuming all attributes to be independent. This results in a possibility of estimating the desired, conditional probabilities. Because of this assumption of independence, this becomes Bayesian classifier as more naiveBayesian classifier designated.

The probability can be estimated independently of the other attributes. For this purpose, the relative frequency of the vectors for which the attribute the value and the class attribute is, by the relative frequency of the class divided.

corresponds to the number of training sets with a class attribute and attribute value for the attribute . This can already be used to determine the number of training sets with class attribute when running through the training data To be defined. However, there is a problem with this: if the absolute frequency of an attribute value the class Is 0, this also applies to the probability and thus also for the product of . As a result, the remaining probabilities become irrelevant. One solution to this problem is that m-estimate. This is based on an a priori estimate the probability out.


m stands for the equivalent sample size. This is a constant that determines how strong is weighted relative to the trained data. The name equivalent samplesize comes through a possible interpretation as an extension of the Attributes in the classes around virtual attributes, which are distributed accordingly. Using the m-estimate the following formula results for the naive Bayesian classification:

Bayesian networks

In contrast to the naive Bayesian approach, the application of a classification using Bayesian networks does not assume that the attributes are independent. Dependencies between individual attributes can be defined in the form of a directed acyclic graph (DAG), with nodes representing features and their edges representing dependencies between features. The naive Bayesian classifier can also be viewed as a special case of a Bayesian network classifier in that a star-shaped Bayesian network is created in which all attributes only depend on the class.

Each node can have a probability value from 0 to 1. The probabilities of the successor nodes are calculated with these probabilities. A table with conditional probabilities is calculated for each node. If a node has only one predecessor node, this can be a simple conditional probability.

The table of conditional probabilities contains direct predecessor features for each possible combination the conditional probability for the characteristic values as . This allows with

the probability for a certain path of characteristic values ​​can be calculated. One or more nodes of the graph are selected as target nodes for the classification. These correspond to the group according to which the objects are to be classified. In order to classify an object, the probabilities of the paths to the selected target nodes are compared and the target node with the highest probability is selected.

Back to practice

Since I do not want to implement the complete algorithm in an application, I decided to use WEKA, a Java library that provides a large number of algorithms ready-made. The principle of the calculation remains the same. For each feature of my payment, the probability for each possible budget item is calculated. In the end, WEKA gives me a list of my budget items and the respective percentage allocation probability. I only have to select the budget item with the highest value and get a relatively high hit rate. Of course, everything can still be optimized. For example, through the use of a Bayesian network, in which dependencies are defined under the payment characteristics. For example, I can define that the type of purchase is directly dependent on the day of the week. Since I usually do bulk shopping and clothing purchases on Saturdays, this already results in an improvement.

How I integrated Weka in my solution and how to work with the API of the library I will explain in the next blog post.

Read the third part of the blog post series here.

Read the first part of the blog post series here.