# Is the decision tree a linear classifier?

## CLASSIFICATION AND REGRESSION

### Classification and regression methods

Classification and regression methods are prediction methods from the field of supervised learning. The goal is a model that predicts, case by case, the value of a target variable from several explanatory variables. In machine learning, the model is trained on data in which the relationships are known, i.e. values of the target variable are available. The model can then be applied to cases in which the values of the target variable are unknown. Classification is used when the target variable is categorical; regression predicts the values of a continuous variable.

### Non-linear relationships

The classic linear counterparts are discriminant analysis and linear regression. Data mining uses methods that can also exploit non-linear relationships and interactions between the explanatory variables for prediction. The advantage of linear methods, however, is that the models themselves are easier to interpret. Most data mining models are regarded as a “black box”: with a good data basis they can make very good predictions, but the model itself contributes little to an understanding of the relationships. The simple decision tree is an exception here.

What the methods presented below have in common is that it makes sense to split the data set at random: the larger part is used to train the model, the smaller part to test it on statistically independent data.
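This random split can be sketched in a few lines. The function name and the 70/30 ratio are illustrative choices, not prescribed by the text:

```python
import random

def train_test_split(cases, test_fraction=0.3, seed=42):
    # Shuffle a copy so the split is random but reproducible via the seed.
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    # The larger part trains the model, the smaller part tests it.
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(list(range(10)))
print(len(train), len(test))  # 7 3
```

Because the test cases are never seen during training, performance on them estimates how well the model generalizes.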

### Idea of decision trees

Decision trees break the data set down in a tree-like hierarchical structure. At each branch, one of the explanatory variables is used to divide the cases. At every step, the optimal variable and the optimal split criterion are sought. Optimal means that the two subgroups are as homogeneous as possible with regard to the target variable; homogeneity thus increases from branch to branch. Finally, each end node is assigned a value of the target variable: the mean value of the training cases in that node, or their majority class. The end nodes correspond to axis-aligned cuboids in the multidimensional variable space. For each case there is now an unambiguous path through the tree, determined by the values of its explanatory variables; the prediction is given by the end node in which the case finally lands.
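The search for an optimal split can be illustrated for a single variable. This sketch uses Gini impurity as the homogeneity measure (one common choice; the text does not name a specific criterion):

```python
def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    # 0.0 means the group is perfectly homogeneous.
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    # Try every observed value as a threshold; keep the one whose
    # weighted child impurity is lowest (most homogeneous subgroups).
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))  # (3, 0.0)
```

A full tree repeats this search recursively in each subgroup until the end nodes are reached.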

### Intuitively interpretable result

A decisive advantage of this method is that the tree can be displayed graphically and interpreted directly. In addition, the tree can be extended relatively easily by storing not only the optimal decision criterion for each branch, but also the second-best, third-best, and so on. Predictions can then be made even if the values of individual explanatory variables are missing for a case. One problem with the decision tree is the risk of overfitting: if the branches become too fine, the tree adapts too closely to the training data and generalizes poorly to unseen data.
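The fallback to the second-best criterion can be sketched as follows. The split representation here is a deliberately simplified, hypothetical structure (real trees store surrogate splits per node):

```python
def predict_with_fallback(case, ranked_splits):
    # ranked_splits: list of (variable, threshold, left_label, right_label),
    # best criterion first. Use the best split whose variable is present.
    for var, threshold, left_label, right_label in ranked_splits:
        value = case.get(var)
        if value is not None:
            return left_label if value <= threshold else right_label
    return None  # no usable variable at all

# Hypothetical ranked criteria for one branch: "income" is optimal,
# "age" is the second-best substitute.
splits = [("income", 50, "no", "yes"), ("age", 30, "no", "yes")]
print(predict_with_fallback({"age": 45}, splits))  # "yes" via the fallback
```

When "income" is missing, the prediction silently falls through to the next-best stored criterion instead of failing.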

### Extensions

A decision tree is not only a model in its own right; it also serves as a building block of more complex methods. A so-called random forest consists of many decision trees, each based on randomly selected variables; in addition, each tree can be trained on only a random subset of the training data. For a concrete prediction, the results of all these trees are averaged (regression) or decided by majority vote (classification). Another extension of the simple tree model is boosting: here a sequence of relatively simple trees is built, each tree trying to correct the errors of its predecessor. The simplicity of the trees reduces the risk of overfitting. Using validation data, the point in the tree sequence can be found at which the model loses its generalizability and adapts too strongly to the specific fluctuations of the training data.
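The sampling and aggregation steps of a random forest can be sketched in isolation (the trees themselves are assumed to exist already; only the forest logic is shown):

```python
import random
from collections import Counter

def bootstrap_sample(cases, rng):
    # Each tree of the forest sees a random sample drawn with replacement,
    # the same size as the original training set.
    return [rng.choice(cases) for _ in cases]

def forest_classify(votes):
    # Classification: the majority decision across all trees.
    return Counter(votes).most_common(1)[0][0]

def forest_regress(predictions):
    # Regression: the average of all trees' predictions.
    return sum(predictions) / len(predictions)

rng = random.Random(0)
sample = bootstrap_sample(["a", "b", "c", "d"], rng)
print(forest_classify(["yes", "no", "yes", "yes"]))  # yes
print(forest_regress([1.0, 2.0, 3.0]))  # 2.0
```

Boosting differs in that the trees are built sequentially rather than independently, each fitted to the errors of its predecessor.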

### Idea of neural networks

Artificial neural networks imitate the learning process of the brain in a simplified way. For classification and regression, the feedforward perceptron is mainly used. Following the biological model, its units are called neurons. Such a perceptron consists of several layers of neurons. The first layer is the input: each explanatory variable is assigned to an input neuron. The last layer is the output; it represents the target variable. Depending on the design of the network, there may be further hidden layers of neurons in between. Each neuron is connected to the neurons of the next layer. When a neuron is activated, the stimulus is passed on to the following neurons via these connections; an activation function decides whether the downstream neurons in turn pass the stimulus on. The aim of training is to adjust the strengths with which the stimuli are transmitted so that each input stimulus (the values of the explanatory variables) leads to the best possible output (the prediction for the target variable).

Depending on the data basis and the relationships between the variables, different network designs can lead to the best results. Both the number of hidden layers (the “depth” of the network) and the number of neurons in these layers can be varied. One tries to keep the total number of neurons as small as possible. A network that is too complex tends to learn not only the underlying relationships, but also the random fluctuations in the training data.
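The stimulus propagation through such a network can be sketched as a forward pass with one hidden layer. The sigmoid activation and all weight values here are illustrative assumptions; training would adjust the weights:

```python
import math

def sigmoid(z):
    # Activation function: maps the summed weighted stimulus into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    # Each hidden neuron sums its weighted inputs and applies the
    # activation; the output neuron does the same with the hidden values.
    hidden = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    return sigmoid(sum(wo * h for wo, h in zip(w_out, hidden)))

# Made-up weights for a network with 2 inputs, 2 hidden neurons, 1 output.
x = [1.0, 0.5]
w_hidden = [[0.4, -0.6], [0.3, 0.8]]
w_out = [1.2, -0.7]
print(forward(x, w_hidden, w_out))
```

Varying the number of hidden layers and the lengths of the weight lists corresponds exactly to varying the depth and width discussed above.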

### Idea of support vector machines

Support vector machines (SVMs) try to find the optimal separating hyperplane in the variable space that divides the data into separate classes. In their basic form, SVMs are suited to predicting a dichotomous categorical variable. The method focuses on the cases that are hardest to classify, i.e. those lying closest to the cases of the other class. These cases are the support vectors; the optimal separating hyperplane results from maximizing the margin to the support vectors. All other cases are disregarded. Notably, neither the coordinates of the support vectors nor the separating surface enter the optimization explicitly, but only simple numbers (the scalar products between the support vectors and the surface normal, which in turn can be reduced to the scalar products between the individual support vectors).

### Non-linear thanks to the kernel trick

The so-called kernel trick exploits this simplification: a transformation of the scalar product implicitly transforms the variable space. In particular, the cases can be positioned in a higher-dimensional space according to a non-linear rule. All non-linearities are absorbed by the transformation, so the separating surface can again be linear. The strength of the kernel trick is that a simple transformation of simple numbers implicitly performs a highly complex vector transformation. Since only the scalar product enters the optimization, the transformed coordinates of the support vectors and of the separating surface never have to be computed.
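The trick can be verified on a small, standard example: the squared scalar product of two 2-D vectors equals the plain scalar product after an explicit quadratic transformation into 3-D, yet the kernel never touches the 3-D coordinates:

```python
import math

def poly_kernel(u, v):
    # Transformed scalar product in the original 2-D space:
    # simply square the ordinary scalar product.
    return (u[0] * v[0] + u[1] * v[1]) ** 2

def explicit_map(u):
    # The implicit transformation the kernel corresponds to: a non-linear
    # rule positioning each case in a higher-dimensional (3-D) space.
    return [u[0] ** 2, u[1] ** 2, math.sqrt(2) * u[0] * u[1]]

u, v = [1.0, 2.0], [3.0, 0.5]
lhs = poly_kernel(u, v)
rhs = sum(a * b for a, b in zip(explicit_map(u), explicit_map(v)))
print(abs(lhs - rhs) < 1e-9)  # True: both routes give the same number
```

The kernel side needs two multiplications and a square; the explicit side needs the full higher-dimensional coordinates, which for richer kernels (e.g. the RBF kernel) would be infinite-dimensional and impossible to compute directly.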

By combining several SVMs, categorical variables with more than two classes can also be predicted. If an SVM is to be used for regression, one does not seek the optimal separating hyperplane but the hyperplane that best describes the values of the target variable.

### Idea of the naive Bayes classifier

The naive Bayes classifier makes predictions based on conditional probabilities. The method is called “naive” because the explanatory variables are assumed to be independent. From the training data, it estimates the probabilities with which the target variable takes a certain value given the value of each explanatory variable. A prediction then results from the product of the probabilities across all explanatory variables.
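The scoring step can be sketched with made-up probability tables (all priors and conditional probabilities below are hypothetical, as if estimated from training data):

```python
def naive_bayes_score(case, prior, cond_probs):
    # Score for one class: the class prior times the product of the
    # conditional probabilities of each observed variable value.
    # The product form is exactly the naive independence assumption.
    score = prior
    for var, value in case.items():
        score *= cond_probs[var].get(value, 0.0)
    return score

# Hypothetical estimates for a yes/no target variable.
prior_yes, prior_no = 0.6, 0.4
cond_yes = {"weather": {"sunny": 0.7, "rainy": 0.3},
            "wind":    {"weak": 0.8, "strong": 0.2}}
cond_no  = {"weather": {"sunny": 0.2, "rainy": 0.8},
            "wind":    {"weak": 0.4, "strong": 0.6}}

case = {"weather": "sunny", "wind": "weak"}
s_yes = naive_bayes_score(case, prior_yes, cond_yes)  # 0.6 * 0.7 * 0.8
s_no = naive_bayes_score(case, prior_no, cond_no)     # 0.4 * 0.2 * 0.4
print("yes" if s_yes > s_no else "no")  # yes
```

The predicted class is simply the one with the larger product; normalizing the scores would turn them into proper posterior probabilities.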

### Contact person data mining

Johannes Lüken
+49 40 25 17 13 22
[email protected]