Classification with Weka

The concepts of “Artificial Intelligence”, “Deep Learning” and “Machine Learning” are becoming popular in every part of society nowadays. Many people think that AI is dangerous because robots might become intelligent enough to take over the world. In fact, the whole concept is old, and unless you give a robot a soul and consciousness, it cannot conquer the world. In the end, every move and every result comes from a calculation!

Machine learning can be divided into 3 main topics in terms of learning: Classification, Clustering and Reinforcement Learning. The technique is chosen depending on the type of data. Basically, if you have labeled data, classification is a good choice; it is a form of Supervised Learning. If your data is unlabeled but has certain characteristics that group it, you can use clustering, which is a form of Unsupervised Learning. If there is no fixed dataset and a system learns from feedback on its own actions, for example when you train a robot, then Reinforcement Learning is the right choice. It isn’t that simple of course, but from a basic perspective we can explain it as above.

In the previous article, we explained how to use Weka and how to preprocess data with it. Today, we’ll talk about classification and how to do it using Weka.

Classification

The idea of classification is simply to assign data to the classes defined on a dataset. Classification algorithms learn this assignment from a given training set and then try to correctly classify test data for which the class is not specified. The values that specify these classes in the dataset are called labels, and they are used to determine the class of the data given during the test.

Today we will use the Iris dataset to illustrate classification with Weka. You can download the dataset from here; I’ll use iris.data from the link. Since the Iris dataset doesn’t need preprocessing, we can classify it directly. WEKA is a good tool for beginners, and it includes a tremendous number of algorithms. After you load your dataset, click the Classify tab to switch to the section we will talk about in this post. If you don’t know how to load a dataset and how to preprocess with WEKA, you should read the previous post from here.

Figure 1. Classify Window

In the Classify section, as you can see in Area 1 of Figure 1, ZeroR is Weka’s default classifier. Since ZeroR performs poorly on the Iris dataset, we’ll replace it with the J48 algorithm, which is known to achieve a very good success rate on our dataset. Click the Choose button in Area 1 of the figure above to select a new algorithm from the list. J48 is inside the trees directory in the classifier list. Before running the algorithm, we have to select the test options in Area 2. There are 4 test options:

  1. Use training set: Evaluates how well your model classifies the same training set it was trained on.
  2. Supplied test set: Evaluates how well your model classifies a dataset you supply externally. Select the dataset file by clicking the Set button.
  3. Cross-validation: A widely used option, especially if you have a limited amount of data. The number you enter in Folds is the number of parts your dataset is divided into (let’s say it is 10). The original dataset is randomly partitioned into 10 subsets. Weka then uses set 1 for testing and the other 9 sets for training, next uses set 2 for testing and the remaining 9 sets for training, and repeats this 10 times in total, switching the test set to the next one each time. In the end, the average success rate is reported to the user.
  4. Percentage split: Divides your dataset into training and test sets according to the number you enter. The default percentage is 66%, which means 66% of your dataset will be used as the training set and the remaining 34% will be your test set.
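
The fold rotation in option 3 and the split in option 4 can be sketched in a few lines of Python. This is only an illustration of the idea, not Weka's implementation: the function names, the seed and the plain index shuffling are assumptions made here, and unlike Weka's stratified cross-validation the sketch does not balance the classes across folds.

```python
import random


def k_fold_splits(n_instances, k=10, seed=1):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # random partition of the dataset
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]  # fold i is held out for testing
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def percentage_split(n_instances, percent=66, seed=1):
    """Return (train_indices, test_indices) for a single random split."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    cut = round(n_instances * percent / 100)
    return indices[:cut], indices[cut:]


splits = list(k_fold_splits(150, k=10))
train, test = percentage_split(150, percent=66)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
print(len(train), len(test))
```

With 150 instances (the size of the Iris dataset), each of the 10 rounds trains on 135 instances and tests on 15, while a 66% split yields 99 training and 51 test instances.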
Figure 2. Parameters of Algorithm

By clicking the text area (where the arrow points in Figure 2), you can edit the parameters of the algorithm according to your needs.

I chose 10-fold cross-validation from Test Options and used the J48 algorithm. I chose class as my class attribute from the drop-down list and clicked the Start button in Area 2 of Figure 3. According to the result, the success rate is 96%; you can see it in the Classifier Output shown in Area 1 of Figure 3.

Figure 3. Classification Result
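
To put the 96% in perspective, here is a minimal sketch of a majority-class baseline, which is the idea behind ZeroR: it ignores every attribute and always predicts the most frequent class in the training data. The class and function names are hypothetical, the labels are only the class column of Iris (50 flowers per species), and for simplicity the baseline is evaluated on its own training labels rather than with cross-validation.

```python
from collections import Counter


class MajorityClassBaseline:
    """Always predicts the most frequent training label (the idea behind ZeroR)."""

    def fit(self, labels):
        # Remember the single most common label; attributes are never used.
        self.prediction = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_instances):
        return [self.prediction] * n_instances


def accuracy(actual, predicted):
    """Fraction of instances whose predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Class column of the Iris dataset: three species, 50 instances each.
labels = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50

baseline = MajorityClassBaseline().fit(labels)
baseline_accuracy = accuracy(labels, baseline.predict(len(labels)))
print(f"{baseline_accuracy:.1%}")
```

With three equally sized classes, any constant prediction is right for only one flower in three, which is why a real learner such as J48 is needed here.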

The Classifier Output in Area 1 gives you the detailed result, as you can see in Figure 4. It consists of 5 parts. The first one is Run Information, which gives detailed information about the dataset and the model you used. As you can see in Figure 4, we used J48 as the classification model, our dataset was the Iris dataset, and its attributes are sepallength, sepalwidth, petallength, petalwidth and class. Our test mode is 10-fold cross-validation. Since J48 is a decision tree, our model is a pruned tree. As you can see in the tree, the first split is on petalwidth, the petal width of the flower: if the value is less than or equal to 0.6, the species is Iris-setosa; otherwise another branch checks further conditions to decide the species. In the tree structure, the text after ‘:’ is the class label.

The Classifier Model part shows the model as a tree and gives some information about it, such as the number of leaves and the size of the tree. Next is the Stratified cross-validation part, which shows the error rates; by checking this part you can see how successful your model is. For example, our model correctly classified 96% of the instances across the cross-validation test folds, and our mean absolute error is 0.035, which is acceptable for the Iris dataset and our model.

Figure 4. Detailed Classification Result

You can see the Confusion Matrix and the detailed accuracy table at the bottom of the report. The F-Measure and ROC Area values are important for evaluating models, and they are based on the quantities the confusion matrix reports. The confusion matrix shows you the True Positive, True Negative, False Positive and False Negative counts. If you don’t know the meaning of these terms, read the next part; otherwise you can skip directly to the Visualizing the Result part.

Confusion Matrix

A prediction is “true” when the class your model predicts matches the actual class and “false” when it does not; “positive” and “negative” name the class the model predicted. Let’s say you have a dataset of patient records, some with cancer and some without, and you are building a model that diagnoses patients automatically from their blood test results. In this example, being healthy is the positive result and having cancer is the negative result. You give one patient’s data to your model, and the model answers cancer. Then you compare this with the expert’s actual diagnosis. If the expert also diagnosed cancer, this is a True Negative (TN): you predicted negative and the patient really is negative. If the expert said that this patient does not have cancer at all, your model is wrong and this is a False Negative (FN): you predicted negative, but the result was supposed to be positive. Conversely, if your model says the patient is healthy but the expert says he/she has cancer, this is a False Positive (FP), because you predicted positive but the actual result is negative. If the expert says he/she is healthy too, the result counts as a True Positive (TP). It might seem a bit confusing at first sight, but with practice you’ll read these results easily. The confusion matrix consists of these 4 values.
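
The four counts, and the F-Measure built from them, can be computed with a short Python sketch. The patient labels below are made up for illustration, the function names are hypothetical, and “healthy” is passed as the positive class to match the example above; pass a different value for `positive` to use another convention.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive):
    """Count TP, FP, TN and FN for one class treated as positive."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # The model predicted positive: right (TP) or wrong (FP)?
            counts["TP" if a == positive else "FP"] += 1
        else:
            # The model predicted negative: right (TN) or wrong (FN)?
            counts["TN" if a != positive else "FN"] += 1
    return counts


def f_measure(counts):
    """Harmonic mean of precision and recall."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical expert diagnoses and model predictions for six patients.
actual = ["healthy", "healthy", "cancer", "cancer", "healthy", "cancer"]
predicted = ["healthy", "cancer", "cancer", "healthy", "healthy", "cancer"]

counts = confusion_counts(actual, predicted, positive="healthy")
score = f_measure(counts)
print(dict(counts), round(score, 3))
```

For a dataset with more than two classes, such as Iris, the same counting can be repeated once per class, each time treating that class as positive and all the others as negative.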

Visualizing the Result

If you’d like to visualize these results, you can use the graphical presentations shown in Figure 5 below.

Figure 5. Visualize Tree – 1

By right-clicking the result and selecting Visualize tree, you’ll see an illustration of your model like the one in Figure 6.

Figure 6. Visualize Tree – 2

If you’d like to see the classification errors illustrated, select Visualize Classifier Errors in the same menu. By sliding the Jitter control (Area 1 in Figure 7), you can see all samples on the coordinate plane. The X axis represents the predicted class and the Y axis represents the actual class. Squares represent wrongly classified samples. Stars represent correctly classified samples. Blue ones are Iris-setosa, red ones are Iris-versicolor and green ones are Iris-virginica. So, a red square means our model classified this sample as Iris-versicolor, but it was supposed to be Iris-virginica.

Figure 7. Visualize Classifier Errors

If you click on one of the squares, you can see more detailed information. I clicked one of the blue ones, as shown in Figure 8, and saw in detail which sample was classified wrongly. But why would we want to see a wrongly classified sample in detail?

Figure 8. Classified Sample in Detail

In machine learning you have many samples to classify. Sometimes, looking at the samples yourself gives you basic ideas for making your classifier model more robust, or helps you find outliers, which are irrelevant information for the data you use. So, although we call it machine learning, most of the time it still depends on a human to check the data in the dataset.