Let’s begin with the basic workflow, which uses the Iris dataset.
Splitting Data Into Training Data And Testing Data
To split the dataset, we use the Data Sampler widget. We will split the data into two parts: 80% for training and 20% for testing. The 80% sample is sent onward to build the model, and the remainder is held back for testing.
First, connect the File widget to Data Sampler. Then open the Data Sampler and adjust its settings as needed.
Data Sampler selects a subset of data instances from an input dataset. It has one input and two outputs:
- Data: input dataset
- Data Sample: sampled data instances (used for training)
- Remaining Data: out-of-sample data (used for testing)
So we pass the whole dataset into the Data Sampler widget. By default, Data Sampler splits the data into 70% training data and 30% test data; here we change the proportion to 80% training data and 20% test data.
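The idea behind the 80/20 split can be sketched in plain Python. This is only an illustration of holdout sampling, not how Orange implements the Data Sampler; the function name `holdout_split`, the `seed` parameter, and the use of row indices as stand-ins for the 150 Iris instances are assumptions made for the example.

```python
import random

def holdout_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and split it into (sample, remaining)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 150 Iris instances: row indices 0..149 (hypothetical).
rows = list(range(150))
data_sample, remaining_data = holdout_split(rows, train_fraction=0.8)
print(len(data_sample), len(remaining_data))  # 120 30
```

The two returned lists play the roles of the Data Sample and Remaining Data outputs: every instance lands in exactly one of them.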
Effects Of Data Splitting On The Classification Model & Result
To see the effect, we first build a workflow that tests learning algorithms (SVM, KNN, and others) on the data and scores them. For this we use the Test and Score widget, which receives data from Data Sampler and then trains, tests, and scores each learner.
Test and Score
Tests learning algorithms on data.
- Data: input dataset
- Test Data: separate data for testing
- Learner: learning algorithm(s)
- Evaluation Results: results of testing classification algorithms.
After splitting the data, connect Data Sampler to the Test and Score widget with two links: one for the training data and one for the test data. Clicking on the link opens the Edit Links dialog, where the links should be set as follows:
Data Sample (80%) -> Data (train data)
Remaining Data (20%) -> Test Data
This workflow uses the Naive Bayes, Random Forest, Neural Network, and kNN (k-nearest neighbors) widgets to create the models. Each of these widgets provides a machine learning method. Connect each of them to the Test and Score widget as a learner, as in the workflow shown in the images above.
As described in the Test and Score section above, the widget needs two things to test and score:
(1) Data (train and test)
(2) Machine Learning Algorithm
Once the learners and the train and test samples are connected to Test and Score, their performance appears in the table inside the widget. Before reading the evaluation results, select the Test on test data option in the left panel of the widget, as shown in the figure below; other evaluation options, such as cross validation and leave one out, are also available. With Test on test data selected, the models are always evaluated on the separate test sample.
Importance Of Separation Of Test And Training Data
The main reason for separating the data is evaluation. Overfitting is a common problem when training a model: it occurs when a model performs exceptionally well on the data used to train it but fails to generalize to new, previously unseen data points.
Test data acts as new, previously unseen data, so evaluating the model on test data reveals its actual accuracy. When the model is instead evaluated on the training data, it reports better accuracy than on the test data, because it has already been trained on the same instances that are used for evaluation. Such a score does not show how the model generalizes to real-world data; it may simply reflect overfitting to the training set.
The effect of data splitting on the classification model therefore shows up in CA (classification accuracy). Here we can see that CA for Test on train data (left side) is higher, but this is not considered the actual accuracy; what we really want is a model that generalizes to unseen test data.
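The gap between accuracy on training data and accuracy on test data can be reproduced with a small sketch. It assumes a toy 1-nearest-neighbor learner and made-up one-feature data with two deliberately noisy labels; none of the names or numbers come from the Orange workflow.

```python
def nearest_neighbor_learner(train):
    """Toy 1-nearest-neighbor learner: memorizes the training instances."""
    def model(x):
        nearest_x, nearest_label = min(train, key=lambda row: abs(row[0] - x))
        return nearest_label
    return model

def classification_accuracy(model, data):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(model(x) == y for x, y in data) / len(data)

# Made-up data: class "a" lies mostly below 5 and class "b" above it,
# with two noisy training labels (x=3.0 and x=8.0).
train = [(1.0, "a"), (2.0, "a"), (3.0, "b"), (4.0, "a"),
         (6.0, "b"), (7.0, "b"), (8.0, "a"), (9.0, "b")]
test = [(1.5, "a"), (2.9, "a"), (4.5, "a"), (5.6, "b"), (7.9, "b"), (8.8, "b")]

model = nearest_neighbor_learner(train)
train_ca = classification_accuracy(model, train)  # evaluated on seen data
test_ca = classification_accuracy(model, test)    # evaluated on unseen data
print(f"CA on train: {train_ca:.2f}, CA on test: {test_ca:.2f}")
```

Because the learner memorizes every training instance, including the noisy ones, it scores 1.00 on the training data but only 0.67 on the test data. The lower number is the more honest estimate.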
Cross-validation is a statistical method for estimating machine learning model performance (or accuracy). It is used to guard against overfitting in a prediction model, especially when the amount of available data is limited. In cross-validation, the data is divided into a fixed number of folds (partitions), the analysis is run on each fold, and the resulting error estimates are averaged.
Splitting the data into train and test data, as we did above, is also a type of validation, called holdout validation. One way to improve on the holdout method is K-fold cross validation. This strategy makes the model’s score less dependent on how we happened to choose the train and test sets. The data set is divided into k subsets, and the holdout approach is repeated k times: each time, one subset serves as the test set and the remaining k-1 subsets form the training set.
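The fold construction described above can be sketched as follows. This is a minimal illustration, assuming unshuffled data and an interleaved assignment of instances to folds; the helper name `k_fold_indices` is hypothetical, and real tools typically shuffle or stratify the data first.

```python
def k_fold_indices(n_instances, k):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    indices = list(range(n_instances))
    for fold in range(k):
        test_idx = indices[fold::k]  # every k-th instance, offset by fold
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx

folds = list(k_fold_indices(n_instances=10, k=5))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train:", train_idx)
```

Each of the 5 iterations is one holdout split, and across all iterations every instance appears in a test set exactly once.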
Efficient Use Of Cross-Validation In Orange
In the same workflow, the Test and Score widget can run cross validation: select the Cross validation option in the left panel of the widget, as shown in the image below. The number of folds (K) can also be changed there.
Here we used K = 10, so the number of folds is 10. The data set is divided into 10 subsets, and the holdout approach is repeated 10 times, each time testing on one subset and training on the other nine.
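To make the K = 10 case concrete, here is a rough sketch of 10-fold cross validation that averages the per-fold classification accuracy. It assumes 150 placeholder instances with 50 per class (mirroring the shape of Iris, not its feature values) and a toy majority-class learner as a stand-in for the real learner widgets; it is not Orange's implementation.

```python
from collections import Counter

def majority_learner(train):
    """Toy baseline: always predicts the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def cross_validate(data, learner, k=10):
    """Return the classification accuracy measured on each of the k folds."""
    scores = []
    for fold in range(k):
        test = data[fold::k]  # one tenth of the data when k=10
        train = [row for i, row in enumerate(data) if i % k != fold]
        model = learner(train)  # the model never sees the current test fold
        scores.append(sum(model(x) == y for x, y in test) / len(test))
    return scores

# Placeholder instances: the "feature" is just the row index (hypothetical).
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
data = list(enumerate(labels))

scores = cross_validate(data, majority_learner, k=10)
mean_ca = sum(scores) / len(scores)
print(len(scores), round(mean_ca, 3))
```

Each fold tests on 15 instances and trains on the other 135. Since the three classes are equally frequent, the baseline gets one third of each fold right, which gives a reference point against which real learners can be compared.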
Effects On Model Output & Accuracy
Cross-validation is a method of evaluating a machine learning model’s ability to predict fresh data. It can also be used to detect issues such as overfitting or selection bias, and it provides information on how the model will generalize to a different dataset. Instead of a single holdout split, the evaluation is performed K times, which gives a better estimate of the model’s actual accuracy. As we can see, the cross-validation accuracy is lower, but it is a more reliable estimate of how well the model generalizes.
Confusion matrices can also be used to analyze the output.
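As a small illustration of what a confusion matrix contains, the sketch below counts actual versus predicted labels. The label lists are made up for the example and are not results from the workflow; the function name `confusion_matrix` is a local helper, not an Orange API.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

classes = ["setosa", "versicolor", "virginica"]
# Made-up labels for illustration only.
actual = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica", "virginica", "versicolor"]

matrix = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, matrix):
    print(f"{name:<12}{row}")
```

The diagonal holds the correctly classified instances, while off-diagonal cells show which classes are confused with each other, detail that a single CA value hides.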
This completes the practical on performing visual programming with Orange3.