Data Science

How To Train Supervised Machine Learning Algorithms Without Labeled Data?

Using AutoEncoders

Mounir Kara Zaitri
7 min read · Sep 6, 2021
Big Data [by PixaBay]

Introduction

Garbage in, Garbage out!

There is a reason why this expression is so popular within the data science community.
The very first step of any data science project or machine learning design is to get quality data. This is done using different tools, including web scraping, polls, sensor measurements, etc. The data is divided into two parts: features and labels. Features are the inputs of the machine learning algorithm. They will be used to predict the labels after the training process. During the training step (in supervised learning), the algorithm gradually adapts its parameters so that its outputs resemble the provided labels.
One of the most costly parts of data collection is labeling, which usually involves human manipulation or annotation. Machine learning (and deep learning) algorithms require a large amount of data to train optimally and to be able to generalize. This also helps prevent overfitting.

But what if the data is not labeled?

When the data is missing part or all of its labels, it is virtually impossible to perform supervised training. We must find a way to create targets, using one of the following approaches:

  • Manual labeling: performed by experts, in-house data labelers, or using crowdsourcing (like Amazon Mechanical Turk)
  • Automated labeling: before training the ML algorithm, labels are created using unsupervised training techniques.

In practice, the amount of data required to train deep learning algorithms efficiently keeps increasing. Requiring 50k, 100k, or millions of labeled data points is not unusual in computer vision or natural language processing.

Label creation using deep autoencoders

In this article, we will go through the implementation of a labeling method based on deep autoencoders. This unsupervised (or self-supervised) learning technique aims to learn the data structure by trying to regenerate the data. An autoencoder consists of two parts: an encoder and a decoder.

Autoencoder architecture [image by author]

The encoder transforms the input data (the features) into a representation (also called an encoding) via multiple non-linear transformations. The trick is that the network simultaneously learns the inverse transformation, in order to reconstruct the data from its representation. This inverse transformation is the decoder. When the autoencoder is trained properly, the data structure is learned. We can then use the encoder to generate labels, by replacing the decoder with a simple classifier that we train on the available labels.

Label generation [image by author]

This technique still requires some labels. As we will see later, we only need a small fraction of the labels, which can be obtained manually or via crowdsourcing at a much lower cost.

The MNIST dataset

In this article, we will use the MNIST dataset to experiment with the efficiency of autoencoders in label creation. MNIST is a large database of handwritten digit images and their corresponding labels. It contains 60,000 grayscale images for training and 10,000 images for validation/testing. It is considered the "Hello World" of computer vision. This dataset can be obtained from multiple sources; we will use the TensorFlow/Keras datasets module to load the data.
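A minimal sketch of the loading step with Keras (variable names follow the article's text; the exact notebook code may differ):

```python
import tensorflow as tf

# Download (if needed) and load the MNIST training and test splits
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
```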

The x_train tensor contains 60,000 matrices of dimension 28x28. Each element of these matrices represents the brightness of a single pixel. The y_train tensor contains the label for each image. We start by normalizing the data and reshaping the matrices into a format that can be used by matplotlib for visualization. The labels are also one-hot encoded; this will be used later during validation/testing.
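One possible preprocessing step, continuing from the snippet above (the extra channel axis, added for the convolutional layers used later, and the `_oh` suffix are illustrative choices, not necessarily the notebook's exact code):

```python
import numpy as np

# Scale pixel values to [0, 1] and add a channel axis for the convolutional layers used later
x_train = (x_train / 255.0).reshape(-1, 28, 28, 1).astype("float32")
x_test = (x_test / 255.0).reshape(-1, 28, 28, 1).astype("float32")

# One-hot encode the labels; these will only be needed for the neural labeler
y_train_oh = tf.keras.utils.to_categorical(y_train, 10)
y_test_oh = tf.keras.utils.to_categorical(y_test, 10)
```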

Let’s show a random sample from the data:
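For example, with matplotlib (an illustrative plot, not necessarily the article's exact figure code):

```python
import matplotlib.pyplot as plt

# Plot five random digits with their ground-truth labels
idx = np.random.choice(len(x_train), size=5, replace=False)
fig, axes = plt.subplots(1, 5, figsize=(10, 2))
for ax, i in zip(axes, idx):
    ax.imshow(x_train[i].squeeze(), cmap="gray")
    ax.set_title(int(y_train[i]))
    ax.axis("off")
plt.show()
```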

The images are in low resolution (28x28). Since they are handwritten, many variations of the same digit can be encountered. The ML algorithm will need a lot of data points to be able to generalize!

Design and validation of a data labeling process

In this section, we will implement a data labeling process, based on a deep autoencoder. The method is described in the following diagram:

[image by author]
  • First (the middle column), we train a random forest classifier using the entire training dataset, in order to set a reference for comparison.
  • In the second step (the right column), a similar classifier is trained on a fraction of the data (in this article, 1%). The goal is to quantify the impact of the dataset size on the overall classifier performance.
  • Lastly (the left column), a data labeling process is used to generate labels and train a classifier. We will compare its performance to the two previous implementations.

Random forest classifier training using the entire dataset

Using Scikit-learn, we can implement a random forest classifier. No hyperparameter tuning will be performed.
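Something along these lines (default hyperparameters; `rf_full` and the flattened arrays are illustrative names):

```python
from sklearn.ensemble import RandomForestClassifier

# Random forests expect flat feature vectors, so flatten each 28x28 image to 784 values
x_train_flat = x_train.reshape(len(x_train), -1)
x_test_flat = x_test.reshape(len(x_test), -1)

# Reference classifier trained on all 60,000 labeled images
rf_full = RandomForestClassifier(n_jobs=-1, random_state=42)
rf_full.fit(x_train_flat, y_train)
```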

After the training step, we measure the average accuracy of the classifier on the test data.
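For instance:

```python
# Mean accuracy on the 10,000 unseen test images (about 90% in the article's run)
print(rf_full.score(x_test_flat, y_test))
```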

Not bad! 90% accuracy using a classic machine learning approach on a computer vision problem is reasonable. We could obtain better results using deep learning and convolutional nets, but that is beyond the scope of this article.
Here are some predictions on the test data:
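A quick way to visualize a few of them (illustrative code):

```python
# Show five random test digits together with the classifier's predictions
idx = np.random.choice(len(x_test), size=5, replace=False)
preds = rf_full.predict(x_test_flat[idx])
fig, axes = plt.subplots(1, 5, figsize=(10, 2))
for ax, i, p in zip(axes, idx, preds):
    ax.imshow(x_test[i].squeeze(), cmap="gray")
    ax.set_title(f"pred: {p}")
    ax.axis("off")
plt.show()
```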

The classifier is able to predict the labels correctly in most cases.
The next step consists of training a classifier with a similar architecture, but with less training data.

Random forest classifier training using a fraction of the dataset

Let’s suppose that MNIST doesn’t provide labels with the images. If we have to create targets manually, we will have to repeat this operation 60,000 times!
If we assume that, overall, it takes 5 seconds per image to open the file and record the label, the operation will require more than 83 hours… there must be a better way!
In this article, we will manually label 1% of the data: 600 labels will take less than one hour to create.
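A sketch of this scenario, reusing the flattened arrays from the previous snippets (the sampling seed and names are illustrative):

```python
# Keep a random 1% of the training set (600 images) to simulate scarce labels
rng = np.random.default_rng(42)
subset = rng.choice(len(x_train), size=600, replace=False)

# Same classifier, far fewer labeled examples
rf_small = RandomForestClassifier(n_jobs=-1, random_state=42)
rf_small.fit(x_train_flat[subset], y_train[subset])
print(rf_small.score(x_test_flat, y_test))  # drops sharply (~47% in the article's run)
```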

The score dropped from 90% to 47%. The classifier is simply useless in this case.

The predictions are not correct in most cases. We have to come up with more labels.

Autoencoder design using convolutional neural networks

The training of the autoencoder for label generation is divided into two steps:

Training an autoencoder for label generation [image by author]

We initially train the encoder and the decoder on the features, to learn the data structure. After that, the decoder is replaced by a classification layer. The resulting system (encoder + classification layer) is trained on the available labels (1% in our case). During this training, only the classification layer’s weights are optimized: the encoder is already trained and doesn’t require further learning. Since the classification layer is not complex, we can obtain good results using a small fraction of the labels.

We use deep convolutional networks to build the encoder. These neural networks are well suited to image data because they extract useful features and are robust to shifted patterns, while requiring fewer parameters than dense networks.
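A minimal convolutional encoder in Keras (the layer sizes and the 32-dimensional encoding are assumptions for illustration, not the article's exact architecture):

```python
from tensorflow.keras import layers, models

latent_dim = 32  # size of the encoding (illustrative choice)

# Encoder: compresses a 28x28x1 image into a latent vector
encoder = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),  # 14x14x32
    layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),  # 7x7x64
    layers.Flatten(),
    layers.Dense(latent_dim, activation="relu"),
], name="encoder")
```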

The decoder performs the inverse transformation: it reconstructs the image from the encoder output. A similar architecture is used.
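A matching decoder sketch, mirroring the encoder with transposed convolutions (again an assumption about the exact layers):

```python
# Decoder: expands the latent vector back into a 28x28x1 image
decoder = models.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(7 * 7 * 64, activation="relu"),
    layers.Reshape((7, 7, 64)),
    layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu"),  # 14x14x64
    layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu"),  # 28x28x32
    layers.Conv2D(1, 3, padding="same", activation="sigmoid"),                    # 28x28x1
], name="decoder")
```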

The autoencoder is built using the encoder and the decoder.
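For example:

```python
# Chain the two parts into a single trainable model
autoencoder = models.Sequential([encoder, decoder], name="autoencoder")
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```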

The training is performed using the training features as both inputs AND outputs. At the end of learning, the autoencoder will be able to reconstruct MNIST images. We use 20% of the data for validation, and an early-stopping callback to prevent overfitting.
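A possible training call (epoch count and batch size are illustrative):

```python
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

# Inputs and targets are the same images: the network learns to reconstruct them
autoencoder.fit(
    x_train, x_train,
    epochs=30,
    batch_size=256,
    validation_split=0.2,
    callbacks=[early_stop],
)
```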

To check the performance of the autoencoder, we pick a random image from the test data. The network outputs a similar image (around 80% reconstruction accuracy, based on the validation accuracy).
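For instance:

```python
# Compare a random test image with its reconstruction
i = np.random.randint(len(x_test))
reconstruction = autoencoder.predict(x_test[i:i + 1])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(4, 2))
ax1.imshow(x_test[i].squeeze(), cmap="gray"); ax1.set_title("original"); ax1.axis("off")
ax2.imshow(reconstruction[0].squeeze(), cmap="gray"); ax2.set_title("reconstructed"); ax2.axis("off")
plt.show()
```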

At this point, all we need to do is add a dense layer with 10 outputs (the number of classes) on top of the encoder.

The weights of the encoder are frozen. The training will affect only the dense layer.
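Putting these two steps together might look like this (the name `labeler` is an illustrative choice):

```python
encoder.trainable = False  # freeze the learned representation

# Classification head: a single dense softmax layer on top of the encoder
labeler = models.Sequential([
    encoder,
    layers.Dense(10, activation="softmax"),
], name="labeler")
labeler.compile(optimizer="adam",
                loss="categorical_crossentropy",
                metrics=["categorical_accuracy"])
```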

Now we train the resulting labeler network on the 1% of the data that has labels.
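Using the same 600-image subset and its one-hot labels (training settings are illustrative):

```python
# Only the dense layer's weights are updated here, since the encoder is frozen
labeler.fit(
    x_train[subset], y_train_oh[subset],
    epochs=100,
    batch_size=32,
    validation_split=0.2,
    callbacks=[tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=5, restore_best_weights=True)],
)
```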

Training stops quickly due to the low number of data points. The categorical accuracy on the validation data is around 85%. Keep in mind that we used 1% of the data to train this network, and 20% of that was used for validation. This means that we achieved 85% accuracy using only 0.8% of the labels!
Let’s evaluate the label generator on the test data and the real training labels.
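For example:

```python
# Accuracy of the labeler on labels it never saw during training
_, test_acc = labeler.evaluate(x_test, y_test_oh, verbose=0)
_, train_acc = labeler.evaluate(x_train, y_train_oh, verbose=0)
print(test_acc, train_acc)
```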

Random forest classifier training using generated labels

New labels are generated using the network and used to train a random forest classifier.
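A sketch of that step, continuing from the earlier snippets:

```python
# Predict a label for every training image and use these predictions as targets
y_generated = np.argmax(labeler.predict(x_train), axis=1)

rf_generated = RandomForestClassifier(n_jobs=-1, random_state=42)
rf_generated.fit(x_train_flat, y_generated)
```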

After the training step, we evaluate the accuracy of the classifier on unseen data (test data).
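For instance:

```python
# Mean accuracy against the real test labels (about 81% in the article's run)
print(rf_generated.score(x_test_flat, y_test))
```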

81% accuracy, using less than 1% of the labels! This technique improved the prediction accuracy from 47% to 81%, a jump of 34 percentage points (over 70% relative improvement).
One might wonder about the usefulness of training a random forest on generated labels when we already have a label generation procedure that performs well. There are in fact multiple reasons:

  • To have a fair comparison, we had to compare similar classification architectures.
  • Neural networks are not always an option, because of their complex implementation.

Conclusions and future work

In this article, we designed and validated an autoencoder-based label generation method. To confirm the efficiency of this design, we tested it on the MNIST dataset with only 1% of the training labels. We achieved 81% accuracy using a random forest classifier.
The encoder was based on convolutional neural networks, an architecture widely used in computer vision problems.
For the sake of simplicity, we didn’t perform any hyperparameter tuning. This allowed us to make a fair comparison between all the scenarios.

In a future article, more details of the convolutional implementation will be discussed. We will also consider other classification/regression problems involving regular feedforward networks and recurrent neural nets.

References

The notebook [https://jovian.ai/kara-mounir/label-generation]

My Github [https://github.com/zaitrik]

My LinkedIn [https://www.linkedin.com/in/mounir-kara-zaitri-a01a00208/]

Géron, Aurélien. (2019). Hands-on machine learning with Scikit-Learn, Keras and TensorFlow: concepts, tools, and techniques to build intelligent systems (2nd ed.). O’Reilly.


Mounir Kara Zaitri

I'm a Canadian air traffic controller, fascinated by data analytics and machine learning.