Exploring Variational Autoencoders: Generate New Images with Ease
Introduction
This article delves into Variational Autoencoders (VAE), which belong to the broader category of Deep Generative Models, alongside the well-known GANs (Generative Adversarial Networks).
In contrast to GANs, VAEs employ an Autoencoder framework rather than utilizing a dual Generator-Discriminator setup. Consequently, the concepts underlying VAEs should be relatively easy to grasp, especially for those familiar with Autoencoders.
Feel free to subscribe for email alerts to stay updated on my forthcoming articles regarding Neural Networks, including topics like GANs.
Table of Contents
- The role of VAEs in the realm of Machine Learning algorithms
- An exploration of VAE architecture and functionality
- A detailed Python example illustrating how to construct a VAE using Keras/TensorFlow
The Role of VAEs in Machine Learning Algorithms
The chart below aims to categorize the most prevalent Machine Learning algorithms. This is challenging because algorithms can be classified along multiple dimensions, such as their underlying structure or the type of problem they solve.
I have attempted to incorporate both perspectives, leading to the categorization of Neural Networks into a distinct group. Although Neural Networks are predominantly applied in a Supervised manner, it's important to recognize that certain instances, such as Autoencoders, lean towards Unsupervised/Self-Supervised approaches.
Despite Variational Autoencoders (VAE) sharing similar goals with GANs, their structural design aligns more closely with other Autoencoder types like Undercomplete Autoencoders. Thus, you can find VAEs in the Autoencoders section of the interactive chart below.
VAE Architecture and Functionality
Let's begin with an analysis of a standard Undercomplete Autoencoder (AE) before we examine the unique features that differentiate VAEs.
Undercomplete AE
Below is a depiction of a typical AE.
The primary objective of an Undercomplete AE is to effectively encode information from the input data into a lower-dimensional latent space (bottleneck). This is achieved by ensuring the inputs can be reconstructed with minimal loss through a decoder.
It's important to note that during training, the same data set is fed into both the input and output layers as we strive to determine the optimal parameter values for the latent space.
Variational AE
Now, let's investigate how VAEs differ from Undercomplete AEs by examining their architecture:
In VAEs, the latent space comprises distributions rather than individual point vectors. Each input is mapped to a Normal distribution whose parameters, Z-Mean and Z-Log-Sigma (the mean and the log of the standard deviation), are learned during model training.
The latent vector Z is sampled from this distribution using Z-Mean and Z-Log-Sigma and is then passed to the decoder to generate the predicted outputs.
Notably, the latent space of a VAE is continuous by design, allowing us to sample from any location within it to produce new outputs (e.g., new images), thus establishing VAE as a generative model.
Regularization Necessity
Encoding inputs into distributions is not, by itself, enough to produce a latent space that can generate "meaningful" outputs. Without an additional constraint, the encoder is free to learn very small spreads or widely separated means, effectively collapsing back to point-like encodings.
To attain the desired regularity, we introduce a regularization term in the form of the Kullback-Leibler (KL) divergence, which is discussed in further detail in the Python section.
Understanding Latent Space
We can visualize how information is distributed within the latent space with the following illustration.
Mapping data as individual points does not train the model to comprehend the similarities or differences among those points. Therefore, such a space is ineffective for generating new “meaningful” data.
In contrast, Variational Autoencoders map data as distributions and regularize the latent space, which creates a “gradient” or “smooth transition” between distributions. Consequently, sampling a point from this latent space results in new data that closely resembles the training data.
Complete Python Example: Building a VAE with Keras/TensorFlow
Now, we are ready to construct our own VAE!
Setup
We will require the following data and libraries:
- MNIST handwritten digit dataset (copyright held by Yann LeCun and Corinna Cortes under the Creative Commons Attribution-Share Alike 3.0 license; data source: The MNIST Database)
- NumPy for data manipulation
- Matplotlib, Graphviz, and Plotly for visualizations
- TensorFlow/Keras for Neural Networks
Let's import the necessary libraries:
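A minimal version of the imports might look like this (the version-printing lines simply echo the packages listed above):

```python
# Data manipulation
import numpy as np
print('numpy: %s' % np.__version__)

# Visualization
import matplotlib
import matplotlib.pyplot as plt
print('matplotlib: %s' % matplotlib.__version__)
import graphviz
print('graphviz: %s' % graphviz.__version__)
import plotly
import plotly.express as px
print('plotly: %s' % plotly.__version__)

# Neural networks
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Lambda
from tensorflow.keras.utils import plot_model
from tensorflow.keras import backend as K
print('TensorFlow/Keras: %s' % tf.__version__)
```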
The above code displays the package versions utilized in this example:
```
TensorFlow/Keras: 2.7.0
numpy: 1.21.4
matplotlib: 3.5.1
graphviz: 0.19.1
plotly: 5.4.0
```
Next, we will load the MNIST handwritten digit dataset and showcase the first ten digits. Note that we will only utilize digit labels (y_train, y_test) for visualization, not for model training.
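A sketch of the loading and display step (the 1×10 subplot layout is just one way to show the first ten digits):

```python
# Load the MNIST digits; labels are kept for visualization only
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

# Display the first ten training digits with their labels
fig, axes = plt.subplots(1, 10, figsize=(15, 2))
for i, ax in enumerate(axes):
    ax.imshow(X_train[i], cmap='gray')
    ax.set_title(y_train[i])
    ax.axis('off')
plt.show()
```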
As demonstrated, we have 60,000 images in the training set and 10,000 in the test set, each with dimensions of 28 x 28 pixels.
The final setup step involves flattening the images by reshaping them from 28x28 to 784.
Typically, Convolutional layers would be preferred over flattening, particularly for larger images. However, for simplicity, this example will use Dense layers with flat data instead of Convolutional layers.
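A sketch of the reshaping step (rescaling the pixel values to [0, 1] is an extra assumption that helps MSE training):

```python
# Rescale pixel values from [0, 255] to [0, 1] (assumed preprocessing step)
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

# Flatten each 28x28 image into a 784-dimensional vector
X_train = X_train.reshape(X_train.shape[0], 784)
X_test = X_test.reshape(X_test.shape[0], 784)
print('New shape of X_train:', X_train.shape)
print('New shape of X_test:', X_test.shape)
```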
```
New shape of X_train: (60000, 784)
New shape of X_test: (10000, 784)
```
Constructing the Variational Autoencoder Model
We will start by defining a function that samples from the latent space distribution Z.
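A sketch of such a sampling function, written for the Keras backend:

```python
def sampling(args):
    """Draw z = z_mean + exp(z_log_sigma) * epsilon (reparameterization trick)."""
    z_mean, z_log_sigma = args
    batch = K.shape(z_mean)[0]
    dim = K.int_shape(z_mean)[1]
    # Epsilon is the only stochastic node; it is drawn from a standard Normal
    epsilon = K.random_normal(shape=(batch, dim))
    return z_mean + K.exp(z_log_sigma) * epsilon
```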
Here, we employ the reparameterization trick: the loss can backpropagate through the mean (z-mean) and log standard deviation (z-log-sigma) nodes because they remain deterministic.
The randomness is isolated in a separate, non-deterministic parameter, epsilon, which is sampled from a standard Normal distribution.
Now, let's define the structure of the Encoder model.
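A sketch of the encoder, using a two-dimensional latent space so we can plot it later; the hidden-layer sizes of 64, 16, and 8 are illustrative assumptions:

```python
original_dim = 784  # flattened image size
latent_dim = 2      # 2D latent space, chosen so we can visualize it

# Encoder: 784 inputs -> shrinking hidden layers (sizes are assumptions)
visible = Input(shape=(original_dim,), name='Encoder-Input-Layer')
h_enc1 = Dense(units=64, activation='relu', name='Encoder-Hidden-Layer-1')(visible)
h_enc2 = Dense(units=16, activation='relu', name='Encoder-Hidden-Layer-2')(h_enc1)
h_enc3 = Dense(units=8, activation='relu', name='Encoder-Hidden-Layer-3')(h_enc2)

# Two parallel layers learn the mean and log standard deviation
z_mean = Dense(units=latent_dim, name='Z-Mean')(h_enc3)
z_log_sigma = Dense(units=latent_dim, name='Z-Log-Sigma')(h_enc3)

# Custom Lambda layer that samples Z using the function defined earlier
z = Lambda(sampling, name='Z-Sampling-Layer')([z_mean, z_log_sigma])

# The encoder exposes all three outputs; only Z will feed the decoder
encoder = Model(visible, [z_mean, z_log_sigma, z], name='Encoder-Model')
plot_model(encoder, show_shapes=True, dpi=300)
```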
The code above creates an encoder model and outputs its structural diagram.
Notice that we direct the same outputs from the Encoder-Hidden-Layer-3 into both Z-Mean and Z-Log-Sigma before recombining them within a custom Lambda layer (Z-Sampling-Layer), which is responsible for sampling from the latent space.
Next, we will develop the Decoder model:
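A matching sketch of the decoder, mirroring the encoder's hidden-layer sizes (again an assumption):

```python
# Decoder: latent Z -> expanding hidden layers -> 784 output nodes
latent_inputs = Input(shape=(latent_dim,), name='Decoder-Input-Layer')
h_dec1 = Dense(units=8, activation='relu', name='Decoder-Hidden-Layer-1')(latent_inputs)
h_dec2 = Dense(units=16, activation='relu', name='Decoder-Hidden-Layer-2')(h_dec1)
h_dec3 = Dense(units=64, activation='relu', name='Decoder-Hidden-Layer-3')(h_dec2)
outpt = Dense(units=original_dim, activation='sigmoid', name='Decoder-Output-Layer')(h_dec3)

decoder = Model(latent_inputs, outpt, name='Decoder-Model')
plot_model(decoder, show_shapes=True, dpi=300)
```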
The code above creates a decoder model and outputs its structural diagram.
As illustrated, the decoder is a straightforward model that processes inputs from the latent space through a few hidden layers before generating outputs for the 784 nodes.
Next, we will combine the Encoder and Decoder models to form a Variational Autoencoder model (VAE).
If you look closely at the latent space layers in the Encoder model, you will see that it generates three outputs: Z-Mean [0], Z-Log-Sigma [1], and Z [2].
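A sketch of the combined model:

```python
# Pass the Encoder's third output (the sampled Z, index [2]) into the Decoder
outpt = decoder(encoder(visible)[2])

# The full VAE maps the 784 inputs ("visible") to the 784 outputs ("outpt")
vae = Model(inputs=visible, outputs=outpt, name='VAE-Model')
```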
The code above connects the models by specifying that the Encoder receives inputs labeled “visible”. Out of the three outputs from the Encoder [0], [1], [2], we pass the third one (Z [2]) into a Decoder, which produces the outputs labeled “outpt”.
Custom Loss Function
Before training the VAE model, the final step is to devise a custom loss function and compile the model.
As previously mentioned, we will utilize KL divergence to gauge the loss between the latent space distribution and a reference standard Normal distribution. The “KL loss” complements the standard reconstruction loss (in this case, MSE) to ensure input and output images remain closely aligned.
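One way to set this up in TensorFlow 2.x (a sketch; the relative weighting of the two terms is a tuning choice) is to attach the KL term with `add_loss` and supply MSE as the reconstruction loss at compile time:

```python
# KL divergence between the learned latent distribution and a standard Normal
kl_loss = -0.5 * K.mean(
    1 + 2 * z_log_sigma - K.square(z_mean) - K.exp(2 * z_log_sigma))

# Attach the KL term to the model; MSE handles the reconstruction term
vae.add_loss(kl_loss)
vae.compile(optimizer='adam', loss='mse')
```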
Training the VAE Model
With the Variational Autoencoder model assembled, let’s train it for 25 epochs and visualize the loss chart.
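A sketch of the training call (the batch size of 64 is an assumption):

```python
# Train with the same images as inputs and targets, validating on the test set
history = vae.fit(X_train, X_train, epochs=25, batch_size=64,
                  validation_data=(X_test, X_test))

# Plot training and validation loss by epoch
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
```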
Visualizing Latent Space and Generating New Digits
Given that our latent space is two-dimensional, we can visualize the neighborhoods of various digits on the latent 2D plane.
Plotting the digit distribution within the latent space allows us to visually associate different regions with distinct digits.
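A sketch of this visualization, using Plotly and the mean of each encoded distribution:

```python
# Encode the test images; take the mean of each latent distribution (output [0])
z_mean_test, _, _ = encoder.predict(X_test)

# Scatter the 2D latent means, colored by digit label (labels are display-only)
fig = px.scatter(x=z_mean_test[:, 0], y=z_mean_test[:, 1],
                 color=y_test.astype(str),
                 labels={'x': 'Latent dimension 1',
                         'y': 'Latent dimension 2',
                         'color': 'Digit'})
fig.show()
```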
If we aim to generate a new image of the digit 3, we note that 3s are positioned in the upper middle of the latent space. Thus, we can select the coordinates of [0, 2.5] to generate an image based on those inputs.
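A sketch of decoding a single latent point (the exact region occupied by 3s depends on your training run):

```python
# Decode one point from the region of the latent space occupied by 3s
z_sample = np.array([[0.0, 2.5]])
digit = decoder.predict(z_sample)

# Reshape the 784-vector back into a 28x28 image and display it
plt.imshow(digit.reshape(28, 28), cmap='gray')
plt.axis('off')
plt.show()
```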
As anticipated, we obtained an image closely resembling the digit 3 because we sampled a vector from a region in the latent space associated with 3s.
Now, let's generate 900 new digits across the entire latent space.
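A sketch of the grid generation (the ±2.5 range over the latent plane is an assumption; widen or narrow it to match your latent space):

```python
# Decode a 30x30 grid of latent points (900 digits) spanning the latent plane
n = 30
figure = np.zeros((28 * n, 28 * n))
grid_x = np.linspace(-2.5, 2.5, n)
grid_y = np.linspace(-2.5, 2.5, n)[::-1]  # flip so the y-axis points up

for i, yi in enumerate(grid_y):
    for j, xi in enumerate(grid_x):
        digit = decoder.predict(np.array([[xi, yi]])).reshape(28, 28)
        figure[i * 28:(i + 1) * 28, j * 28:(j + 1) * 28] = digit

plt.figure(figsize=(12, 12))
plt.imshow(figure, cmap='gray')
plt.axis('off')
plt.show()
```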
The exciting aspect of generating numerous images from the entire latent space is that it enables us to observe the gradual transitions between different shapes. This confirms the successful regularization of our latent space.
Final Thoughts
It's crucial to recognize that Variational Autoencoders can encode and generate significantly more complex data than MNIST digits.
I encourage you to elevate this straightforward tutorial by applying it to real-world datasets relevant to your field.
For your convenience, I have saved a Jupyter Notebook in my GitHub repository that includes all the code presented above.
If you wish to be notified the moment I release a new article on Machine Learning / Neural Networks (e.g., Generative Adversarial Networks (GAN)), please subscribe to receive email updates.
If you're not a Medium member and would like to continue reading articles from countless talented writers, you can join using my personalized referral link at solclover.com.
Feel free to reach out if you have any questions or suggestions!
Cheers! Saul Dobilas