
GANS

 

Let's first understand the difference between generative models and discriminative models:



In a discriminative model, the goal is to identify the class given a set of features X. A generative model goes the other way: given the target class and some added noise, it should be able to produce an image.
The random noise added at the input is what brings variation to the images the model produces.


Generative Models are classified mainly into 2 types:

VAE:

Though our main goal here is to learn GANs, let me briefly explain VAEs. This architecture has 2 parts, the encoder and the decoder. Both are typically neural networks that work in the following way.

Encoder: its main task is to represent a given image in some latent space. For example, given images of size 224x224, it might represent each one as an array of length n.

Decoder: its task is to take the latent-space representation of an image and reproduce the original image.

Training continues until both the encoder and the decoder are well trained. Afterwards the encoder is discarded, since its job of mapping data into the latent space is done, and the decoder alone is used to produce images.

It is called a variational auto-encoder because, rather than encoding the image into a single n-dimensional array, it encodes the image into a distribution. Every time we need an array, it is randomly sampled from that distribution, which produces some variation in the image.


GANs:

GANs have a similar kind of structure, but they work somewhat differently from VAEs.


In GANs we have two parts, the generator and the discriminator. They work as follows.

Generator:
The main task of the generator is to produce a realistic image given some random noise and a target class (the target class is optional if you are only working with a single class). This is similar to the work of the decoder in a VAE, but during training the input here is random noise rather than an n-dimensional representation of an image.


But if we assume single-class training, the generator will focus on learning a distribution like the one below.

Think of it as a 3D distribution where the peak is coming towards you.

Discriminator:
The task of the discriminator is to tell real images from fake ones. Its input is a mixed batch of real images and images produced by the generator, and its output says which ones are real and which ones are fake.


At the end of each iteration, the model is given feedback on which images were actually real and which were fake. This feedback helps the discriminator improve over time by training to reduce its prediction loss. The same feedback also helps the generator produce more realistic images. This can be understood as follows.


Here the percentage of realness helps the generator by giving it a direction in which to move the gradients, so that it reduces its loss by producing more realistic images. For the discriminator, on the other hand, a generated image is labeled fake whether its percentage of realness is 60 or 95. Training continues until we, the users, are satisfied with the output.


These two models are trained together and are often said to fight each other, which is why they are called "adversarial". They fight in the sense that the generator keeps trying to produce images that the discriminator cannot distinguish from real ones, whereas the discriminator tries to do quite the opposite.

Once the training is done, we no longer need the discriminator, since the generator is good enough to produce realistic images.

NOTE: 
One significant problem here can be the distribution of your training data. Consider the cat dataset below, where some breeds are more common than others, which pushes the generator to produce images of the same breed most of the time.



If you view the image in 3D, you can clearly see the problem we are looking at. The cone-like structure that forms at the top depicts it.

BCE Cost function:


From the image, we can see what each value in the formula represents. BCE loss is useful when we have two target labels (0 or 1).

The first term in the formula works as follows.




Here 0 represents fake and 1 represents real. If the true value is 0, the first term is zero whatever the model predicts, so we don't care about it. If the true value is 1, the first term is close to zero only when the model's prediction is close to 1. If the prediction is close to 0, the term is pushed towards negative infinity to let the model know that there is a significant error.

    


The second term works in quite the opposite way. It only matters when the target label is fake (0): if the model's output is close to 0 the term stays near zero, but if the output is close to 1 the term is pushed towards negative infinity. This part helps the discriminator by telling it that the image is fake but is being misclassified as real.

Note that the negative values produced by these terms are flipped by the minus sign in front of the formula, so the loss becomes a large positive value that needs to be reduced.
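To make the behaviour of the two terms concrete, here is a minimal sketch of the BCE loss in plain Python. The function name `bce_loss` and the small `eps` clamp are my own choices for illustration, not part of the original formula.

```python
import math

def bce_loss(labels, predictions, eps=1e-12):
    """Average binary cross-entropy. Labels are 0 (fake) or 1 (real)."""
    total = 0.0
    for y, p in zip(labels, predictions):
        # Clamp the prediction so log() never receives exactly 0 (assumption
        # added for numerical safety).
        p = min(max(p, eps), 1.0 - eps)
        first_term = y * math.log(p)               # active only when y == 1
        second_term = (1 - y) * math.log(1.0 - p)  # active only when y == 0
        total += first_term + second_term
    # The leading minus sign turns the negative terms into a positive loss.
    return -total / len(labels)

# One real image (label 1) and one fake image (label 0).
confident_and_right = bce_loss([1, 0], [0.99, 0.01])
confident_and_wrong = bce_loss([1, 0], [0.01, 0.99])
```

When the predictions agree with the labels the loss stays close to zero, and when they are confidently wrong it becomes large, which is exactly the signal the two terms are meant to give.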


Training of Discriminator:


The discriminator is trained as shown above. The generator produces images, which are mixed with a set of real photos and passed to the discriminator. The discriminator's predictions are compared with the true labels using the BCE loss, and its weights are updated.


Training of Generator:


For the generator, once an image is produced it is passed to the discriminator, and the BCE loss is evaluated on the discriminator's output, but this time with all the target values set to 1. This way the generator focuses on producing images that the discriminator takes to be real.

Note:
It is important to understand that only one of the two is trained at a time while the other remains frozen. It is also important to keep the training balanced, so that neither model becomes so strong that the other fails to improve. Consider the most common scenario below.


If you have a superior discriminator that says every image produced by the generator is 100 percent fake, there is no way to move the gradients. For example, if the first image produced by the generator got a 5% real score and, after the weight update, the second image got a 3% real score, the generator knows which direction to take. But if the discriminator says the image is 100% fake every time, the generator cannot improve. The reverse is also true. In practice we mostly face the superior discriminator problem, since training a binary classifier tends to be easier than producing an image.

Mode Collapse:

Data can have multiple modes (that is, the points that occur most often in the data).

The same holds for handwritten digits, where we have 10 modes representing 0 to 9. For simplicity, let's assume we only need two features to represent these values, as below.


During training, let's assume the discriminator has become good at telling fakes from reals for all the digits except 1 and 7.

With this information passed back, the generator learns to produce more 1's and 7's, since its goal is to improve its BCE score. In later iterations the model might realize that producing only 1's fools the discriminator 100% of the time, and then it always tries to produce 1's. This problem is called mode collapse, where the discriminator is stuck at a local minimum. Even if the discriminator later moves away from that local minimum, the generator may no longer have a proper direction in which to move its weights, or it might get stuck at another local minimum.

It was found that this issue is mainly caused by the BCE loss.

What the generator and discriminator are doing is called a min-max game: the generator wants to maximize the cost, so that the discriminator fails at its task, and the discriminator wants to minimize it by getting better at its task.
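The min-max game described above can be written compactly. The block below is a sketch in my own notation, assuming d is the discriminator, g the generator, x a real image and z a noise vector. J is the BCE cost from earlier, written as expectations over real and generated images.

```latex
J(d, g) = -\,\mathbb{E}_{x}\big[\log d(x)\big]
          \;-\; \mathbb{E}_{z}\big[\log\big(1 - d(g(z))\big)\big]

\min_{d} \; \max_{g} \; J(d, g)
```

The discriminator lowers J by scoring real images near 1 and generated images near 0, while the generator raises J by making d(g(z)) close to 1.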

The final optimal goal is to 


We already addressed this problem earlier




At the beginning of training, the discriminator is not yet sure which images are real and which are fake, so the gradients it produces at this stage can guide the generator properly.


As training progresses, the discriminator learns to separate the real and fake distributions, and the gradient no longer gives useful feedback about which direction to move in.

Earth Movers Distance:



The main idea of Earth Mover's distance is to calculate the effort, that is, the distance times the amount that has to be moved, needed to make the fake distribution resemble the real distribution. The vanishing gradient problem is avoided here because the distance has no upper limit, so it keeps changing as the two distributions move further apart or closer together.
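For intuition, here is a small sketch of the Earth Mover's distance in the simplest possible setting: two one-dimensional samples of equal size. In that special case, the minimal effort is obtained by matching the sorted values in order. The function name and the toy numbers are assumptions for illustration; real image distributions are far higher-dimensional.

```python
def earth_movers_distance_1d(real, fake):
    """Earth Mover's distance between two equal-size 1-D samples."""
    if len(real) != len(fake):
        raise ValueError("this sketch assumes samples of equal size")
    # In 1-D the cheapest plan moves the i-th smallest fake value
    # to the i-th smallest real value.
    pairs = zip(sorted(real), sorted(fake))
    return sum(abs(r - f) for r, f in pairs) / len(real)

real = [0.0, 1.0, 2.0]
near_fake = [0.5, 1.5, 2.5]     # slightly shifted copy of the real sample
far_fake = [10.0, 11.0, 12.0]   # far away from the real sample

near_distance = earth_movers_distance_1d(real, near_fake)
far_distance = earth_movers_distance_1d(real, far_fake)
```

Unlike a probability squashed between 0 and 1, the distance keeps growing as the fake sample moves away, so there is always a signal telling the generator how far off it is.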

W-loss:


Here the W-loss approximates the Earth Mover's distance between the real and fake distributions. The discriminator wants to maximize the distance between its output on real images, c(x), and its output on fake images, c(g(z)), whereas the generator wants the opposite.

Here the discriminator is called a critic, since it is no longer classifying between 0 and 1, and its output is not bounded.
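As a rough sketch of how the two sides use W-loss, the snippet below works directly on lists of critic scores. The function names are hypothetical, and both objectives are written as losses to minimize, which is a common convention in code and an assumption on my part.

```python
def mean(values):
    return sum(values) / len(values)

def critic_loss(real_scores, fake_scores):
    # The critic wants c(x) - c(g(z)) to be as large as possible,
    # so as a loss to minimize we take the negative of that gap.
    return -(mean(real_scores) - mean(fake_scores))

def generator_loss(fake_scores):
    # The generator wants the critic to score its images highly,
    # so it minimizes the negative of c(g(z)).
    return -mean(fake_scores)

real_scores = [2.0, 4.0]    # c(x): unbounded scores, not probabilities
fake_scores = [-2.0, 0.0]   # c(g(z))

c_loss = critic_loss(real_scores, fake_scores)
g_loss = generator_loss(fake_scores)
```

Note that the scores are arbitrary real numbers, not values between 0 and 1, which is why the model is called a critic rather than a discriminator.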


Condition for W-loss: 

To use W-loss, the critic function needs to be 1-Lipschitz continuous (1-L continuous). That is, if we draw lines of slope -1 and 1 from any point on the function, as above, the rest of the function must stay within those lines. In other words, the norm of the gradient must be at most 1 everywhere, so the slope can never be steeper than 1.

Ways to implement Condition on W-loss:

Weight Clipping:

After updating the weights, we clip any weights that fall outside some fixed interval. However, forcing the weights into a fixed range limits the values they can take, which can restrict what the critic is able to learn.
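A minimal sketch of the clipping step, assuming the critic's weights are held in a flat list and the interval is symmetric around zero. The limit of 0.01 is only an example value.

```python
def clip_weights(weights, limit=0.01):
    """Force every weight into the interval [-limit, limit]."""
    return [max(-limit, min(limit, w)) for w in weights]

weights_after_update = [-0.5, 0.003, 0.2]
clipped = clip_weights(weights_after_update, limit=0.01)
```

Weights already inside the interval are untouched, while large ones are flattened to the boundary, which shows how much information clipping can throw away.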


Gradient Penalty:

Here we add a penalty term rather than clipping the weights. This regularization term penalizes the critic whenever the norm of its gradient exceeds 1. However, it is not practical to check the gradient at every possible point, so we follow the approach below.


We interpolate between real and fake images and then evaluate the following.


As you can see from the image, we penalize the loss function by making it much larger when the norm of the gradient goes beyond 1. The squared term makes the penalty grow quickly, signaling that a gradient with norm greater than 1 is to be avoided. This works better than weight clipping, because the constraint is encouraged through the loss rather than enforced by hard limits on every weight.
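Here is a toy sketch of the penalty on a single interpolated point. I am assuming the squared term has the commonly used form (norm - 1)^2, which also penalizes norms below 1. The critic is a made-up function c(x) = 0.5 * sum(w_i * x_i^2), chosen only because its gradient can be written by hand; a real critic would need automatic differentiation.

```python
import math

def interpolate(real, fake, epsilon):
    # x_hat = epsilon * real + (1 - epsilon) * fake
    return [epsilon * r + (1.0 - epsilon) * f for r, f in zip(real, fake)]

def critic_gradient(x, weights):
    # Toy critic c(x) = 0.5 * sum(w_i * x_i^2), so dc/dx_i = w_i * x_i.
    return [w * xi for w, xi in zip(weights, x)]

def gradient_penalty(real, fake, weights, epsilon):
    x_hat = interpolate(real, fake, epsilon)
    grad = critic_gradient(x_hat, weights)
    norm = math.sqrt(sum(g * g for g in grad))
    # Assumed form of the squared term: distance of the norm from 1.
    return (norm - 1.0) ** 2

real_image = [5.0, 8.0]
fake_image = [1.0, 0.0]

steep_penalty = gradient_penalty(real_image, fake_image, [1.0, 1.0], epsilon=0.5)
gentle_penalty = gradient_penalty(real_image, fake_image, [0.2, 0.2], epsilon=0.5)
```

The interpolated point is [3.0, 4.0]. The steep critic has a gradient norm of 5 there and is penalized heavily, while the gentle critic has a norm of about 1 and is hardly penalized at all. In training, epsilon would be drawn at random for each pair of images.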

Conditional GAN:

Conditional GAN is an approach in which we pass the desired label along with the noise vector. This label is the target class of the image we want the generator to produce. It works like a vending machine where we choose whatever we want.

This class information is fed to both the generator and the discriminator (or critic).

For the generator, it is passed as a one-hot vector.


We may wonder how the generator learns to produce an image of the right class just because the class is given as input. This is achieved by also providing the class information to the discriminator during training, which pushes the generator to learn the right class information for each category of images.


As you can see in the image, although the picture looks real, the output is judged fake because the class does not match. For the discriminator, one way of passing the class as input is as follows.


As shown in the above image, the class information is encoded as separate channels alongside the RGB channels. Each class gets its own channel of the same size as the image. The channel of the target class is filled with 1's and the channels of all other classes are filled with 0's. This approach can be expensive in space, but there are other options, such as using separate heads for the class or a more compact way of representing it.
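The two ways of passing the class can be sketched with plain lists. Concatenating the one-hot vector to the noise vector is my assumption about how the generator input is built, and all function names are hypothetical.

```python
def one_hot(label, num_classes):
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def generator_input(noise, label, num_classes):
    # Assumption: the one-hot class vector is appended to the noise vector.
    return noise + one_hot(label, num_classes)

def class_channels(label, num_classes, height, width):
    # One extra channel per class: all 1's for the target, all 0's otherwise.
    return [
        [[1.0 if c == label else 0.0 for _ in range(width)] for _ in range(height)]
        for c in range(num_classes)
    ]

def discriminator_input(image_channels, label, num_classes):
    height = len(image_channels[0])
    width = len(image_channels[0][0])
    return image_channels + class_channels(label, num_classes, height, width)

g_in = generator_input([0.3, -1.2], label=1, num_classes=3)

# A tiny 2x2 "RGB image": 3 channels, each a 2x2 grid of zeros.
rgb = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]
d_in = discriminator_input(rgb, label=1, num_classes=3)
```

With 3 classes, the discriminator input grows from 3 channels to 6, which is where the extra memory cost comes from when the number of classes is large.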

Controllable Generation:

Controllable generation is somewhat similar to conditional generation, but it differs in a few ways. It is performed after training rather than during training. We usually don't need a labeled dataset, since the focus is on how to modify the noise vector so that the output has the required features. The best example is human face generation, where we modify the noise vector to make the person in the image look older or younger, add glasses, and so on.


Before 

After modifying the noise vector 


Let's see how it typically works 

Controllable generation works somewhat like interpolation, so let's look at interpolation first.


Let's assume we represent our data as vectors, as above. We can interpolate between two images by linearly interpolating their vectors as follows.


After computing these intermediate vectors, we can get the corresponding images from the generator.

In controllable generation, we try to find the direction represented by the pink line, so that moving the noise vector along that direction by a magnitude 'd' gives the change we want, here the hair color of the image.
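Both operations are simple vector arithmetic on the noise vector. The sketch below uses two-dimensional vectors and a made-up direction purely for illustration. In practice the direction has to be found, and the resulting vectors are fed to the trained generator.

```python
def interpolate(z1, z2, alpha):
    """Linear interpolation: alpha=0 gives z1, alpha=1 gives z2."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z1, z2)]

def move_along(z, direction, d):
    """Shift the noise vector by magnitude d along a feature direction."""
    return [zi + d * di for zi, di in zip(z, direction)]

z_first = [0.0, 2.0]
z_second = [4.0, 0.0]
z_between = interpolate(z_first, z_second, alpha=0.25)

# Hypothetical direction for a feature such as hair color.
hair_direction = [0.0, 1.0]
z_edited = move_along([1.0, 1.0], hair_direction, d=2.0)
```

Interpolation walks from one noise vector to another, while controllable generation keeps the starting vector and only moves it along one chosen direction.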




Challenges with Controllable Generation:

 Feature correlation :

One of the major issues we face in controllable generation is feature correlation. We try to change one particular feature by moving a magnitude 'd' in some direction, but because features are correlated in the dataset, other features are affected too, producing unexpected outcomes as shown below.


Here, when we try to add a beard to a woman's face, the expected output is something like the image on top. However, because beards are correlated with masculinity in the data, we get a different picture altogether, which is not desirable.

Z-space Entanglement:

This is similar to feature correlation. The main difference is that, whether or not the features are correlated in the data, changing one feature indirectly affects others, which may not be desirable.

This issue usually occurs when the noise vector does not have enough dimensions to represent each feature separately.

Training :

Controllable GANs can be set up with a classifier whose output is used to update the noise vector until we obtain the required result. One possible example is shown below. Note that the weights of the generator are left untouched.
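To show the idea of updating only the noise vector, here is a toy sketch. The "generator" and "classifier" are tiny made-up functions standing in for trained networks, and the gradient is estimated numerically because plain Python has no automatic differentiation. Everything here is an assumption for illustration.

```python
import math

def generator(z):
    # Frozen toy generator: a fixed map from noise to a 2-value "image".
    return [2.0 * z[0] + z[1], z[0] - z[1]]

def classifier(image):
    # Hypothetical classifier: probability that the wanted feature is present.
    logit = 1.5 * image[0] - 0.5 * image[1]
    return 1.0 / (1.0 + math.exp(-logit))

def score(z):
    return classifier(generator(z))

def numerical_gradient(f, z, h=1e-5):
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

def update_noise(z, steps=20, learning_rate=0.5):
    z = list(z)
    for _ in range(steps):
        grad = numerical_gradient(score, z)
        # Gradient ascent on z only: the generator itself never changes.
        z = [zi + learning_rate * gi for zi, gi in zip(z, grad)]
    return z

z_start = [-0.5, 0.2]
z_end = update_noise(z_start)
```

After the updates, the classifier is more confident that the feature is present in the image generated from `z_end` than in the one generated from `z_start`, even though the generator is exactly the same function as before.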


Evaluation of GANs and challenges:

In reality, there is no perfect measure for evaluating a GAN, because GANs are not simple classifiers whose outputs can be checked against true labels. There are 2 major properties that one looks for in these generative models.

Fidelity: This describes how realistic the image is and how good its quality is.

One approach to fidelity is to take n real and n fake samples and compare how different each fake sample is from its nearest real sample.

Diversity: This checks how well the produced images cover the variety in the training distribution, rather than being the same few images.


Similarly, for diversity we can measure how different the distribution of the n real samples is from that of the n fake samples.

Both properties are crucial, because one without the other is useless.


Ways to measure distances between fake and real images:

Pixel distance: This is a straightforward method where we subtract the pixel values of the fake image from those of the real image and take the total difference across all pixels as the error.


Though this is a simple approach, it is very sensitive to small changes, as shown below.

For a large image, a small shift in the pixels does not really change how the image looks, but the pixel difference can be drastically large. So we opt for feature distance instead.
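The sensitivity to small shifts is easy to reproduce. The sketch below uses a tiny 4x4 grayscale image of vertical stripes and shifts it by one pixel. Using the sum of absolute differences as the pixel distance is my assumption about the exact formula.

```python
def pixel_distance(image_a, image_b):
    """Sum of absolute differences over all pixels."""
    return sum(
        abs(pa - pb)
        for row_a, row_b in zip(image_a, image_b)
        for pa, pb in zip(row_a, row_b)
    )

def shift_right(image):
    # Move every row one pixel to the right, wrapping around at the edge.
    return [[row[-1]] + row[:-1] for row in image]

stripes = [[0, 255, 0, 255] for _ in range(4)]
shifted = shift_right(stripes)

same_distance = pixel_distance(stripes, stripes)
shift_distance = pixel_distance(stripes, shifted)
```

To a person the shifted image is still the same striped pattern, yet every single pixel differs by the maximum amount, so the pixel distance is as large as it can possibly be.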

Feature distance
This is an approach where we extract the key features of the images and compute the difference between the images based on those features.

Though both images convey the same information ('golden retriever dog'), the background and other color differences cause a high pixel distance. Feature distance handles this case much better.

Feature Extraction:


Feature extraction can be done with a model pre-trained on a vast amount of image data, such as ImageNet. The model takes an input image, and the features are read out after the final pooling layer. These n features represent the input image, and the difference between two images can then be calculated from them. We can also choose to extract features from an earlier layer instead of the final pooling layer.

Inception-v3 and feature Embedding:


As discussed earlier, we extract features using a model pre-trained on ImageNet. Here we use Inception-v3, which for a given picture gives us a 2048-dimensional vector representing the features present in the image. This is far more useful than pixel distance, which is computationally expensive for large images and, beyond that, can easily be dominated by the background rather than the main subject of the image.

Now that we have extracted features, how can we measure this feature distance?

There are two major approaches.

Inception Score:

In this approach, we try to evaluate the performance based on two features that were discussed previously: fidelity and diversity.

Here, rather than using the feature embedding, we use the entire Inception model, classifier included, to compute the score.

Assume the Inception model produces the above result, identifying a particular image as a dog with high probability. This indicates high fidelity: the image is clear enough that the model produces a strong spike at one particular class.

From a diversity standpoint, we want the generated images as a whole not to produce a single spike, but rather spikes spread evenly across the different classes.

So at the core, we want low entropy for p(y|x), indicating a clearly identifiable peak, and high entropy for p(y) over the entire data, indicating diversity.

We use KL divergence for calculating the inception score as follows 




Ideally, each fake image should have one distinct label, while the labels over all the generated data should be diversely distributed. This means the KL divergence between p(y|x) and p(y) should be high.



The Inception Score is calculated by summing the KL divergence terms over all classes for each image and averaging over all images. Finally, the result is exponentiated, which makes the score easier to read. Scores range from 1 up to a maximum of 1000, the number of classes in the classifier.
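The calculation can be sketched in a few lines if we assume the classifier outputs p(y|x) are already available as lists of probabilities. The three-class examples are made up to keep the numbers readable, and the function name is my own.

```python
import math

def inception_score(conditionals):
    """conditionals[i][c] is p(y = c | x_i) for generated image x_i."""
    n = len(conditionals)
    num_classes = len(conditionals[0])
    # Marginal p(y): average of p(y|x) over all generated images.
    marginal = [sum(p[c] for p in conditionals) / n for c in range(num_classes)]
    total_kl = 0.0
    for p in conditionals:
        # KL(p(y|x) || p(y)), summed over classes; zero-probability terms add 0.
        total_kl += sum(
            pc * math.log(pc / marginal[c]) for c, pc in enumerate(p) if pc > 0
        )
    return math.exp(total_kl / n)

sharp_and_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
collapsed = [[1.0, 0.0, 0.0]] * 3           # always the same class
blurry = [[1 / 3, 1 / 3, 1 / 3]] * 3        # classifier cannot decide

best_score = inception_score(sharp_and_diverse)
collapsed_score = inception_score(collapsed)
blurry_score = inception_score(blurry)
```

With three classes, the best possible score is 3 and both failure cases give the minimum score of 1, which mirrors the 1 to 1000 range for the 1000-class Inception classifier.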

A higher score indicates a better-performing model. A lower score suggests either that p(y|x) is close to uniform (high entropy, low fidelity) or that p(y) has a pronounced peak (low entropy, low diversity).

Drawbacks:

1. If the GAN generates just one realistic image for each class, it can get a perfect score, since it attains both fidelity and diversity by this measure. In reality we want the model to produce many different images for each class.

2. The other major issue is that the inception score only looks at generated images and never compares them with real images, which is not ideal. This issue is tackled by the FID score below.

3. Since it relies on a feature-based classifier, an image whose features are in the wrong places can still be classified correctly and score well. This is also an issue with FID.



4. If a generated image contains many objects, the entropy of p(y|x) will be high, which makes the score look bad even though the image may be fine.

FID (Frechet Inception Distance):

Given two univariate normal distributions, the Fréchet distance between them is calculated as shown in the picture. Let's extend the same idea to multivariate normal distributions.

This is similar to the univariate case, but since the distributions are multivariate we use the norm of the mean difference and the covariance matrices.
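For reference, the two forms of the Fréchet distance between normal distributions are usually written as below. The symbols are my own choice: the means correspond to Mx and My in the example that follows, sigma is the standard deviation in the univariate case, and capital Sigma is the covariance matrix in the multivariate case.

```latex
% Univariate normal distributions X and Y
d(X, Y)^2 = (\mu_X - \mu_Y)^2 + (\sigma_X - \sigma_Y)^2

% Multivariate normal distributions X and Y
\mathrm{FID} = \lVert \mu_X - \mu_Y \rVert_2^2
             + \mathrm{Tr}\!\left(\Sigma_X + \Sigma_Y
             - 2\,(\Sigma_X \Sigma_Y)^{1/2}\right)
```

In the multivariate form the first term compares the centers of the two feature distributions and the trace term compares their spreads.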

Let's see how this works with a simple example. Assume we have 50k real and 50k fake image samples. We first treat the 50k real samples as one distribution and the 50k fake samples as another.

With the Inception feature embedding, we have a 2048-dimensional feature vector for each image. Mx can be computed as follows.

Similarly, we can compute My. Now let's see how we calculate the covariance for the distribution x.

As with the mean, for each image we subtract the means computed above from its 2048 feature values. Taking the outer product of this centered vector with itself gives a 2048x2048 matrix for each image. These matrices are summed across all 50k images and each entry is divided by 49,999, leaving us with a 2048x2048 covariance matrix.
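The same mean and covariance steps can be sketched on a tiny example. Here I use 3 "images" with 2 features each instead of 50k images with 2048 features. The function names are my own, and the division by n - 1 matches the 49,999 above.

```python
def mean_vector(features):
    n = len(features)
    dim = len(features[0])
    return [sum(f[j] for f in features) / n for j in range(dim)]

def covariance_matrix(features):
    n = len(features)
    dim = len(features[0])
    mu = mean_vector(features)
    cov = [[0.0] * dim for _ in range(dim)]
    for f in features:
        centered = [f[j] - mu[j] for j in range(dim)]
        # Outer product of the centered vector with itself, accumulated.
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += centered[i] * centered[j]
    # Divide every entry by n - 1 (49,999 when n is 50k).
    return [[value / (n - 1) for value in row] for row in cov]

features = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
mu_x = mean_vector(features)
sigma_x = covariance_matrix(features)
```

With d features the covariance matrix is d x d, so for Inception embeddings it has 2048 x 2048 entries regardless of how many images are used.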

Note that the square root here is not taken element by element. It is the matrix square root, which is computed differently. "Tr" denotes the trace of the matrix. The mathematical details may not be crucial, but they are worth knowing.

Note that for FID, the lower the score, the better the model. Since this is a distance, there is no upper bound.

Drawbacks of FID score:

1. The pre-trained model we have opted for may not be suitable for every kind of dataset we work with. For example, if you are working on the MNIST dataset, Inception is not a good fit because it is trained on natural images.

2. The FID score requires a large sample size. It is biased by sample size: the same model can get a better score with a larger sample, so a difference in score may not reflect the true performance of the model.

3. Slow to run.

4. Limited statistics are used, only the mean and covariance, which assumes a normal distribution. This assumption may not always hold in real-world scenarios.

Sampling and Truncating:

So far we have talked about sampling 50k real and 50k fake images for FID, but how do we actually sample them?
Since sampling has a large impact on the score, this is a crucial part. For real images, we sample randomly and uniformly across all the classes. But how do we sample fake images?

For fakes, we sample the noise vectors z from the distribution used during training, that is, the prior P(z). Typically this is a normal prior with mean 0 and standard deviation 1. Images produced from z vectors near 0 tend to be clearer and have greater fidelity, because values near 0 occur more often and so the generator is trained on them more. But if we only sample near 0, we gain fidelity and lose diversity, and sampling from the tails does the opposite.

Truncation trick:


The truncation trick is applied after training, at sampling time.
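As I understand it, the truncation trick samples z from a normal distribution whose tails are cut off, which trades diversity for fidelity. Here is a minimal sketch using simple rejection sampling. The function name and the threshold value are assumptions for illustration.

```python
import random

def sample_truncated_normal(dim, threshold, rng):
    """Draw a noise vector whose entries all lie in [-threshold, threshold]."""
    z = []
    while len(z) < dim:
        value = rng.gauss(0.0, 1.0)
        # Reject values from the tails and draw again.
        if abs(value) <= threshold:
            z.append(value)
    return z

rng = random.Random(0)  # fixed seed so the sketch is repeatable
z_truncated = sample_truncated_normal(dim=8, threshold=0.5, rng=rng)
```

A small threshold keeps z close to 0, where images tend to have higher fidelity, while a large threshold approaches the full normal prior and brings back diversity.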


