Lecture Notes in Deep Learning: Unsupervised Learning – Part 5

Symbolic picture for the article.

Advanced GAN Methods

These are the lecture notes for FAU’s YouTube Lecture “Deep Learning”. This is a full transcript of the lecture video & matching slides. We hope you enjoy this as much as the videos. Of course, this transcript was created with deep learning techniques largely automatically and only minor manual modifications were performed. Try it yourself! If you spot mistakes, please let us know!

X-Ray projections created using GANs. Image created using gifify. Source: YouTube

Welcome back to deep learning! In the last video, we discussed the different algorithms regarding generative adversarial networks. Today, we want to look into the fifth part of our lecture on unsupervised learning. These are essentially the tricks of the trade concerning GANs.

One-sided label smoothing helps with training GANs. Image under CC BY 4.0 from the Deep Learning Lecture.

One trick that can help you quite a bit is one-sided label smoothing. What you want to do is replace the targets of the real samples with a smoothed version. So, instead of using a probability of one, you use a probability of 0.9. You do not do the same for the fake samples. You don’t change their label because otherwise this would reinforce incorrect behavior: your generator would then be rewarded for producing samples that resemble the data or the samples it already makes. The benefits are that you prevent the discriminator from giving very large gradients to your generator and you also prevent it from extrapolating to encourage extreme samples.
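As a small illustration, here is a minimal PyTorch sketch of a discriminator loss with one-sided label smoothing; the function name and the smoothing value of 0.9 are illustrative choices:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real_logits, d_fake_logits, real_label=0.9):
    # One-sided label smoothing: smooth only the targets of the real samples.
    real_targets = torch.full_like(d_real_logits, real_label)   # 0.9 instead of 1.0
    fake_targets = torch.zeros_like(d_fake_logits)               # fakes stay at 0
    loss_real = F.binary_cross_entropy_with_logits(d_real_logits, real_targets)
    loss_fake = F.binary_cross_entropy_with_logits(d_fake_logits, fake_targets)
    return loss_real + loss_fake
```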

GANs do not require balancing between the discrimination and the generator. Image under CC BY 4.0 from the Deep Learning Lecture.

Is balancing between the generator and the discriminator necessary? No, it’s not. GANs work by estimating the ratio of the data and model densities. This ratio is estimated correctly only when the discriminator is optimal. So, it’s fine if your discriminator overpowers the generator. When the discriminator gets too good, your gradients may, of course, vanish. Then, you can use tricks like the non-saturating loss or the Wasserstein GAN loss, as we talked about earlier. You may also run into the problem that your generator gradients get too large. In this case, you can use the trick of label smoothing.
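As a reminder, the saturating and the non-saturating generator losses can be written as:

```latex
L_G^{\text{saturating}} = \mathbb{E}_{z \sim p_z}\!\left[\log\big(1 - D(G(z))\big)\right]
\qquad\text{vs.}\qquad
L_G^{\text{non-saturating}} = -\,\mathbb{E}_{z \sim p_z}\!\left[\log D(G(z))\right]
```

The non-saturating variant still provides useful gradients when the discriminator confidently rejects the generated samples.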

Deep convolutional GANs. Image under CC BY 4.0 from the Deep Learning Lecture.

Of course, you can also work with deep convolutional GANs. This is the DCGAN, where you implement a deep convolutional architecture in the generator. You replace pooling layers with strided convolutions and transposed convolutions. You can remove the fully connected hidden layers for deeper architectures. The generator then typically uses ReLU activations, except for the output layer, which uses a hyperbolic tangent. The discriminator, for example, uses a leaky ReLU activation for all layers, and both networks use batch normalization.
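To make this architectural recipe concrete, here is a minimal PyTorch sketch of such a DCGAN-style generator/discriminator pair; the channel counts and the 64 × 64 resolution are illustrative assumptions, not the exact configuration of [16]:

```python
import torch.nn as nn

# Generator: strided transposed convolutions instead of upsampling/pooling,
# batch normalization, ReLU inside, tanh at the output (noise: 100 x 1 x 1).
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, 2, 1),  nn.BatchNorm2d(64),  nn.ReLU(True),
    nn.ConvTranspose2d(64, 32, 4, 2, 1),   nn.BatchNorm2d(32),  nn.ReLU(True),
    nn.ConvTranspose2d(32, 3, 4, 2, 1),    nn.Tanh(),           # 64 x 64 image
)

# Discriminator: strided convolutions, leaky ReLU everywhere,
# no fully connected hidden layers.
discriminator = nn.Sequential(
    nn.Conv2d(3, 32, 4, 2, 1),    nn.LeakyReLU(0.2, True),
    nn.Conv2d(32, 64, 4, 2, 1),   nn.BatchNorm2d(64),  nn.LeakyReLU(0.2, True),
    nn.Conv2d(64, 128, 4, 2, 1),  nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
    nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, True),
    nn.Conv2d(256, 1, 4, 1, 0),   # single real/fake logit
)
```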

Generated images might show strong intra-batch correlation. Image under CC BY 4.0 from the Deep Learning Lecture.

If you do that, then you may end up with the following problem: you can see here some generation results. Within the batches, there may be a very strong intra-batch correlation. So within a batch, all of the generated images look very similar.

Virtual Batch Normalization can help with intra-batch correlations. Image under CC BY 4.0 from the Deep Learning Lecture.

This brings us to the concept of virtual batch normalization. You don’t want to use a single batch normalization instance for both mini-batches, i.e. the real and the generated one. You could use two separate batch normalizations or, even better, virtual batch normalization. In case this is too expensive, you choose instance normalization: for each sample, subtract its mean and divide by its standard deviation. If you choose virtual batch normalization, you create a reference batch R of random samples and fix it once at the start of the training. Then, for each x subscript i of the current mini-batch, you create a new virtual batch that is the reference batch union x subscript i. Then, you compute the mean and standard deviation of this virtual batch. You always need to propagate R forward in addition to the current batch. This then allows you to normalize x subscript i with these statistics. So, this may be kind of expensive, but it has proven very useful for stabilizing the training and removing the intra-batch correlations.
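A minimal sketch of virtual batch normalization, assuming we normalize each sample directly with element-wise statistics (in a real network you would apply this per channel inside the layers):

```python
import torch

def virtual_batch_norm(x, ref_batch, eps=1e-5):
    """Normalize each sample x_i of the current mini-batch with statistics
    computed over the fixed reference batch R plus x_i itself."""
    outputs = []
    for xi in x:                                              # xi: one sample
        vb = torch.cat([ref_batch, xi.unsqueeze(0)], dim=0)   # R union {x_i}
        mean = vb.mean(dim=0)
        std = vb.std(dim=0, unbiased=False)
        outputs.append((xi - mean) / (std + eps))
    return torch.stack(outputs)

# The reference batch is drawn once at the start of training and then fixed:
# ref_batch, _ = next(iter(data_loader))
```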

Historical averaging can stabilise GAN training. Image under CC BY 4.0 from the Deep Learning Lecture.

There’s also the idea of historical averaging. There, you add a penalty term that punishes weights which are far away from the historical average. This historical average of the parameters can be updated in an online fashion. Similar tricks from reinforcement learning, like experience replay, also work for generative adversarial networks: you keep a replay buffer of past generations and occasionally show them to the discriminator again, or you keep checkpoints of past generators and discriminators and occasionally swap them in for a few iterations.
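A minimal sketch of such a historical averaging penalty with an online running mean of the parameters; the class name and the weighting factor are made up for illustration:

```python
import torch

class HistoricalAverage:
    """Penalize the squared distance of the current parameters to their
    running (historical) average."""
    def __init__(self, params, weight=1.0):
        self.params = list(params)
        self.avg = [p.detach().clone() for p in self.params]
        self.count = 1            # the initial parameters count as one sample
        self.weight = weight

    def penalty(self):
        # penalty term that is added to the usual GAN loss of this network
        loss = sum(((p - a) ** 2).sum() for p, a in zip(self.params, self.avg))
        # online update of the historical average (detached, no gradient)
        self.count += 1
        for p, a in zip(self.params, self.avg):
            a.add_((p.detach() - a) / self.count)
        return self.weight * loss
```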

Examples of images created with DCGAN. Image under CC BY 4.0 from the Deep Learning Lecture.

If you do all of this, then you can train things like the DCGAN. Here are bedrooms after just one epoch. You can see that you are able to generate quite a few different bedrooms. So it is very interesting what kind of diversity in terms of generalization you can actually achieve.

Anime characters from GANs. Image created using gifify. Source: YouTube

Another interesting observation is that you can do vector arithmetic with the generated samples. So, you can compute, for example, the mean of three instances of the condition “man with glasses”. From this mean, you subtract the mean of “man without glasses” and add the mean of “woman without glasses” on top. What you get is “woman with glasses”.
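A minimal sketch of this latent-space arithmetic, assuming a trained generator G and latent codes whose generated images were identified with the respective concepts (here replaced by random placeholders):

```python
import torch

latent_dim = 100
G = lambda z: z                      # placeholder for a trained DCGAN generator
man_glasses      = torch.randn(3, latent_dim)   # stand-ins for latent codes whose
man_no_glasses   = torch.randn(3, latent_dim)   # generated images were identified
woman_no_glasses = torch.randn(3, latent_dim)   # with the respective concepts

z_new = man_glasses.mean(0) - man_no_glasses.mean(0) + woman_no_glasses.mean(0)
image = G(z_new.unsqueeze(0))        # with a real generator: "woman with glasses"
```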

Conditioning allows this kind of vector arithmetic. Image under CC BY 4.0 from the Deep Learning Lecture.

So, you can really use this kind of constrained generation in order to generate something for which you potentially don’t have a conditioning variable. The GANs learn a representation that disentangles the concept of gender from the concept of wearing glasses. If you’re interested in these experiments, I recommend looking into [1].

Mode collapse in training will prevent learning the entire distribution. Image under CC BY 4.0 from the Deep Learning Lecture.

So let’s look at a couple of more advanced GAN methods. A typical problem that may occur is mode collapse. If you have a target distribution with several cluster centers distributed in a ring-like fashion in 2-D space, then it may occur that your generator rotates through the modes of the data distribution. So, you do 5,000 steps, 10,000 steps, up to 25,000 steps, and the generator keeps jumping between the modes. The problem here is, of course, that you never converge to a fixed distribution. A possible reason is that the minimization over G of the maximization over D is not equal to the maximization over D of the minimization over G (see the formula below). With the discriminator in the inner loop, the discriminator converges to the correct distribution; with the generator in the inner loop, the generator places all mass on the most likely point. In practice, if you do simultaneous stochastic gradient descent on both networks, both effects can actually appear. Solutions are to use mini-batch discrimination or unrolled GANs.
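Written out, the two orders of optimization that get confused are:

```latex
\min_G \max_D V(D, G) \;\neq\; \max_D \min_G V(D, G)
```

The left-hand side is the game we would actually like to solve; the right-hand side, with the generator in the inner loop, is what drives all the mass onto the single most likely point.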

Mini-batch discrimination forces dissimilarity within batches. Image under CC BY 4.0 from the Deep Learning Lecture.

So, mini-batch discrimination follows the intuition that allowing D to look at multiple samples in combination helps the generator avoid collapsing. You extract features from an intermediate layer and then add a mini-batch layer that computes, for each feature vector, the similarity to all other samples of the mini-batch. You concatenate this similarity vector to each feature vector. These mini-batch features are computed separately for samples from the generator and from the training data. Your discriminator still outputs 0 or 1 but uses the similarity of all samples in the mini-batch as side information. So, this is mini-batch discrimination.
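A minimal sketch of such a mini-batch discrimination layer following the construction in [17]; the tensor shapes follow that paper, the function and variable names are ours:

```python
import torch

def minibatch_discrimination(features, T):
    """features: (N, A) intermediate discriminator features,
    T: learnable tensor of shape (A, B, C).
    Returns (N, A + B): the input features with B similarity features appended."""
    N = features.shape[0]
    A, B, C = T.shape
    M = (features @ T.reshape(A, B * C)).reshape(N, B, C)
    # L1 distances between every pair of samples in the mini-batch
    l1 = (M.unsqueeze(0) - M.unsqueeze(1)).abs().sum(dim=3)      # (N, N, B)
    o = torch.exp(-l1).sum(dim=1)    # similarity to all samples (incl. itself)
    return torch.cat([features, o], dim=1)
```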

Unrolling the GAN training brings in ideas from training RNNs. Image under CC BY 4.0 from the Deep Learning Lecture.

You can also follow the concept of the unrolled GAN. Ideally, the generator should account for a fully maximized discriminator. The problem is that we essentially ignore this maximization operation when computing the generator’s gradient. The idea here is to regard our value function V as a cost for the generator and to backpropagate through the maximization operation.

The unrolling procedure is shown in an example. Image under CC BY 4.0 from the Deep Learning Lecture.

This essentially leads to a concept very similar to what we’ve already seen with recurrent neural networks. You can unroll several of those stochastic gradient descent iterations of the discriminator on its parameter sets, which gives you unrolled gradients over a forward pass of several iterations. This allows us to build a computational graph describing several steps of learning in the discriminator, and you backpropagate through all of these steps when you compute the generator’s gradient. Fully maximizing the discriminator’s value function is intractable, of course, so you have to stop at a low number of steps k. But you can see that already with a low number like 10, this substantially reduces mode collapse.
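A minimal sketch of the unrolling idea, assuming a toy linear discriminator so that the k gradient steps can be written functionally; this only illustrates backpropagating through the discriminator updates and is not the full method of [14]:

```python
import torch
import torch.nn.functional as F

def unrolled_discriminator(w, x_real, x_fake, k=10, lr=0.1):
    """Unroll k SGD steps of a toy linear discriminator D(x) = sigmoid(x @ w)
    while keeping the graph, so the generator can backpropagate through the
    discriminator's learning procedure. w must have requires_grad=True."""
    for _ in range(k):
        loss_d = (F.softplus(-(x_real @ w)).mean()    # -log D(x_real)
                  + F.softplus(x_fake @ w).mean())    # -log(1 - D(x_fake))
        (grad_w,) = torch.autograd.grad(loss_d, w, create_graph=True)
        w = w - lr * grad_w                           # functional update, graph kept
    return w

# Generator step (x_fake must be produced by the generator so that gradients
# flow back into its parameters):
# w_k = unrolled_discriminator(w, x_real, x_fake, k=10)
# loss_g = F.softplus(-(x_fake @ w_k)).mean()         # non-saturating generator loss
# loss_g.backward()
```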

Unrolled GANs prevent mode collapse. Image under CC BY 4.0 from the Deep Learning Lecture.

So, here is an example of the unrolled GAN on the same target distribution as for the standard GAN, which showed the alternating mode behavior. With the unrolled GAN, you can see that in step zero you still have the same initial distribution. But already after 5,000 steps, the distribution is spread over a much larger area. After 10,000 steps, the entire domain is filled. After 15,000 steps, a ring forms. After 20,000 steps, you can see that certain maxima are appearing, and after 25,000 steps, we manage to mimic the original target distribution.

Portraits created using GANs. Image created using gifify. Source: YouTube

You can also use GANs for semi-supervised learning. Here, the idea is to use the GAN to turn a K-class problem into a (K+1)-class problem, where the K true classes are the target classes of the supervised task.

GANs are also suited for semi-supervised training. Image under CC BY 4.0 from the Deep Learning Lecture.

Then, you have one additional class which models the fake inputs that have been generated by our generator G. The probability of being real is essentially the sum over all the real classes, and the discriminator is then also used as a classifier within the GAN game.
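A minimal sketch of the resulting (K+1)-class losses; the split into a supervised and an unsupervised part follows [17], while the function name and the simplifications are ours:

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(logits_labeled, labels, logits_unlabeled, logits_fake, K):
    """The discriminator outputs K+1 logits; index K is the 'fake' class.
    P(real) is the summed probability of the first K classes."""
    # supervised part: cross-entropy over the K real classes (a simplification)
    loss_supervised = F.cross_entropy(logits_labeled[:, :K], labels)

    # unsupervised part: unlabeled real samples should not be called fake,
    # generated samples should end up in the fake class K
    p_fake_real = F.softmax(logits_unlabeled, dim=1)[:, K]
    p_fake_gen = F.softmax(logits_fake, dim=1)[:, K]
    loss_unsupervised = (-torch.log(1.0 - p_fake_real + 1e-8).mean()
                         - torch.log(p_fake_gen + 1e-8).mean())
    return loss_supervised + loss_unsupervised
```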

Multi-scale approaches also are applicable for GANs. Image under CC BY 4.0 from the Deep Learning Lecture.

Again, we can also use other ideas from our deep learning class, for example, the Laplacian pyramid. You can build a Laplacian pyramid of GANs. We have observed so far that GANs are pretty good at generating low-resolution images; high-resolution images are much more difficult. Here, you then start by generating low-resolution images. In the figure, we go from right to left. You start by generating a small-resolution image from noise. Then, you have a generator that takes the generated image as a condition plus additional noise to generate the missing high frequencies, which you add in order to upscale the image. You can do this again at the next scale: the upscaled image is still blurry and missing high frequencies, so you again use it as a conditioning input to a generator that puts the high frequencies back. So step-by-step, we use conditional GANs to generate the missing high frequencies.
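A minimal sketch of this coarse-to-fine sampling, assuming already trained (hypothetical) generator callables g_coarse and g_refiners:

```python
import torch
import torch.nn.functional as F

def lapgan_sample(g_coarse, g_refiners, z_dims):
    """Coarse-to-fine LAPGAN sampling. `g_coarse` maps noise to a small image;
    each generator in `g_refiners` receives the upsampled (blurry) image as a
    condition plus fresh noise and predicts the missing high frequencies."""
    img = g_coarse(torch.randn(1, z_dims[0]))                 # e.g. 8 x 8
    for g, z_dim in zip(g_refiners, z_dims[1:]):
        up = F.interpolate(img, scale_factor=2, mode='bilinear',
                           align_corners=False)               # blurry upscaling
        high_freq = g(up, torch.randn(1, z_dim))              # conditional GAN
        img = up + high_freq                                   # put details back
    return img
```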

The training procedure of LAPGAN. Image under CC BY 4.0 from the Deep Learning Lecture.

You train the generators by training a discriminator on each level. The inputs for the discriminators are the images at the different scales. So you downscale the image step-by-step, and from the downscaled image you can then compute the difference image to the next finer scale. This difference image gives you the correct high frequencies. You train the discriminator such that it can differentiate the true high frequencies from the artificially generated ones. So, this is the LAPGAN training, but we still only have 64 by 64 pixels.
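A minimal sketch of how the training target of one pyramid level could be computed, i.e. the difference image that contains the true high frequencies:

```python
import torch.nn.functional as F

def laplacian_level_target(image):
    """Downscale the real image, upscale it again, and take the difference.
    The difference image contains exactly the high frequencies the generator
    of this level has to produce; the downscaled image is the condition."""
    low = F.avg_pool2d(image, kernel_size=2)                   # coarser scale
    blurry = F.interpolate(low, scale_factor=2, mode='bilinear',
                           align_corners=False)                # back to full size
    return image - blurry, low
```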

Stacked GANs allow multi-stage GAN processing. Image under CC BY 4.0 from the Deep Learning Lecture.

Another idea is the StackGAN. It is used for generating photorealistic images from text. So, the task is: given some text, generate a fitting image. You decompose the problem into sketch refinement using a two-stage conditional GAN. The analogy here is human painting, where you would first sketch and then draw the fine details. So, you have two stages: a Stage-I GAN that draws a low-resolution image. It is conditioned on the text description and plans the rough shape, basic colors, and the background from the given text. The Stage-II GAN then generates a high-resolution image conditioned on the Stage-I result and the text description. It corrects defects and adds details.
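A minimal sketch of the two-stage sampling, with text_encoder, g_stage1, and g_stage2 as hypothetical, already trained callables:

```python
import torch

def stackgan_sample(text_encoder, g_stage1, g_stage2, caption):
    """Two-stage text-to-image sampling: Stage-I plans the rough shape and
    colors at low resolution, Stage-II refines it to high resolution, both
    conditioned on the text embedding."""
    text_emb = text_encoder(caption)            # e.g. a (1, 128) embedding
    z = torch.randn(1, 100)                     # Stage-I noise
    low_res = g_stage1(z, text_emb)             # e.g. 64 x 64 rough sketch
    high_res = g_stage2(low_res, text_emb)      # e.g. 256 x 256 refinement
    return low_res, high_res
```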

Examples from StackGAN. Image under CC BY 4.0 from the Deep Learning Lecture.

So, here are some examples where you have the text description; we are generating birds from these descriptions. You can see that the Stage-I generation is still missing many details. With the Stage-II generation, a lot of the problems introduced in Stage-I are fixed, and you can see that the new images have a much higher resolution and much better quality. You can find the whole paper in [20].

Summary of our observations on GANs. Image under CC BY 4.0 from the Deep Learning Lecture.

Let’s summarize: GANs are generative models that use supervised learning to approximate an intractable cost function. You can simulate many different cost functions with them. It is actually hard to find the equilibrium between the discriminator and the generator, and GANs cannot generate discrete data, by the way. You can still use them for semi-supervised classification, transfer learning, multimodal outputs, and domain transfer. There are a lot of papers appearing right now. These can also create high-resolution output, such as BigGAN, for example, which produces really highly resolved images.

The world according to BigGAN. Image created using gifify. Source: YouTube

So next time on deep learning, we have a couple of interesting things coming up. We will go into how to adapt neural networks to localize objects. So, we go into object detection and localization. We will then detect object classes. We try to detect instances of classes. Then, we will go even further and try to segment the outlines of objects.

More exciting things coming up in this deep learning lecture. Image under CC BY 4.0 from the Deep Learning Lecture.

So, we really want to go into image segmentation and not just find objects in images, but really try to identify their outlines. We will also look into one technique very soon that has been cited very, very often. So, you can check the citation count every day and you will see that it has increased.

Comprehensive questions for unsupervised learning. Image under CC BY 4.0 from the Deep Learning Lecture.

We also have some comprehensive questions here: What’s the basic idea of contrastive divergence? What is the defining characteristic of an autoencoder? How do denoising autoencoders work? Also, don’t forget about the variational autoencoders. I think this is a very cool concept, and the reparametrization trick tells us how you can backpropagate through a sampling process, which is pretty cool. Of course, you should look into GANs. What is an optimal discriminator? What’s the problem of mode collapse? I particularly encourage you to have a look at the Cycle GANs. Cycle GANs are very, very popular right now. They are really nice because you can tackle things like unpaired domain translation. This is being used in many, many different applications. So, you should know about those methods if you really want to put in your CV that you have experience with deep learning. Well, there’s some further reading. There’s a very nice talk about variational autoencoders that we are linking here. There’s a great tutorial about GANs by Goodfellow – the GAN father. So, I recommend watching this video. Then, we have some more tricks of the trade that you can find in the GAN hacks. If you ever wondered how to name your GAN, then you could have a look at this reference. Of course, we also have plenty of references, and I really recommend looking at all of them. So, thank you very much for watching this video. I hope you liked our small summary of generative adversarial networks, and I am looking forward to meeting you in the next one. Thank you very much and bye-bye!

If you liked this post, you can find more essays here, more educational material on Machine Learning here, or have a look at our Deep Learning Lecture. I would also appreciate a follow on YouTube, Twitter, Facebook, or LinkedIn in case you want to be informed about more essays, videos, and research in the future. This article is released under the Creative Commons 4.0 Attribution License and can be reprinted and modified if referenced. If you are interested in generating transcripts from video lectures, try AutoBlog.

Links

Link – Variational Autoencoders
Link – NIPS 2016 GAN Tutorial of Goodfellow
Link – How to train a GAN? Tips and tricks to make GANs work (careful, not everything is true anymore!)
Link – Ever wondered about how to name your GAN?

References

[1] Xi Chen, Xi Chen, Yan Duan, et al. “InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets”. In: Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 2016, pp. 2172–2180.
[2] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, et al. “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion”. In: Journal of Machine Learning Research 11.Dec (2010), pp. 3371–3408.
[3] Emily L. Denton, Soumith Chintala, Arthur Szlam, et al. “Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks”. In: CoRR abs/1506.05751 (2015). arXiv: 1506.05751.
[4] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern classification. 2nd ed. New York: Wiley-Interscience, Nov. 2000.
[5] Asja Fischer and Christian Igel. “Training restricted Boltzmann machines: An introduction”. In: Pattern Recognition 47.1 (2014), pp. 25–39.
[6] John Gauthier. Conditional generative adversarial networks for face generation. Mar. 17, 2015. URL: http://www.foldl.me/2015/conditional-gans-face-generation/ (visited on 01/22/2018).
[7] Ian Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. 2016. eprint: arXiv:1701.00160.
[8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, et al. “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium”. In: Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017, pp. 6626–6637.
[9] Geoffrey E Hinton and Ruslan R Salakhutdinov. “Reducing the dimensionality of data with neural networks.” In: Science 313.5786 (July 2006), pp. 504–507.
[10] Geoffrey E. Hinton. “A Practical Guide to Training Restricted Boltzmann Machines”. In: Neural Networks: Tricks of the Trade: Second Edition. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 599–619.
[11] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, et al. “Image-to-Image Translation with Conditional Adversarial Networks”. In: (2016). eprint: arXiv:1611.07004.
[12] Diederik P Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In: arXiv e-prints, arXiv:1312.6114 (Dec. 2013), arXiv:1312.6114. arXiv: 1312.6114 [stat.ML].
[13] Jonathan Masci, Ueli Meier, Dan Ciresan, et al. “Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction”. In: Artificial Neural Networks and Machine Learning – ICANN 2011. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 52–59.
[14] Luke Metz, Ben Poole, David Pfau, et al. “Unrolled Generative Adversarial Networks”. In: International Conference on Learning Representations. Apr. 2017. eprint: arXiv:1611.02163.
[15] Mehdi Mirza and Simon Osindero. “Conditional Generative Adversarial Nets”. In: CoRR abs/1411.1784 (2014). arXiv: 1411.1784.
[16] Alec Radford, Luke Metz, and Soumith Chintala. “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”. 2015. eprint: arXiv:1511.06434.
[17] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al. “Improved Techniques for Training GANs”. In: Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 2016, pp. 2234–2242.
[18] Andrew Ng. “CS294A Lecture notes”. In: 2011.
[19] Han Zhang, Tao Xu, Hongsheng Li, et al. “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks”. In: CoRR abs/1612.03242 (2016). arXiv: 1612.03242.
[20] Han Zhang, Tao Xu, Hongsheng Li, et al. “StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks”. In: arXiv preprint arXiv:1612.03242 (2016).
[21] Bolei Zhou, Aditya Khosla, Agata Lapedriza, et al. “Learning Deep Features for Discriminative Localization”. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, June 2016, pp. 2921–2929. arXiv: 1512.04150.
[22] Jun-Yan Zhu, Taesung Park, Phillip Isola, et al. “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”. In: CoRR abs/1703.10593 (2017). arXiv: 1703.10593.