
Loss-Sensitive Generative Adversarial Networks on Lipschitz Densities


Abstract

In this paper, we present the Lipschitz regularization theory and algorithms for a novel Loss-Sensitive Generative Adversarial Network (LS-GAN). Specifically, it trains a loss function to distinguish between real and fake samples by designated margins, while learning a generator alternately to produce realistic samples by minimizing their losses. The LS-GAN further regularizes its loss function with a Lipschitz regularity condition on the density of real data, yielding a regularized model that can better generalize to produce new data from a reasonable number of training examples than the classic GAN. We will further present a Generalized LS-GAN (GLS-GAN) and show it contains a large family of regularized GAN models, including both LS-GAN and Wasserstein GAN, as its special cases. Compared with the other GAN models, we will conduct experiments to show both LS-GAN and GLS-GAN exhibit competitive ability in generating new images in terms of the Minimum Reconstruction Error (MRE) assessed on a separate test set. We further extend the LS-GAN to a conditional form for supervised and semi-supervised learning problems, and demonstrate its outstanding performance on image classification tasks.
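
As a rough illustration of the alternating scheme described above (a loss function trained to separate real and generated samples by data-dependent margins, and a generator trained to minimize the loss of its samples), the following PyTorch-style sketch performs one update of each. It is an illustration only, not the authors' implementation: loss_net (\(L_\theta \)), gen_net (\(G_\phi \)), the optimizers, the \(\ell _1\) margin, and lambda_ (\(\lambda \)) are placeholder choices.

import torch
import torch.nn.functional as F

def lsgan_step(x_real, z, loss_net, gen_net, opt_loss, opt_gen, lambda_=1.0):
    # Update the loss function L_theta: minimize E[L(x)] + lambda * E[(Delta + L(x) - L(z_G))_+]
    x_fake = gen_net(z).detach()
    delta = (x_real - x_fake).flatten(1).abs().sum(dim=1)        # placeholder margin Delta(x, z_G)
    hinge = F.relu(delta + loss_net(x_real).view(-1) - loss_net(x_fake).view(-1))
    s_obj = loss_net(x_real).mean() + lambda_ * hinge.mean()
    opt_loss.zero_grad(); s_obj.backward(); opt_loss.step()

    # Update the generator G_phi: minimize E[L_theta(G_phi(z))]
    t_obj = loss_net(gen_net(z)).mean()
    opt_gen.zero_grad(); t_obj.backward(); opt_gen.step()
    return s_obj.item(), t_obj.item()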


Notes

  1. The minus sign appears because U is maximized over \(f_w\) in the WGAN; by contrast, in the GLS-GAN, \(S_C\) is minimized over \(L_\theta \).

References

  • Arjovsky, M., & Bottou, L. (2017). Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862

  • Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875

  • Arora, S., Ge, R., Liang, Y., Ma, T., & Zhang, Y. (2017). Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573

  • Border, K. C. (1989). Fixed point theorems with applications to economics and game theory. Cambridge: Cambridge University Press.


  • Carando, D., Fraiman, R., & Groisman, P. (2009). Nonparametric likelihood based estimation for a multivariate Lipschitz density. Journal of Multivariate Analysis, 100(5), 981–992.


  • Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems (pp. 2172–2180).

  • Coates, A., & Ng, A. Y. (2011). Selecting receptive fields in deep networks. In Advances in neural information processing systems (pp. 2528–2536).

  • Denton, E. L., Chintala, S., Fergus, R., et al. (2015). Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in neural information processing systems (pp. 1486–1494).

  • Dosovitskiy, A., Fischer, P., Springenberg, J. T., Riedmiller, M., & Brox, T. (2015). Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38, 1734–1747.


  • Dosovitskiy, A., Tobias Springenberg, J., & Brox, T. (2015). Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1538–1546).

  • Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., & Courville, A. (2016). Adversarially learned inference. arXiv preprint arXiv:1606.00704

  • Edraki, M., & Qi, G. J. (2018). Generalized loss-sensitive adversarial learning with manifold margins. In Proceedings of the European conference on computer vision (ECCV) (pp. 87–102).

  • Gatys, L. A., Ecker, A. S., & Bethge, M. (2015). A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672–2680).

  • Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., & Wierstra, D. (2015). Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623

  • Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028

  • Hui, K. Y. (2013). Direct modeling of complex invariances for visual object features. In International conference on machine learning (pp. 352–360).

  • Im, D. J., Kim, C. D., Jiang, H., & Memisevic, R. (2016). Generating images with recurrent adversarial networks. arXiv preprint arXiv:1602.05110

  • Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

  • Kingma, D. P., Mohamed, S., Rezende, D. J., & Welling, M. (2014). Semi-supervised learning with deep generative models. In Advances in neural information processing systems (pp. 3581–3589).

  • Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114

  • Krizhevsky, A. (2009). Learning multiple layers of features from tiny images.

  • Laine, S., & Aila, T. (2016). Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242

  • Maaløe, L., Sønderby, C. K., Sønderby, S. K., & Winther, O. (2016). Auxiliary deep generative models. arXiv preprint arXiv:1602.05473

  • Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784

  • Miyato, T., Maeda, S.-i., Koyama, M., Nakae, K., & Ishii, S. (2015). Distributional smoothing by virtual adversarial examples. arXiv preprint arXiv:1507.00677

  • Nagarajan, V., & Kolter, J. Z. (2017). Gradient descent gan optimization is locally stable. In Advances in neural information processing systems (pp. 5585–5595).

  • Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning 2011. http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf.

  • Nowozin, S., Cseke, B., & Tomioka, R. (2016). f-GAN: Training generative neural samplers using variational divergence minimization. arXiv preprint arXiv:1606.00709

  • Odena, A. (2016). Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583

  • Qi, G. J., Zhang, L., Hu, H., Edraki, M., Wang, J., & Hua, X. S. (2018). Global versus localized generative adversarial nets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1517–1525).

  • Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434

  • Rasmus, A., Berglund, M., Honkala, M., Valpola, H., & Raiko, T. (2015). Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems (pp. 3546–3554).

  • Sajjadi, M., Javanmardi, M., & Tasdizen, T. (2016). Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in neural information processing systems (pp. 1163–1171).

  • Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training GANs. In Advances in neural information processing systems (pp. 2226–2234).

  • Springenberg, J. T. (2015). Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390

  • Tarvainen, A., & Valpola, H. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems (pp. 1195–1204).

  • Valpola, H. (2015). From neural PCA to deep unsupervised learning. In Advances in independent component analysis and learning machines (pp. 143–171). Academic Press.

  • Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341–1390.


  • Yadav, A., Shah, S., Xu, Z., Jacobs, D., & Goldstein, T. (2017). Stabilizing adversarial nets with prediction methods. arXiv preprint arXiv:1705.07364

  • Zhao, J., Mathieu, M., Goroshin, R., & LeCun, Y. (2015). Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351

  • Zhao, J., Mathieu, M., & LeCun, Y. (2016). Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126

  • Zhao, Y., Jin, Z., Qi, G. J., Lu, H., & Hua, X. S. (2018). An adversarial approach to hard triplet generation. In Proceedings of the European conference on computer vision (ECCV) (pp. 501–517).


Author information


Corresponding author

Correspondence to Guo-Jun Qi.

Additional information

Communicated by Patrick Perez.


Appendices

Proof of Lemma 1

To prove Lemma 1, we need the following lemma.

Lemma 5

For two probability densities \(p({\mathbf {x}})\) and \(q({\mathbf {x}})\), if \(p({\mathbf {x}})\ge \eta q({\mathbf {x}})\) almost everywhere, we have

$$\begin{aligned} \int _{\mathbf {x}} |p({\mathbf {x}})-q({\mathbf {x}})|d{\mathbf {x}} \le \dfrac{2(1-\eta )}{\eta } \end{aligned}$$

for \(\eta \in (0,1]\).

Proof

We have the following equalities and inequalities:

$$\begin{aligned} \begin{aligned} \int _{{\mathbf {x}}} |p({\mathbf {x}})-q({\mathbf {x}})|d{\mathbf {x}}&=\int _{\mathbf {x}} \mathbb {1}_{\left[ p({\mathbf {x}})\ge q({\mathbf {x}})\right] } (p({\mathbf {x}})-q({\mathbf {x}}))d{\mathbf {x}}\\&\quad +\,\int _{\mathbf {x}} \mathbb {1}_{\left[ p({\mathbf {x}})<q({\mathbf {x}})\right] } (q({\mathbf {x}})-p({\mathbf {x}}))d{\mathbf {x}}\\&=\int _{\mathbf {x}} (1-\mathbb {1}_{\left[ p({\mathbf {x}})< q({\mathbf {x}})\right] }) (p({\mathbf {x}})-q({\mathbf {x}}))d{\mathbf {x}}\\&\quad +\,\int _{\mathbf {x}} \mathbb {1}_{\left[ p({\mathbf {x}})<q({\mathbf {x}})\right] } (q({\mathbf {x}})-p({\mathbf {x}}))d{\mathbf {x}}\\&=2\int _{\mathbf {x}} \mathbb {1}_{\left[ p({\mathbf {x}})< q({\mathbf {x}})\right] } (q({\mathbf {x}})-p({\mathbf {x}}))d{\mathbf {x}}\\&\le 2\left( \dfrac{1}{\eta }-1\right) \int _{\mathbf {x}} \mathbb {1}_{\left[ p({\mathbf {x}})< q({\mathbf {x}})\right] } p({\mathbf {x}}) d{\mathbf {x}}\\&\le \dfrac{2(1-\eta )}{\eta } \end{aligned} \end{aligned}$$
(16)

This completes the proof. \(\square \)
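
As a quick numerical sanity check of Lemma 5 (an illustration only, not part of the proof), note that any density of the form \(p=\eta q+(1-\eta )r\) with a density r satisfies \(p\ge \eta q\); the following sketch verifies the bound on discrete distributions, where the distributions and seed are arbitrary placeholders.

import numpy as np

rng = np.random.default_rng(0)
n, eta = 1000, 0.3
q = rng.random(n); q /= q.sum()                # a discrete "density" q
r = rng.random(n); r /= r.sum()                # an arbitrary density r
p = eta * q + (1.0 - eta) * r                  # guarantees p >= eta * q pointwise
tv = np.abs(p - q).sum()                       # discrete analogue of the integral in Lemma 5
assert tv <= 2 * (1.0 - eta) / eta
print(tv, 2 * (1.0 - eta) / eta)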

Now we can prove Lemma 1.

Proof

Suppose \((\theta ^*,\phi ^*)\) is a Nash equilibrium for the problems (4) and (5).

Then, on one hand, we have

$$\begin{aligned} \begin{aligned} S(\theta ^*,\phi ^*)&\ge \mathop {\mathbb {E}}\limits _{{\mathbf {x}}\sim P_{data}({\mathbf {x}})} L_{\theta ^*}({\mathbf {x}})\\&\quad +\, \lambda \mathop {\mathbb {E}}\limits _{\begin{array}{c} {\mathbf {x}}\sim P_{data}({\mathbf {x}})\\ {\mathbf {z}}_G\sim P_{G^*}({\mathbf {z}}_G) \end{array}} \big (\varDelta ({\mathbf {x}}, {\mathbf {z}}_G) +L_{\theta ^*}({\mathbf {x}})-L_{\theta ^*}({\mathbf {z}}_G)\big )\\&=\int _{\mathbf {x}} P_{data}({\mathbf {x}}) L_{\theta ^*}({\mathbf {x}}) d{\mathbf {x}} + \lambda \mathop {\mathbb {E}}\limits _{\begin{array}{c} {\mathbf {x}}\sim P_{data}({\mathbf {x}}) \\ {\mathbf {z}}_G\sim P_{G^*}({\mathbf {z}}_G) \end{array}} \varDelta ({\mathbf {x}}, {\mathbf {z}}_G) \\&\quad +\,\lambda \int _{\mathbf {x}} P_{data}({\mathbf {x}}) L_{\theta ^*}({\mathbf {x}}) d{\mathbf {x}} \\&\quad -\, \lambda \int _{{\mathbf {z}}_G} P_{G^*}({\mathbf {z}}_G) L_{\theta ^*}({\mathbf {z}}_G) d {\mathbf {z}}_G\\&=\int _{\mathbf {x}} \big ((1+\lambda )P_{data}({\mathbf {x}})- \lambda P_{G^*}({\mathbf {x}})\big )L_{\theta ^*}({\mathbf {x}}) d{\mathbf {x}} \\&\quad +\,\lambda \mathop {\mathbb {E}}\limits _{\begin{array}{c} {\mathbf {x}}\sim P_{data}({\mathbf {x}}) \\ {\mathbf {z}}_G\sim P_{G^*}({\mathbf {z}}_G) \end{array}}\varDelta ({\mathbf {x}}, {\mathbf {z}}_G) \end{aligned} \end{aligned}$$
(17)

where the first inequality follows from \((a)_+\ge a\).

We also have \(T(\theta ^*,\phi ^*)\le T(\theta ^*,\phi )\) for any \(G_\phi \) as \(\phi ^*\) minimizes \(T(\theta ^*,\phi )\). In particular, we can replace \(P_G({\mathbf {x}})\) in \(T(\theta ^*,\phi )\) with \(P_{data}({\mathbf {x}})\), which yields

$$\begin{aligned} \mathop \int \limits _{{\mathbf {x}}}L_{\theta ^*}({\mathbf {x}})P_{G^*}({\mathbf {x}})d{\mathbf {x}} \le \mathop \int \limits _{{\mathbf {x}}}L_{\theta ^*}({\mathbf {x}})P_{data}({\mathbf {x}})d{\mathbf {x}}. \end{aligned}$$

Applying this inequality to (17) leads to

$$\begin{aligned} S(\theta ^*,\phi ^*)\ge & {} \int _{\mathbf {x}} P_{data}({\mathbf {x}})L_{\theta ^*}({\mathbf {x}}) d{\mathbf {x}} +\lambda \mathop {\mathbb {E}}\limits _{\begin{array}{c} {\mathbf {x}}\sim P_{data}({\mathbf {x}}) \\ {\mathbf {z}}_G\sim P_{G^*}({\mathbf {z}}_G) \end{array}}\varDelta ({\mathbf {x}}, {\mathbf {z}}_G)\nonumber \\\ge & {} \lambda \mathop {\mathbb {E}}\limits _{\begin{array}{c} {\mathbf {x}}\sim P_{data}({\mathbf {x}}) \\ {\mathbf {z}}_G\sim P_{G^*}({\mathbf {z}}_G) \end{array}}\varDelta ({\mathbf {x}}, {\mathbf {z}}_G) \end{aligned}$$
(18)

where the last inequality follows as \(L_\theta ({\mathbf {x}})\) is nonnegative.

On the other hand, consider a particular loss function

$$\begin{aligned} L_{\theta _0}({\mathbf {x}})=\alpha \big (-(1+\lambda ) P_{data}({\mathbf {x}})+\lambda P_{G^*}({\mathbf {x}})\big )_+ \end{aligned}$$
(19)

When \(\alpha \) is a sufficiently small positive coefficient, \(L_{\theta _0}({\mathbf {x}})\) is a nonexpansive function (i.e., a function with Lipschitz constant no larger than 1). This follows from the assumption that \(P_{data}\) and \(P_{G^*}\) are Lipschitz. In this case, we have

$$\begin{aligned} \varDelta ({\mathbf {x}}, {\mathbf {z}}_G) +L_{\theta _0}({\mathbf {x}})-L_{\theta _0}({\mathbf {z}}_G) \ge 0 \end{aligned}$$
(20)
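
To make the choice of \(\alpha \) quantitative, suppose \(P_{data}\) and \(P_{G^*}\) are \(\kappa _p\)- and \(\kappa _g\)-Lipschitz with respect to \(\varDelta \) (these constants are introduced here only for this remark). Since \((\cdot )_+\) is 1-Lipschitz,

$$\begin{aligned} |L_{\theta _0}({\mathbf {x}})-L_{\theta _0}({\mathbf {y}})| \le \alpha \big ((1+\lambda )\kappa _p+\lambda \kappa _g\big )\varDelta ({\mathbf {x}},{\mathbf {y}}), \end{aligned}$$

so any \(\alpha \le 1/\big ((1+\lambda )\kappa _p+\lambda \kappa _g\big )\) makes \(L_{\theta _0}\) nonexpansive, and (20) follows.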

By placing this \(L_{\theta _0}({\mathbf {x}})\) into \(S(\theta ,\phi ^*)\), one can show that

$$\begin{aligned} \begin{aligned} S(\theta _0,\phi ^*)&=\int _{\mathbf {x}} \big ((1+\lambda )P_{data}({\mathbf {x}})- \lambda P_{G^*}({\mathbf {x}})\big )L_{\theta _0}({\mathbf {x}}) d{\mathbf {x}}\\&\quad +\,\lambda \mathop {\mathbb {E}}\limits _{\begin{array}{c} {\mathbf {x}}\sim P_{data}({\mathbf {x}}) \\ {\mathbf {z}}_G\sim P_{G^*}({\mathbf {z}}_G) \end{array}}\varDelta ({\mathbf {x}}, {\mathbf {z}}_G)\\&=-\alpha \mathop \int \limits _{{\mathbf {x}}} \big (-(1+\lambda )P_{data}({\mathbf {x}})+\lambda P_{G^*}({\mathbf {x}})\big )_+^2 d{\mathbf {x}}\\&\quad +\,\lambda \mathop {\mathbb {E}}\limits _{\begin{array}{c} {\mathbf {x}}\sim P_{data}({\mathbf {x}}) \\ {\mathbf {z}}_G\sim P_{G^*}({\mathbf {z}}_G) \end{array}}\varDelta ({\mathbf {x}}, {\mathbf {z}}_G)\\ \end{aligned} \end{aligned}$$

where the first equality uses Eq. (20) to drop the \((\cdot )_+\) operator in the hinge term, and the second equality is obtained by substituting the definition of \(L_{\theta _0}({\mathbf {x}})\) from Eq. (19).

Assuming that \((1+\lambda )P_{data}({\mathbf {x}})- \lambda P_{G^*}({\mathbf {x}})<0\) on a set of nonzero measure, the above equation would be strictly upper bounded by \(\lambda \mathop {\mathbb {E}}\limits _{\begin{array}{c} {\mathbf {x}}\sim P_{data}({\mathbf {x}}) \\ {\mathbf {z}}_G\sim P_{G^*}({\mathbf {z}}_G) \end{array}}\varDelta ({\mathbf {x}}, {\mathbf {z}}_G)\) and we have

$$\begin{aligned} S(\theta ^*,\phi ^*)\le S(\theta _0,\phi ^*)<\lambda \mathop {\mathbb {E}}\limits _{\begin{array}{c} {\mathbf {x}}\sim P_{data}({\mathbf {x}}) \\ {\mathbf {z}}_G\sim P_{G^*}({\mathbf {z}}_G) \end{array}}\varDelta ({\mathbf {x}}, {\mathbf {z}}_G) \end{aligned}$$
(21)

This results in a contradiction with Eq. (18). Therefore, we must have

$$\begin{aligned} P_{data}({\mathbf {x}}) \ge \dfrac{\lambda }{1+\lambda }P_{G^*}({\mathbf {x}}) \end{aligned}$$
(22)

almost everywhere. By Lemma 5, we have

$$\begin{aligned} \int _{{\mathbf {x}}}|P_{data}({\mathbf {x}})-P_{G^*}({\mathbf {x}})|d{\mathbf {x}} \le \dfrac{2}{\lambda } \end{aligned}$$

Letting \(\lambda \rightarrow +\infty \), this leads to

$$\begin{aligned} \int _{{\mathbf {x}}}|P_{data}({\mathbf {x}})-P_{G^*}({\mathbf {x}})|d{\mathbf {x}} \rightarrow 0 \end{aligned}$$

This proves that \(P_{G^*}({\mathbf {x}})\) converges to \(P_{data}({\mathbf {x}})\) as \(\lambda \rightarrow +\infty \). \(\square \)

Proof of Lemma 3

Proof

Suppose a pair of \((f_w^*, g_\phi ^*)\) jointly solve the WGAN problem.

Then, on one hand, we have

$$\begin{aligned} U(f_w^*,g_\phi ^*)=\int _x f_w^*({\mathbf {x}}) P_{data}({\mathbf {x}})d{\mathbf {x}} - \int _x f_w^* ({\mathbf {x}})P_{g_\phi ^*}({\mathbf {x}})d{\mathbf {x}}\le 0 \end{aligned}$$
(23)

where the inequality follows from \(V(f_w^*,g_\phi ^*)\ge V(f_w^*,g_\phi )\) by replacing \(P_{g_\phi }({\mathbf {x}})\) with \(P_{data}({\mathbf {x}})\).

Consider a particular \(f_w({\mathbf {x}})\triangleq \alpha (P_{data}({\mathbf {x}})-P_{g_\phi ^*}({\mathbf {x}}))_+\). Since \(P_{data}({\mathbf {x}})\) and \(P_{g_\phi ^*}\) are Lipschitz by assumption, when \(\alpha \) is sufficiently small, it can be shown that \(f_w({\mathbf {x}})\in {\mathcal {L}}_1\).

Substituting this \(f_w\) into \(U(f_w, g_\phi ^*)\), we get

$$\begin{aligned} U(f_w,g_\phi ^*)=\alpha \int _{\mathbf {x}} (P_{data}({\mathbf {x}})-P_{g_\phi ^*}({\mathbf {x}}))_+^2 d{\mathbf {x}} \end{aligned}$$
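
To see this, note that

$$\begin{aligned} U(f_w,g_\phi ^*)=\int _{\mathbf {x}} f_w({\mathbf {x}})\big (P_{data}({\mathbf {x}})-P_{g_\phi ^*}({\mathbf {x}})\big )d{\mathbf {x}} =\alpha \int _{\mathbf {x}} \big (P_{data}({\mathbf {x}})-P_{g_\phi ^*}({\mathbf {x}})\big )_+\big (P_{data}({\mathbf {x}})-P_{g_\phi ^*}({\mathbf {x}})\big )d{\mathbf {x}}, \end{aligned}$$

and \((a)_+\, a=(a)_+^2\) for any real a.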

If \(P_{data}({\mathbf {x}}) > P_{g_\phi ^*}({\mathbf {x}})\) on a set of nonzero measure, then since \(f_w^*\) maximizes \(U(\cdot ,g_\phi ^*)\), we would have

$$\begin{aligned} U(f_w^*,g_\phi ^*)\ge U(f_w,g_\phi ^*) > 0 \end{aligned}$$

This leads to a contradiction with (23), so we must have

$$\begin{aligned} P_{data}({\mathbf {x}}) \le P_{g_\phi ^*}({\mathbf {x}}) \end{aligned}$$

almost everywhere.

Hence, by Lemma 5 (applied with \(p=P_{g_\phi ^*}\), \(q=P_{data}\) and \(\eta =1\)), we obtain

$$\begin{aligned} \int _{{\mathbf {x}}}|P_{data}({\mathbf {x}})-P_{g_\phi ^*}({\mathbf {x}})|d{\mathbf {x}}=0. \end{aligned}$$

\(\square \)

Proof of Theorem 2

For simplicity, throughout this section we disregard the first loss minimization term in \(S(\theta ,\phi ^*)\) and \(S_m(\theta ,\phi ^*)\), since its role vanishes as \(\lambda \) goes to \(+\infty \). Even if this term is kept, the following proof still holds with only minor changes.

To prove Theorem 2, we need the following lemma.

Lemma 6

For all loss functions \(L_\theta \), with probability at least \(1-\eta \), we have

$$\begin{aligned} |S_m(\theta ,\phi ^*)-S(\theta ,\phi ^*)|\le \varepsilon \end{aligned}$$

when the number of samples

$$\begin{aligned} m\ge \dfrac{C B_\varDelta ^2(\kappa +1)^2 \big (N \log (\kappa _L N/\varepsilon )+\log (1/\eta )\big )}{\varepsilon ^2} \end{aligned}$$

with a sufficiently large constant C.

The proof of this lemma applies McDiarmid's inequality and the fact that \((\cdot )_+\) is 1-Lipschitz to bound the difference \(|S_m(\theta ,\phi ^*)-S(\theta ,\phi ^*)|\) for a single loss function. Then, to obtain a union bound over all loss functions, a standard \(\varepsilon \)-net (Arora et al. 2017) is constructed whose finitely many points are dense enough to cover the parameter space of the loss functions. The details are given below.

Proof

For a loss function \(L_\theta \), we compute \(S_m(\theta ,\phi ^*)\) over a set of m samples \(\{({\mathbf {x}}_i, {{\mathbf {z}}_G}_i)|1\le i\le m\}\) drawn from \(P_{data}\) and \(P_{G^*}\) respectively.

To apply McDiarmid's inequality, we need to bound the change in this quantity when a single sample is changed. Denote by \(S^i_m(\theta ,\phi ^*)\) its value when the ith sample is replaced with \({\mathbf {x}}'_i\) and \({{\mathbf {z}}'_G}_i\). Then we have

$$\begin{aligned} \begin{aligned} |S_m(\theta ,\phi ^*)-S^i_m(\theta ,\phi ^*)|&=\dfrac{1}{m}|\big (\varDelta ({\mathbf {x}}_i, {{\mathbf {z}}_G}_i)+L_\theta ({\mathbf {x}}_i)-L_\theta ({{\mathbf {z}}_G}_i)\big )_+\\&\quad -\,\big (\varDelta ({\mathbf {x}}'_i, {{\mathbf {z}}'_G}_i)+L_\theta ({\mathbf {x}}'_i)-L_\theta ({{\mathbf {z}}'_G}_i)\big )_+|\\&\le \dfrac{1}{m}|\varDelta ({\mathbf {x}}_i, {{\mathbf {z}}_G}_i)-\varDelta ({\mathbf {x}}'_i, {{\mathbf {z}}'_G}_i)|\\&\quad +\,\dfrac{1}{m}|L_\theta ({\mathbf {x}}_i)-L_\theta ({\mathbf {x}}'_i)|\\&\quad +\,\dfrac{1}{m}|L_\theta ({{\mathbf {z}}_G}_i)-L_\theta ({{\mathbf {z}}'_G}_i)|\\&\le \dfrac{1}{m}\big (2B_\varDelta +\kappa \varDelta ({\mathbf {x}}_i,{\mathbf {x}}'_i) + \kappa \varDelta ({{\mathbf {z}}_G}_i,{{\mathbf {z}}'_G}_i)\big )\\&\le \dfrac{2}{m}(1+\kappa )B_\varDelta \end{aligned} \end{aligned}$$

where the first inequality uses the fact that \((\cdot )_+\) is 1-Lipschitz, and the second follows because \(\varDelta ({\mathbf {x}}, {\mathbf {z}}_G)\) is bounded by \(B_\varDelta \) and \(L_\theta ({\mathbf {x}})\) is \(\kappa \)-Lipschitz in \({\mathbf {x}}\).

Now we can apply McDiarmid's inequality. Noting that

$$\begin{aligned} S(\theta ,\phi ^*)=\mathop {\mathbb {E}}\limits _{\begin{array}{c} {\mathbf {x}}_i\sim P_{data}\\ {{\mathbf {z}}_G}_i\sim P_G\\ i=1,\ldots ,m \end{array}}S_m(\theta ,\phi ^*), \end{aligned}$$

we have

$$\begin{aligned} P(|S_m(\theta ,\phi ^*)-S(\theta ,\phi ^*)|\ge \varepsilon /2)\le 2\exp \left( -\dfrac{\varepsilon ^2 m}{8(1+\kappa )^2B_\varDelta ^2}\right) \end{aligned}$$
(24)
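
Here, McDiarmid's inequality states that if replacing any one of the m independent sample pairs \(({\mathbf {x}}_i, {{\mathbf {z}}_G}_i)\) changes \(S_m(\theta ,\phi ^*)\) by at most \(c_i\), then \(P\big (|S_m(\theta ,\phi ^*)-\mathbb {E}S_m(\theta ,\phi ^*)|\ge t\big )\le 2\exp \big (-2t^2/\sum _i c_i^2\big )\); substituting \(c_i\le \frac{2}{m}(1+\kappa )B_\varDelta \) and \(t=\varepsilon /2\) recovers (24).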

The above bound applies to a single loss function \(L_\theta \). To get a union bound, we consider an \(\varepsilon /(8\kappa _L)\)-net \({\mathcal {N}}\), i.e., for any \(L_\theta \) there is a \(\theta '\in {\mathcal {N}}\) such that \(\Vert \theta -\theta '\Vert \le \varepsilon /(8\kappa _L)\). This standard net can be constructed with finitely many loss functions such that \(\log |{\mathcal {N}}|\le O(N\log (\kappa _L N/\varepsilon ))\), where N is the number of parameters in a loss function. Note that we implicitly assume the parameter space of the loss function is bounded, so that such a finite net exists.

Therefore, taking a union bound over \({\mathcal {N}}\), with probability at least \(1-\eta \) we have, for all \(\theta \in {\mathcal {N}}\),

$$\begin{aligned} |S_m(\theta ,\phi ^*)-S(\theta ,\phi ^*)|\le \dfrac{\varepsilon }{2} \end{aligned}$$

when \(m\ge \dfrac{C B_\varDelta ^2(\kappa +1)^2 \big (N \log (\kappa _L N/\varepsilon )+\log (1/\eta )\big )}{\varepsilon ^2}\).
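
This threshold on m can be read off by requiring the failure probability in (24), summed over the net, to be at most \(\eta \):

$$\begin{aligned} 2|{\mathcal {N}}|\exp \left( -\dfrac{\varepsilon ^2 m}{8(1+\kappa )^2B_\varDelta ^2}\right) \le \eta \;\Longleftrightarrow \; m\ge \dfrac{8(1+\kappa )^2B_\varDelta ^2\big (\log (2|{\mathcal {N}}|)+\log (1/\eta )\big )}{\varepsilon ^2}, \end{aligned}$$

and substituting \(\log |{\mathcal {N}}|\le O(N\log (\kappa _L N/\varepsilon ))\) gives the stated lower bound with a suitable constant C.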

The last step is to extend the bound from \({\mathcal {N}}\) to all loss functions. To this end, we consider the following inequality

$$\begin{aligned} \begin{aligned}&|S(\theta ,\phi ^*)-S(\theta ',\phi ^*)|\\&\quad =|\mathop {\mathbb {E}}\limits _{\begin{array}{c} {\mathbf {x}}\sim P_{data}\\ {\mathbf {z}}_G\sim P_G \end{array}} \big (\varDelta ({\mathbf {x}}, {\mathbf {z}}_G)+ L_{\theta }({\mathbf {x}})- L_{\theta }({\mathbf {z}}_G)\big )_+\\&\qquad -\,\mathop {\mathbb {E}}\limits _{\begin{array}{c} {\mathbf {x}}\sim P_{data}\\ {\mathbf {z}}_G\sim P_G \end{array}} \big (\varDelta ({\mathbf {x}}, {\mathbf {z}}_G)+ L_{\theta '}({\mathbf {x}})- L_{\theta '}({\mathbf {z}}_G)\big )_+|\\&\quad \le \mathop {\mathbb {E}}\limits _{\begin{array}{c} {\mathbf {x}}\sim P_{data} \end{array}}|L_{\theta }({\mathbf {x}})-L_{\theta '}({\mathbf {x}})|\\&\qquad +\,\mathop {\mathbb {E}}\limits _{\begin{array}{c} {\mathbf {z}}_G\sim P_G \end{array}}|L_{\theta }({\mathbf {z}}_G)-L_{\theta '}({\mathbf {z}}_G)|\\&\quad \le 2\kappa _L\Vert \theta -\theta '\Vert \end{aligned} \end{aligned}$$

where the first inequality again uses the fact that \((\cdot )_+\) is 1-Lipschitz, and the second follows because \(L_\theta \) is \(\kappa _L\)-Lipschitz in \(\theta \). Similarly, we can also show that

$$\begin{aligned} |S_m(\theta ,\phi ^*)-S_m(\theta ',\phi ^*)| \le 2\kappa _L\Vert \theta -\theta '\Vert \end{aligned}$$

Now we can derive the union bound over all loss functions. For any \(\theta \), by construction we can find a \(\theta '\in {\mathcal {N}}\) such that \(\Vert \theta -\theta '\Vert \le \varepsilon /(8\kappa _L)\). Then, with probability at least \(1-\eta \), we have

$$\begin{aligned} \begin{aligned}&|S_m(\theta ,\phi ^*)-S(\theta ,\phi ^*)|\\&\quad \le |S_m(\theta ,\phi ^*)-S_m(\theta ',\phi ^*)| +|S_m(\theta ',\phi ^*)-S(\theta ',\phi ^*)|\\&\qquad +\,|S(\theta ',\phi ^*)-S(\theta ,\phi ^*)|\\&\quad \le 2\kappa _L\Vert \theta -\theta '\Vert +\dfrac{\varepsilon }{2}+2\kappa _L\Vert \theta -\theta '\Vert \\&\quad \le \dfrac{\varepsilon }{4}+\dfrac{\varepsilon }{2}+\dfrac{\varepsilon }{4}=\varepsilon \end{aligned} \end{aligned}$$

This proves the lemma. \(\square \)

Now we can prove Theorem 2.

Proof

First let us bound \(S_m-S\). Consider \(L_{\theta ^*}\) that minimizes \(S(\theta ,\phi ^*)\). Then with probability \(1-\eta \), when \(m\ge \dfrac{C N B_\varDelta ^2(\kappa +1)^2\log (\kappa _L N/\eta \varepsilon )}{\varepsilon ^2}\), we have

$$\begin{aligned} S_m - S \le S_m(\theta ^*,\phi ^*) - S(\theta ^*,\phi ^*)\le \varepsilon \end{aligned}$$

where the first inequality follows from the inequality \(S_m\le S_m(\theta ^*,\phi ^*)\) as \(\theta ^*\) may not minimize \(S_m\), and the second inequality is a direct application of the above lemma. Similarly, for the other direction, let \(\theta ^*_m\) be a minimizer of \(S_m(\theta ,\phi ^*)\); then, with probability \(1-\eta \), we have

$$\begin{aligned} S - S_m \le S(\theta ^*_m,\phi ^*)- S_m(\theta ^*_m,\phi ^*) \le \varepsilon \end{aligned}$$

Together, the two directions give \(|S_m - S|\le \varepsilon \).

Finally, a more rigorous discussion of generalizability should account for the fact that \(G_{\phi ^*}\) is updated iteratively, so there is a sequence of generators \(G_{\phi ^*}^{(t)}\), \(t=1,\ldots ,T\), over T iterations. Taking a union bound over these generators in (24) makes the required number of training examples m become

$$\begin{aligned} m\ge \dfrac{C B_\varDelta ^2(\kappa +1)^2 \big (N \log (\kappa _L N/\varepsilon )+\log (T/\eta )\big )}{\varepsilon ^2}. \end{aligned}$$

However, the iteration number T is usually much smaller than the model size N (often hundreds of thousands of parameters), so this factor does not materially affect the above lower bound on m. \(\square \)

Proof of Theorem 4 and Corollary 2

We prove Theorem 4 as follows.

Proof

First, the existence of a minimizer follows from the fact that the functions in \({\mathcal {F}}_\kappa \) form a compact set, and the objective function is convex.

To prove the minimizer has the two forms in (10), for each \(L_\theta \in {\mathcal {F}}_\kappa \), let us consider

$$\begin{aligned} \widehat{L}_{\theta }({\mathbf {x}})= & {} \max _{1\le i\le n+m}\big \{\big (L_\theta ({\mathbf {x}}^{(i)})-\kappa \varDelta ({\mathbf {x}},{\mathbf {x}}^{(i)})\big )_+\big \},\\ \widetilde{L}_{\theta }({\mathbf {x}})= & {} \min _{1\le i\le n+m}\big \{L_\theta ({\mathbf {x}}^{(i)})+\kappa \varDelta ({\mathbf {x}},{\mathbf {x}}^{(i)})\} \end{aligned}$$

It is not hard to verify that \(\widehat{L}_\theta ({\mathbf {x}}^{(i)})=L_\theta ({\mathbf {x}}^{(i)})\) and \(\widetilde{L}_\theta ({\mathbf {x}}^{(i)})=L_\theta ({\mathbf {x}}^{(i)})\) for \(1\le i\le n+m\).

Indeed, by noting that \(L_{\theta }\) has its Lipschitz constant bounded by \(\kappa \), we have \(L_{\theta }({\mathbf {x}}^{(j)})-L_{\theta }({\mathbf {x}}^{(i)})\le \kappa \varDelta ({\mathbf {x}}^{(i)},{\mathbf {x}}^{(j)})\), and thus

$$\begin{aligned} L_{\theta }({\mathbf {x}}^{(j)})-\kappa \varDelta ({\mathbf {x}}^{(i)},{\mathbf {x}}^{(j)})\le L_{\theta }({\mathbf {x}}^{(i)}) \end{aligned}$$

Because \(L_{\theta }({\mathbf {x}}^{(i)})\ge 0\) by the assumption (i.e., it is lower bounded by zero), it can be shown that for all j

$$\begin{aligned} \big (L_{\theta }({\mathbf {x}}^{(j)})-\kappa \varDelta ({\mathbf {x}}^{(i)},{\mathbf {x}}^{(j)})\big )_+\le L_{\theta }({\mathbf {x}}^{(i)}). \end{aligned}$$

Hence, by the definition of \(\widehat{L}_{\theta }({\mathbf {x}})\) and taking the maximum over j on the left hand side, we have

$$\begin{aligned} \widehat{L}_{\theta }({\mathbf {x}}^{(i)})\le L_{\theta }({\mathbf {x}}^{(i)}) \end{aligned}$$

On the other hand, we have

$$\begin{aligned} \widehat{L}_{\theta }({\mathbf {x}}^{(i)}) \ge L_{\theta }({\mathbf {x}}^{(i)}) \end{aligned}$$

because \(\widehat{L}_{\theta }({\mathbf {x}}) \ge \big (L_\theta ({\mathbf {x}}^{(i)})-\kappa \varDelta ({\mathbf {x}},{\mathbf {x}}^{(i)})\big )_+\) for any \({\mathbf {x}}\), and it is true in particular for \({\mathbf {x}}={\mathbf {x}}^{(i)}\). This shows \(\widehat{L}_{\theta }({\mathbf {x}}^{(i)}) = L_{\theta }({\mathbf {x}}^{(i)})\).

Similarly, one can prove \(\widetilde{L}_\theta ({\mathbf {x}}^{(i)})=L_\theta ({\mathbf {x}}^{(i)})\). To show this, we have

$$\begin{aligned} L_\theta ({\mathbf {x}}^{(j)})+\kappa \varDelta ({\mathbf {x}}^{(i)},{\mathbf {x}}^{(j)})\ge L_\theta ({\mathbf {x}}^{(i)}) \end{aligned}$$

by the Lipschitz continuity of \(L_\theta \). By taking the minimum over j, we have

$$\begin{aligned} \widetilde{L}_\theta ({\mathbf {x}}^{(i)}) \ge L_\theta ({\mathbf {x}}^{(i)}). \end{aligned}$$

On the other hand, we have \(\widetilde{L}_\theta ({\mathbf {x}}^{(i)})\le L_\theta ({\mathbf {x}}^{(i)})\) by the definition of \(\widetilde{L}_\theta ({\mathbf {x}}^{(i)})\). Combining these two inequalities shows that \(\widetilde{L}_\theta ({\mathbf {x}}^{(i)})=L_\theta ({\mathbf {x}}^{(i)})\).

Now we can show that, for any function \(L_\theta \in {\mathcal {F}}_\kappa \), the corresponding \(\widehat{L}_\theta \) and \(\widetilde{L}_\theta \) both attain the same value of \(S_{n,m}\) as \(L_\theta \), since \(S_{n,m}\) only depends on the values of \(L_\theta \) at the data points \(\{{\mathbf {x}}^{(i)}\}\). In particular, any global minimum of \(S_{n,m}\) over \({\mathcal {F}}_\kappa \) is also attained by functions of the form (10). Setting \(l^*_i=\widehat{L}_{\theta ^*}({\mathbf {x}}^{(i)})=\widetilde{L}_{\theta ^*}({\mathbf {x}}^{(i)})\) for \(i=1,\ldots ,n+m\) completes the proof. \(\square \)

Finally, we prove Corollary 2, which bounds \(L_\theta \) between the \(\widehat{L}_\theta ({\mathbf {x}})\) and \(\widetilde{L}_\theta ({\mathbf {x}})\) constructed above.

Proof

By the Lipschitz continuity, we have

$$\begin{aligned} L_\theta ({\mathbf {x}}^{(i)}) - \kappa \varDelta ({\mathbf {x}},{\mathbf {x}}^{(i)}) \le L_\theta ({\mathbf {x}}) \end{aligned}$$

Since \(L_\theta ({\mathbf {x}})\ge 0\), it follows that

$$\begin{aligned} \big (L_\theta ({\mathbf {x}}^{(i)}) - \kappa \varDelta ({\mathbf {x}},{\mathbf {x}}^{(i)})\big )_+ \le L_\theta ({\mathbf {x}}) \end{aligned}$$

Taking the maximum over i on the left hand side, we obtain

$$\begin{aligned} \widehat{L}_\theta ({\mathbf {x}}) \le L_\theta ({\mathbf {x}}) \end{aligned}$$

This proves the lower bound.

Similarly, we have by Lipschitz continuity

$$\begin{aligned} L_\theta ({\mathbf {x}}) \le \kappa \varDelta ({\mathbf {x}},{\mathbf {x}}^{(i)}) + L_\theta ({\mathbf {x}}^{(i)}) \end{aligned}$$

which, by taking the minimum over i on the right hand side, leads to

$$\begin{aligned} L_\theta ({\mathbf {x}}) \le \widetilde{L}_\theta ({\mathbf {x}}) \end{aligned}$$

This shows the upper bound. \(\square \)
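
As a small numerical illustration of the envelopes in (10) and the sandwich bound just proved (an illustration only; the one-dimensional example, the choice of \(L_\theta \), and the metric \(\varDelta (x,y)=|x-y|\) are arbitrary placeholders), the following sketch checks \(\widehat{L}_\theta \le L_\theta \le \widetilde{L}_\theta \) on a grid.

import numpy as np

rng = np.random.default_rng(0)
kappa = 2.0
L = lambda x: np.abs(np.sin(kappa * x))              # a nonnegative kappa-Lipschitz loss
xs = rng.uniform(-3.0, 3.0, size=20)                 # sample points x^(i)
ls = L(xs)                                           # values l_i = L(x^(i))
grid = np.linspace(-3.0, 3.0, 1001)
D = np.abs(grid[:, None] - xs[None, :])              # Delta(x, x^(i)) on the grid
L_hat = np.max(np.maximum(ls - kappa * D, 0.0), axis=1)   # lower envelope
L_til = np.min(ls + kappa * D, axis=1)                     # upper envelope
assert np.all(L_hat <= L(grid) + 1e-9) and np.all(L(grid) <= L_til + 1e-9)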


Cite this article

Qi, GJ. Loss-Sensitive Generative Adversarial Networks on Lipschitz Densities. Int J Comput Vis 128, 1118–1140 (2020). https://doi.org/10.1007/s11263-019-01265-2
