
Translation: Generative Adversarial Nets


Abstract

We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

1. Introduction

The promise of deep learning is to discover rich, hierarchical models [2] that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. So far, the most striking successes in deep learning have involved discriminative models, usually those that map a high-dimensional, rich sensory input to a class label [14, 22]. These striking successes have primarily been based on the backpropagation and dropout algorithms, using piecewise linear units [19, 9, 10] which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to the difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties.

In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.

This framework can yield specific training algorithms for many kinds of model and optimization algorithm. In this article, we explore the special case when the generative model generates samples by passing random noise through a multilayer perceptron, and the discriminative model is also a multilayer perceptron. We refer to this special case as adversarial nets. In this case, we can train both models using only the highly successful backpropagation and dropout algorithms [17] and sample from the generative model using only forward propagation. No approximate inference or Markov chains are necessary.

2. Related Work

An alternative to directed graphical models with latent variables are undirected graphical models with latent variables, such as restricted Boltzmann machines (RBMs) [27, 16], deep Boltzmann machines (DBMs) [26] and their numerous variants. The interactions within such models are represented as the product of unnormalized potential functions, normalized by a global summation/integration over all states of the random variables. This quantity (the partition function) and its gradient are intractable for all but the most trivial instances, although they can be estimated by Markov chain Monte Carlo (MCMC) methods. Mixing poses a significant problem for learning algorithms that rely on MCMC [3, 5].

Deep belief networks (DBNs) [16] are hybrid models containing a single undirected layer and several directed layers. While a fast approximate layer-wise training criterion exists, DBNs incur the computational difficulties associated with both undirected and directed models.

Alternative criteria that do not approximate or bound the log-likelihood have also been proposed, such as score matching [18] and noise-contrastive estimation (NCE) [13]. Both of these require the learned probability density to be analytically specified up to a normalization constant. Note that in many interesting generative models with several layers of latent variables (such as DBNs and DBMs), it is not even possible to derive a tractable unnormalized probability density. Some models such as denoising auto-encoders [30] and contractive autoencoders have learning rules very similar to score matching applied to RBMs. In NCE, as in this work, a discriminative training criterion is employed to fit a generative model. However, rather than fitting a separate discriminative model, the generative model itself is used to discriminate generated data from samples of a fixed noise distribution. Because NCE uses a fixed noise distribution, learning slows dramatically after the model has learned even an approximately correct distribution over a small subset of the observed variables.

Finally, some techniques do not involve defining a probability distribution explicitly, but rather train a generative machine to draw samples from the desired distribution. This approach has the advantage that such machines can be designed to be trained by back-propagation. Prominent recent work in this area includes the generative stochastic network (GSN) framework [5], which extends generalized denoising auto-encoders [4]: both can be seen as defining a parameterized Markov chain, i.e., one learns the parameters of a machine that performs one step of a generative Markov chain. Compared to GSNs, the adversarial nets framework does not require a Markov chain for sampling. Because adversarial nets do not require feedback loops during generation, they are better able to leverage piecewise linear units [19, 9, 10], which improve the performance of backpropagation but have problems with unbounded activation when used in a feedback loop. More recent examples of training a generative machine by back-propagating into it include recent work on auto-encoding variational Bayes [20] and stochastic backpropagation [24].

3. Adversarial Nets

The adversarial modeling framework is most straightforward to apply when the models are both multilayer perceptrons. To learn the generator's distribution pg over data x, we define a prior on input noise variables pz(z), then represent a mapping to data space as G(z; θg), where G is a differentiable function represented by a multilayer perceptron with parameters θg. We also define a second multilayer perceptron D(x; θd) that outputs a single scalar. D(x) represents the probability that x came from the data rather than pg. We train D to maximize the probability of assigning the correct label to both training examples and samples from G. We simultaneously train G to minimize log(1 - D(G(z))):

In other words, D and G play the following two-player minimax game with value function V(G, D):

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (1)

In the next section, we present a theoretical analysis of adversarial nets, essentially showing that the training criterion allows one to recover the data generating distribution as G and D are given enough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical explanation of the approach. In practice, we must implement the game using an iterative, numerical approach. Optimizing D to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, so long as G changes slowly enough. This strategy is analogous to the way that SML/PCD [31, 29] training maintains samples from a Markov chain from one learning step to the next in order to avoid burning in a Markov chain as part of the inner loop of learning. The procedure is formally presented in Algorithm 1.

In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, log(1 - D(G(z))) saturates. Rather than training G to minimize log(1 - D(G(z))) we can train G to maximize log D(G(z)). This objective function results in the same fixed point of the dynamics of G and D but provides much stronger gradients early in learning.
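A small numerical illustration of this saturation (a sketch for intuition, not taken from the paper): early in training D(G(z)) is close to 0, so the gradient of log(1 - D) with respect to the discriminator output stays near -1, while the gradient of log D becomes very large.

    import numpy as np

    # Hypothetical discriminator outputs on generated samples early in training,
    # when D confidently rejects them (D(G(z)) close to 0).
    d_of_gz = np.array([1e-4, 1e-3, 1e-2])

    # d/dD log(1 - D) = -1 / (1 - D): magnitude stays near 1, so the saturating
    # objective passes only a weak learning signal back to G.
    grad_saturating = -1.0 / (1.0 - d_of_gz)

    # d/dD log(D) = 1 / D: magnitude grows as D -> 0, so maximizing log D(G(z))
    # provides a much stronger signal early on.
    grad_nonsaturating = 1.0 / d_of_gz

    print(grad_saturating)     # approx. [-1.0001, -1.001, -1.0101]
    print(grad_nonsaturating)  # [10000., 1000., 100.]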


Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution (D, blue, dashed line) so that it discriminates between samples from the data generating distribution (black, dotted line) px from those of the generative distribution pg (G) (green, solid line). The lower horizontal line is the domain from which z is sampled, in this case uniformly. The horizontal line above is part of the domain of x. The upward arrows show how the mapping x = G(z) imposes the non-uniform distribution pg on transformed samples. G contracts in regions of high density and expands in regions of low density of pg. (a) Consider an adversarial pair near convergence: pg is similar to pdata and D is a partially accurate classifier. (b) In the inner loop of the algorithm D is trained to discriminate samples from data, converging to D*(x) = pdata(x)/(pdata(x)+pg(x)). (c) After an update to G, gradient of D has guided G(z) to flow to regions that are more likely to be classified as data. (d) After several steps of training, if G and D have enough capacity, they will reach a point at which both cannot improve because pg = pdata. The discriminator is unable to differentiate between the two distributions, i.e. D(x) = 1/2.

4. Theoretical Results

The generator G implicitly defines a probability distribution pg as the distribution of the samples G(z) obtained when z ~ pz. Therefore, we would like Algorithm 1 to converge to a good estimator of pdata, if given enough capacity and training time. The results of this section are done in a nonparametric setting, e.g. we represent a model with infinite capacity by studying convergence in the space of probability density functions.

Algorithm 1: minibatch stochastic gradient descent training of generative adversarial nets, alternating k steps of updating the discriminator with one step of updating the generator (full pseudocode omitted here).
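Since the pseudocode is not reproduced above, the following is a minimal PyTorch sketch of the procedure Algorithm 1 describes: k discriminator updates ascending log D(x) + log(1 - D(G(z))), then one generator update (shown here with the non-saturating log D(G(z)) objective from Section 3). The toy 1-D data distribution, network sizes, and optimizer settings are illustrative assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    noise_dim, data_dim, batch, k, eps = 8, 1, 64, 1, 1e-8

    # Generator G(z; theta_g) and discriminator D(x; theta_d) as small MLPs.
    G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
    D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
    opt_g = torch.optim.SGD(G.parameters(), lr=0.01, momentum=0.9)
    opt_d = torch.optim.SGD(D.parameters(), lr=0.01, momentum=0.9)

    def sample_data(m):
        return 3.0 + torch.randn(m, data_dim)   # stand-in for p_data (assumed N(3, 1))

    def sample_noise(m):
        return torch.randn(m, noise_dim)        # noise prior p_z(z)

    for step in range(2000):
        # k steps of optimizing D: ascend log D(x) + log(1 - D(G(z))).
        for _ in range(k):
            x, z = sample_data(batch), sample_noise(batch)
            d_loss = -(torch.log(D(x) + eps) +
                       torch.log(1 - D(G(z).detach()) + eps)).mean()
            opt_d.zero_grad()
            d_loss.backward()
            opt_d.step()

        # One step of optimizing G: ascend log D(G(z)) (non-saturating variant).
        z = sample_noise(batch)
        g_loss = -torch.log(D(G(z)) + eps).mean()
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()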

4.1. Global Optimality of pg = pdata

We first consider the optimal discriminator D for any given generator G.

Proposition 1. For G fixed, the optimal discriminator D is

D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \qquad (2)

Proof. The training criterion for the discriminator D, given any generator G, is to maximize the quantity V(G, D)

V(G, D) = \int_x p_{data}(x) \log(D(x))\,dx + \int_z p_z(z) \log(1 - D(G(z)))\,dz
        = \int_x p_{data}(x) \log(D(x)) + p_g(x) \log(1 - D(x))\,dx \qquad (3)

For any (a, b) ∈ R² \ {(0, 0)}, the function y ↦ a log(y) + b log(1 - y) achieves its maximum in [0, 1] at a/(a + b). The discriminator does not need to be defined outside of Supp(pdata) ∪ Supp(pg), concluding the proof.
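As a quick check of this maximizer (a worked step, not in the original text): setting the derivative to zero gives

    \frac{d}{dy}\left[a \log y + b \log(1 - y)\right] = \frac{a}{y} - \frac{b}{1 - y} = 0 \;\Longrightarrow\; y = \frac{a}{a + b},

and since a log y + b log(1 - y) is concave on (0, 1) for a, b ≥ 0, this critical point is the maximum.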

Note that the training objective for D can be interpreted as maximizing the log-likelihood for estimating the conditional probability P(Y = y | x), where Y indicates whether x comes from pdata (with y = 1) or from pg (with y = 0). The minimax game in Eq. 1 can now be reformulated as:

C(G) = \max_D V(G, D)
     = \mathbb{E}_{x \sim p_{data}}[\log D^*_G(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D^*_G(G(z)))]
     = \mathbb{E}_{x \sim p_{data}}[\log D^*_G(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*_G(x))]
     = \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{data}(x) + p_g(x)}\right] \qquad (4)

Theorem 1. The global minimum of the virtual training criterion C(G) is achieved if and only if pg = pdata. At that point, C(G) achieves the value -log 4.

Proof. For pg = pdata, D*_G(x) = 1/2 (consider Eq. 2). Hence, by inspecting Eq. 4 at D*_G(x) = 1/2, we find C(G) = log 1/2 + log 1/2 = -log 4. To see that this is the best possible value of C(G), reached only for pg = pdata, observe that

\mathbb{E}_{x \sim p_{data}}[-\log 2] + \mathbb{E}_{x \sim p_g}[-\log 2] = -\log 4

and that by subtracting this expression from C(G) = V(D*_G, G), we obtain:

C(G) = -\log(4) + KL\!\left(p_{data} \,\middle\|\, \frac{p_{data} + p_g}{2}\right) + KL\!\left(p_g \,\middle\|\, \frac{p_{data} + p_g}{2}\right) \qquad (5)

where KL is the Kullback–Leibler divergence. We recognize in the previous expression the Jensen–Shannon divergence between the model's distribution and the data generating process:

C(G) = -\log(4) + 2 \cdot JSD(p_{data} \,\|\, p_g) \qquad (6)

Since the Jensen–Shannon divergence between two distributions is always non-negative, and zero only when they are equal, we have shown that C* = -log(4) is the global minimum of C(G) and that the only solution is pg = pdata, i.e., the generative model perfectly replicating the data generating process.
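To spell out the step from Eq. 4 to Eq. 5 and Eq. 6 (a filled-in intermediate consistent with the definitions above, not text from the original paper): adding and subtracting log 4 inside the expectations of Eq. 4 gives

    C(G) = -\log 4 + \mathbb{E}_{x \sim p_{data}}\left[\log \frac{p_{data}(x)}{(p_{data}(x) + p_g(x))/2}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{(p_{data}(x) + p_g(x))/2}\right],

whose two expectations are exactly the KL terms of Eq. 5, and the standard definition JSD(p || q) = (1/2) KL(p || (p+q)/2) + (1/2) KL(q || (p+q)/2) turns Eq. 5 into Eq. 6.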

4.2. Convergence of Algorithm 1

Proposition 2. If G and D have enough capacity, and at each step of Algorithm 1 the discriminator is allowed to reach its optimum given G, and pg is updated so as to improve the criterion

\mathbb{E}_{x \sim p_{data}}[\log D^*_G(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*_G(x))]

then pg converges to pdata.

Proof. Consider V(G, D) = U(pg, D) as a function of pg, as done in the above criterion. Note that U(pg, D) is convex in pg. The subderivatives of a supremum of convex functions include the derivative of the function at the point where the maximum is attained. In other words, if f(x) = sup_{α∈A} f_α(x) and f_α(x) is convex in x for every α, then ∂f_β(x) ∈ ∂f if β = arg sup_{α∈A} f_α(x). This is equivalent to computing a gradient descent update for pg at the optimal D given the corresponding G. sup_D U(pg, D) is convex in pg with a unique global optimum as proven in Thm 1; therefore, with sufficiently small updates of pg, pg converges to px, concluding the proof.

In practice, adversarial nets represent a limited family of pg distributions via the function G(z; θg), and we optimize θg rather than pg itself. Using a multilayer perceptron to define G introduces multiple critical points in parameter space. However, the excellent performance of multilayer perceptrons in practice suggests that they are a reasonable model to use despite their lack of theoretical guarantees.

5. Experiments

We trained adversarial nets on a range of datasets including MNIST [23], the Toronto Face Database (TFD) [28], and CIFAR-10 [21]. The generator nets used a mixture of rectifier linear activations [19, 9] and sigmoid activations, while the discriminator net used maxout [10] activations. Dropout [17] was applied in training the discriminator net. While our theoretical framework permits the use of dropout and other noise at intermediate layers of the generator, we used noise as the input to only the bottommost layer of the generator network.
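A maxout unit computes, for each output feature, the maximum over several affine projections of the input. A minimal sketch of such a layer follows; the layer sizes, number of pieces, and dropout rate are illustrative assumptions, not the configuration reported in the paper.

    import torch
    import torch.nn as nn

    class Maxout(nn.Module):
        """Maxout unit: k affine pieces per output feature, combined by an elementwise max."""
        def __init__(self, in_features, out_features, k=5):
            super().__init__()
            self.out_features, self.k = out_features, k
            self.linear = nn.Linear(in_features, out_features * k)

        def forward(self, x):
            z = self.linear(x).view(-1, self.out_features, self.k)
            return z.max(dim=2).values   # max over the k pieces

    # Example: a small maxout discriminator with dropout, ending in a sigmoid probability.
    discriminator = nn.Sequential(
        Maxout(784, 240), nn.Dropout(0.5),
        Maxout(240, 240), nn.Dropout(0.5),
        nn.Linear(240, 1), nn.Sigmoid(),
    )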

We estimate the probability of the test set data under pg by fitting a Gaussian Parzen window to the samples generated with G and reporting the log-likelihood under this distribution. The σ parameter of the Gaussians was obtained by cross validation on the validation set. This procedure was introduced in Breuleux et al. [8] and used for various generative models for which the exact likelihood is not tractable [25, 3, 5]. Results are reported in Table 1. This method of estimating the likelihood has somewhat high variance and does not perform well in high dimensional spaces, but it is the best method available to our knowledge. Advances in generative models that can sample but not estimate likelihood directly motivate further research into how to evaluate such models.
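A minimal sketch of this Gaussian Parzen-window evaluation (an assumed implementation for illustration; the bandwidth grid, split sizes, and random data below are placeholders, not the paper's setup):

    import numpy as np
    from scipy.special import logsumexp

    def parzen_log_likelihood(samples, x, sigma):
        """Mean log-likelihood of the rows of x under an isotropic Gaussian Parzen
        window with bandwidth sigma, fit to the generated samples."""
        n, d = samples.shape
        sq_dists = (np.sum(x ** 2, axis=1)[:, None]
                    + np.sum(samples ** 2, axis=1)[None, :]
                    - 2.0 * x @ samples.T)                    # (n_x, n_samples)
        log_kernel = -sq_dists / (2.0 * sigma ** 2)
        log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
        return float(np.mean(logsumexp(log_kernel, axis=1) - log_norm))

    rng = np.random.default_rng(0)
    generated = rng.normal(size=(5000, 784))   # stand-in for samples drawn from G
    valid = rng.normal(size=(500, 784))        # validation split, used to choose sigma
    test = rng.normal(size=(500, 784))         # held-out test split
    sigma = max((0.1, 0.2, 0.5, 1.0),
                key=lambda s: parzen_log_likelihood(generated, valid, s))
    print(sigma, parzen_log_likelihood(generated, test, sigma))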


Table 1: Parzen window-based log-likelihood estimates. The reported numbers on MNIST are the mean log-likelihood of samples on the test set, with the standard error of the mean computed across examples. On TFD, we computed the standard error across folds of the dataset, with a different σ chosen using the validation set of each fold. On TFD, σ was cross validated on each fold and the mean log-likelihood on each fold was computed. For MNIST we compare against other models of the real-valued (rather than binary) version of the dataset.

In Figures 2 and 3 we show samples drawn from the generator net after training. While we make no claim that these samples are better than samples generated by existing methods, we believe that these samples are at least competitive with the better generative models in the literature and highlight the potential of the adversarial framework.


Figure 2: Visualization of samples from the model. Rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distributions, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and "deconvolutional" generator)


Figure 3: Digits obtained by linearly interpolating between coordinates in z space of the full model.

6. Advantages and Disadvantages

This new framework comes with advantages and disadvantages relative to previous modeling frameworks. The disadvantages are primarily that there is no explicit representation of pg(x), and that D must be synchronized well with G during training (in particular, G must not be trained too much without updating D, in order to avoid "the Helvetica scenario" in which G collapses too many values of z to the same value of x to have enough diversity to model pdata), much as the negative chains of a Boltzmann machine must be kept up to date between learning steps. The advantages are that Markov chains are never needed, only backprop is used to obtain gradients, no inference is needed during learning, and a wide variety of functions can be incorporated into the model. Table 2 summarizes the comparison of generative adversarial nets with other generative modeling approaches.

Table 2 (columns: Deep directed graphical models | Deep undirected graphical models | Generative autoencoders | Adversarial models)

Training: Inference needed during training. | Inference needed during training. MCMC needed to approximate partition function gradient. | Enforced tradeoff between mixing and power of reconstruction generation. | Synchronizing the discriminator with the generator. Helvetica.
Inference: Learned approximate inference | Variational inference | MCMC-based inference | Learned approximate inference
Sampling: No difficulties | Requires Markov chain | Requires Markov chain | No difficulties
Evaluating p(x): Intractable, may be approximated with AIS | Intractable, may be approximated with AIS | Not explicitly represented, may be approximated with Parzen density estimation | Not explicitly represented, may be approximated with Parzen density estimation
Model design: Nearly all models incur extreme difficulty | Careful design needed to ensure multiple properties | Any differentiable function is theoretically permitted | Any differentiable function is theoretically permitted

Table 2: Challenges in generative modeling: a summary of the difficulties encountered by different approaches to deep generative modeling for each of the major operations involving a model.

The aforementioned advantages are primarily computational. Adversarial models may also gain some statistical advantage from the generator network not being updated directly with data examples, but only with gradients flowing through the discriminator. This means that components of the input are not copied directly into the generator's parameters. Another advantage of adversarial networks is that they can represent very sharp, even degenerate distributions, while methods based on Markov chains require that the distribution be somewhat blurry in order for the chains to be able to mix between modes.

7. Conclusions and Future Work

This framework admits many straightforward extensions:

  1. A conditional generative model p(x | c) can be obtained by adding c as input to both G and D.
  2. Learned approximate inference can be performed by training an auxiliary network to predict z given x. This is similar to the inference net trained by the wake-sleep algorithm [15] but with the advantage that the inference net may be trained for a fixed generator net after the generator net has finished training.

  3. One can approximately model all conditionals p(x_S | x_∖S), where S is a subset of the indices of x, by training a family of conditional models that share parameters. Essentially, one can use adversarial nets to implement a stochastic extension of the deterministic MP-DBM [11].

  4. Semi-supervised learning: features from the discriminator or inference net could improve performance of classifiers when limited labeled data is available.

  5. Efficiency improvements: training could be accelerated greatly by devising better methods for coordinating G and D or determining better distributions to sample z from during training.

This paper has demonstrated the viability of the adversarial modeling framework, suggesting that these research directions could prove useful.

Acknowledgements

We would like to acknowledge Patrice Marcotte, Olivier Delalleau, Kyunghyun Cho, Guillaume Alain and Jason Yosinski for helpful discussions. Yann Dauphin shared his Parzen window evaluation code with us. We would like to thank the developers of Pylearn2 [12] and Theano [7, 1], particularly Frédéric Bastien who rushed a Theano feature specifically to benefit this project. Arnaud Bergeron provided much-needed support with LaTeX typesetting. We would also like to thank CIFAR, and Canada Research Chairs for funding, and Compute Canada, and Calcul Québec for providing computational resources. Ian Goodfellow is supported by the 2013 Google Fellowship in Deep Learning. Finally, we would like to thank Les Trois Brasseurs for stimulating our creativity.

References

  • [1] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
  • [2] Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.
  • [3] Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013a). Better mixing via deep representations. In ICML’13.
  • [4] Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013b). Generalized denoising auto-encoders as generative models. In NIPS26. Nips Foundation.
  • [5] Bengio, Y., Thibodeau-Laufer, E., and Yosinski, J. (2014a). Deep generative stochastic networks trainable by backprop. In ICML’14.
  • [6] Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative stochastic networks trainable by backprop. In Proceedings of the 30th International Conference on Machine Learning (ICML’14).
  • [7] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
  • [8] Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating representative samples from an RBM-derived process. Neural Computation, 23(8), 2053–2073.
  • [9] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In AISTATS’2011.
  • [10] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout networks. In ICML’2013.
  • [11] Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep Boltzmann machines. In NIPS’2013.
  • [12] Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214.
  • [13] Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS’2010.
  • [14] Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6), 82–97.
  • [15] Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.
  • [16] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
  • [17] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012b). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
  • [18] Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score matching. J. Machine Learning Res., 6.
  • [19] Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV’09), pages 2146–2153. IEEE.
  • [20] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR).
  • [21] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
  • [22] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In NIPS’2012.
  • [23] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
  • [24] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. Technical report, arXiv:1401.4082.
  • [25] Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive auto-encoders. In ICML’12.
  • [26] Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS’2009, pages 448–455.
  • [27] Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, chapter 6, pages 194–281. MIT Press, Cambridge.
  • [28] Susskind, J., Anderson, A., and Hinton, G. E. (2010). The Toronto face dataset. Technical Report UTML TR 2010-001, U. Toronto.
  • [29] Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, pages 1064–1071. ACM.
  • [30] Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In ICML 2008.
  • [31] Younes, L. (1999). On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics and Stochastic Reports, 65(3), 177–228.
