Generative Adversarial Networks (Part 1)

Preface

In 2014, Ian Goodfellow et al. proposed a new framework for training neural networks: the Generative Adversarial Network (GAN). GAN has been one of the hottest unsupervised learning methods of recent years; building on it, researchers have kept improving the framework, derived many related algorithms, and gradually applied it to semi-supervised and supervised learning as well. This series of posts records the theoretical derivation of GAN, its implementation, and its application to semi-supervised learning. This first post gives a brief introduction to GAN's theory and a simple implementation.

Theoretical Derivation of GAN

(Figure: screenshot of the abstract of the original paper)
First, note that the generative adversarial network is designed for estimating generative models; the "adversarial process" is the auxiliary means used to accomplish that task. The abstract of the original paper already explains the design idea of GAN and the rough training procedure very well. Let us now work through the theory of GAN in detail.

Concepts and Notation

A generative adversarial network consists of two main components:

  • Generator: learns to produce "fake" data that resembles the real data; its goal is to "fool" the Discriminator.
  • Discriminator: tries as hard as possible to distinguish real data from the "fake" data produced by the Generator.

The generator and the discriminator complement each other: their ongoing adversarial game is also the process by which both keep learning. Ideally, when this adversarial process ends, the Generator can produce data whose distribution closely matches the real data distribution, while the Discriminator can no longer tell real data from "fake" data at all. This is why the original paper notes that the discriminator ends up "equal to $\dfrac{1}{2}$ everywhere".

  • $p_g$: the distribution over data $x$ implicitly defined by the Generator, i.e. the distribution of $G(z)$ with $z\sim p_z$
  • $p_{data}$: the real data distribution

Theory

We first state GAN's objective function directly and then explain it briefly:
\begin{equation}
\min_{G}\max_{D}V(D,G)=\mathop{\mathbb{E}}\limits_{x\sim p_{data}(x)}[\log D(x)]+\mathop{\mathbb{E}}\limits_{z\sim p_z(z)}[\log(1-D(G(z)))].
\end{equation}

Here $p_z(z)$ is a predefined noise distribution, $G(z;\theta_g)$ is a differentiable multilayer perceptron with parameters $\theta_g$, and $D(x;\theta_d)$ is another multilayer perceptron whose output is a scalar; $D(x)$ is the probability that $x$ came from the real data rather than from $p_g$.

We train $D$ to maximize the probability of assigning the correct label, i.e. of telling whether a sample comes from $p_{data}$ or from $p_g$ (naturally, samples from both $p_{data}$ and $p_g$ are fed into $D$), and we simultaneously train $G$ to minimize $\log(1-D(G(z)))$.
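To make the objective concrete, here is a minimal sketch (my own illustration, not code from the paper) of how the two expectations in $V(D,G)$ can be estimated on a single minibatch; `D` and `G` stand for any PyTorch modules whose outputs are a probability in $(0,1)$ and a generated sample, respectively.

import torch

def estimate_value(D, G, x_real, z):
    # Monte-Carlo estimate of V(D, G) on one minibatch:
    #   E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))]
    real_term = torch.log(D(x_real)).mean()
    fake_term = torch.log(1.0 - D(G(z))).mean()
    return real_term + fake_term  # D ascends this value; G descends the second term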

Note: I have not yet seen a fully satisfying account of exactly where this objective comes from; one useful way to read it is that, for a fixed $G$, maximizing $V$ over $D$ is simply maximum-likelihood training of a binary classifier (real = 1, fake = 0), i.e. minimizing binary cross-entropy. Something to look into further!

Training Strategy


Figure 1 of the paper is worth studying carefully, because it depicts how the generator's distribution and the discriminator evolve during training.
Below is the GAN training procedure (Algorithm 1 in the original paper).
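The following is a compact PyTorch-style paraphrase of Algorithm 1 (my own sketch, not code from the paper); `D`, `G`, the two optimizers and the two samplers are assumed to be supplied by the caller, and `k` is the number of discriminator updates per generator update (the paper uses $k=1$ in its experiments).

import torch

def gan_training_loop(D, G, d_optimizer, g_optimizer,
                      sample_data, sample_noise,
                      num_iterations, k=1, m=64):
    # Paraphrase of Algorithm 1: k discriminator updates, then one generator update.
    for _ in range(num_iterations):
        for _ in range(k):
            x = sample_data(m)   # minibatch from p_data
            z = sample_noise(m)  # minibatch from the noise prior p_z
            # D ascends V(D, G), i.e. descends its negative
            d_loss = -(torch.log(D(x)).mean()
                       + torch.log(1 - D(G(z).detach())).mean())
            d_optimizer.zero_grad()
            d_loss.backward()
            d_optimizer.step()
        z = sample_noise(m)
        # G descends E_z[log(1 - D(G(z)))]
        g_loss = torch.log(1 - D(G(z))).mean()
        g_optimizer.zero_grad()
        g_loss.backward()
        g_optimizer.step()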

Proving the Objective Converges

In this part we need to show that GAN's objective is well behaved, i.e. that an optimum can actually be reached. Why prove this? Because a minimax problem like this one does not necessarily have an optimal solution, so we need to verify that the optimum exists and identify it.

First, we show that for any fixed generator $G$ there exists an optimal discriminator $D$.

Writing both expectations as integrals over $x$ (pushing the second expectation onto $p_g$ via the change of variables $x=G(z)$), we have
\begin{equation}
V(D,G)=\int_x [p_{data}(x)\log D(x)+p_g(x)\log(1-D(x))]dx
\end{equation}

Now consider, for any fixed $a,b\ge 0$ (not both zero), the function $y\mapsto a\log y+b\log(1-y)$ on $(0,1)$: it attains its maximum at $y=\dfrac{a}{a+b}$ (a one-line check is given below). Applying this pointwise inside the integral of equation (2), $V(D,G)$ is maximized over $D$ by $D=D_G^*(x)=\dfrac{p_{data}(x)}{p_{data}(x)+p_g(x)}$. In particular, when $p_{data}(x)=p_g(x)$ this optimal discriminator equals $\dfrac{1}{2}$ everywhere, which matches the remark from the paper quoted earlier.
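To check this maximizer, differentiate and set the derivative to zero:
$$
\frac{d}{dy}\Big[a\log y+b\log(1-y)\Big]=\frac{a}{y}-\frac{b}{1-y}=0
\;\Longrightarrow\;
y=\frac{a}{a+b},
$$
and the second derivative $-\dfrac{a}{y^2}-\dfrac{b}{(1-y)^2}<0$ confirms that this critical point is a maximum.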

Next, we show that $C(G)=\max_D V(D,G)$ attains its minimum if and only if $p_{data}(x)=p_g(x)$.

$\stackrel{\text{sufficiency}}{\Longrightarrow}$ When $p_{data}(x)=p_g(x)$,
\begin{equation}
\begin{split}
C(G)&=\int_x \left[p_{data}(x)\log\dfrac{1}{2}+p_{g}(x)\log\dfrac{1}{2}\right]dx\\
&=\log\dfrac{1}{2}\,(1+1)\\
&=-\log 4
\end{split}
\end{equation}

$\stackrel{\text{necessity}}{\Longleftarrow}$ For a general $G$,

$$
\begin{split}
C(G)&=\int_x \left[p_{data}(x)\log\dfrac{p_{data}(x)}{p_{data}(x)+p_g(x)}+p_g(x)\log\dfrac{p_{g}(x)}{p_{data}(x)+p_g(x)}\right]dx\\
&=\int_x \Big[(\log 2-\log 2)\,p_{data}(x)+p_{data}(x)\log\dfrac{p_{data}(x)}{p_{data}(x)+p_g(x)}\\
&\qquad +(\log 2-\log 2)\,p_g(x)+p_{g}(x)\log\dfrac{p_{g}(x)}{p_{data}(x)+p_g(x)}\Big]dx\\
&=-\log 2\int_x\big(p_{data}(x)+p_g(x)\big)dx+\int_x p_{data}(x)\Big(\log 2+\log\dfrac{p_{data}(x)}{p_{data}(x)+p_g(x)}\Big)dx\\
&\qquad +\int_x p_g(x)\Big(\log 2+\log\Big(1-\dfrac{p_{data}(x)}{p_{data}(x)+p_g(x)}\Big)\Big)dx\\
&=-\log 4+\int_x p_{data}(x)\log\dfrac{p_{data}(x)}{(p_{data}(x)+p_g(x))/2}\,dx+\int_x p_g(x)\log\dfrac{p_g(x)}{(p_{data}(x)+p_g(x))/2}\,dx\\
&=-\log 4+KL\Big(p_{data}\,\Big\|\,\dfrac{p_{data}+p_g}{2}\Big)+KL\Big(p_g\,\Big\|\,\dfrac{p_{data}+p_g}{2}\Big)\\
&=-\log 4+2\,JSD(p_{data}\,\|\,p_g)\geq -\log 4
\end{split}
$$
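The last two steps use the definition of the Jensen–Shannon divergence in terms of KL divergences:
$$
JSD(p\,\|\,q)=\frac{1}{2}KL\Big(p\,\Big\|\,\frac{p+q}{2}\Big)+\frac{1}{2}KL\Big(q\,\Big\|\,\frac{p+q}{2}\Big),
$$
which is always non-negative and equals zero if and only if $p=q$.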

Hence the minimum value $-\log 4$ is attained if and only if $p_{data}(x)=p_g(x)$, which completes the proof.

Implementing GAN (PyTorch)

So far we have gone through the whole theory of GAN. Now let us implement a simple GAN in PyTorch.
First import the required libraries, then define the generator and the discriminator.

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
from scipy.stats import norm

class Generator(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, f):
        super(Generator, self).__init__()
        self.map1 = nn.Linear(input_size, hidden_size)
        self.map2 = nn.Linear(hidden_size, hidden_size)
        self.map3 = nn.Linear(hidden_size, output_size)
        self.f = f  # activation function

    def forward(self, x):
        x = self.map1(x)
        x = self.f(x)
        x = self.map2(x)
        x = self.f(x)
        x = self.map3(x)
        return x

class Discriminator(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, f):
        super(Discriminator, self).__init__()
        self.map1 = nn.Linear(input_size, hidden_size)
        self.map2 = nn.Linear(hidden_size, hidden_size)
        self.map3 = nn.Linear(hidden_size, output_size)
        self.f = f  # activation function (sigmoid, so the output can be read as a probability)

    def forward(self, x):
        x = self.f(self.map1(x))
        x = self.f(self.map2(x))
        return self.f(self.map3(x))

Next, we do some simple preprocessing of the input data: instead of raw samples, the discriminator will be fed the first four moments of each batch.

(name, preprocess, d_input_func) = ("Only 4 moments", lambda data: get_moments(data), lambda x: 4)

def extract(v):
    return v.data.storage().tolist()

def stats(d):
    return [np.mean(d), np.std(d)]

def get_moments(d):
    # Return the first 4 moments of the data provided
    mean = torch.mean(d)
    diffs = d - mean
    var = torch.mean(torch.pow(diffs, 2.0))
    std = torch.pow(var, 0.5)
    zscores = diffs / std
    skews = torch.mean(torch.pow(zscores, 3.0))  # skewness
    kurtoses = torch.mean(torch.pow(zscores, 4.0)) - 3.0  # excess kurtosis, should be 0 for Gaussian
    final = torch.cat((mean.reshape(1,), std.reshape(1,), skews.reshape(1,), kurtoses.reshape(1,)))
    return final
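The training code below also calls `get_distribution_sampler` and `get_generator_input_sampler`, which are not shown in the snippets above. A minimal sketch consistent with how they are used (real samples drawn from a Gaussian as a `(1, n)` tensor, generator input drawn from a uniform distribution as an `(m, n)` tensor) could be:

def get_distribution_sampler(mu, sigma):
    # Real data: samples from N(mu, sigma), shaped (1, n)
    return lambda n: torch.Tensor(np.random.normal(mu, sigma, (1, n)))

def get_generator_input_sampler():
    # Generator input: uniform noise in [0, 1), shaped (m, n)
    return lambda m, n: torch.rand(m, n)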

Next, set the hyperparameters and start training the GAN.

data_mean = 4
data_stddev = 1.25

def train():
    # Model parameters
    g_input_size = 1      # Random noise dimension coming into generator, per output vector
    g_hidden_size = 5     # Generator complexity
    g_output_size = 1     # Size of generated output vector
    d_input_size = 500    # Minibatch size - cardinality of distributions
    d_hidden_size = 10    # Discriminator complexity
    d_output_size = 1     # Single dimension for 'real' vs. 'fake' classification
    minibatch_size = d_input_size

    d_learning_rate = 1e-3
    g_learning_rate = 1e-3
    sgd_momentum = 0.9

    num_epochs = 2000
    print_interval = 100
    d_steps = 20
    g_steps = 20

    dfe, dre, ge = 0, 0, 0
    d_real_data, d_fake_data, g_fake_data = None, None, None

    discriminator_activation_function = torch.sigmoid  # discriminator activation function
    generator_activation_function = torch.tanh         # generator activation function

    d_sampler = get_distribution_sampler(data_mean, data_stddev)
    gi_sampler = get_generator_input_sampler()
    G = Generator(input_size=g_input_size,
                  hidden_size=g_hidden_size,
                  output_size=g_output_size,
                  f=generator_activation_function)
    D = Discriminator(input_size=d_input_func(d_input_size),
                      hidden_size=d_hidden_size,
                      output_size=d_output_size,
                      f=discriminator_activation_function)
    criterion = nn.BCELoss()  # Binary cross entropy: http://pytorch.org/docs/nn.html#bceloss
    d_optimizer = optim.SGD(D.parameters(), lr=d_learning_rate, momentum=sgd_momentum)
    g_optimizer = optim.SGD(G.parameters(), lr=g_learning_rate, momentum=sgd_momentum)

    for epoch in range(num_epochs):
        for d_index in range(d_steps):
            # 1. Train D on real+fake
            D.zero_grad()

            # 1A: Train D on real
            d_real_data = torch.Tensor(d_sampler(d_input_size))  # draw 500 samples from N(4, 1.25) as real data
            d_real_decision = D(preprocess(d_real_data))  # feed the batch's mean, std, skewness and kurtosis to D
            d_real_error = criterion(d_real_decision, torch.Tensor(torch.ones([1, 1])))  # ones = true
            d_real_error.backward()  # compute/store gradients, but don't change params

            # 1B: Train D on fake
            d_gen_input = torch.Tensor(gi_sampler(minibatch_size, g_input_size))
            d_fake_data = G(d_gen_input).detach()  # detach to avoid training G on these labels
            d_fake_decision = D(preprocess(d_fake_data.t()))
            d_fake_error = criterion(d_fake_decision, torch.Tensor(torch.zeros([1, 1])))  # zeros = fake
            d_fake_error.backward()
            d_optimizer.step()  # Only optimizes D's parameters; changes based on stored gradients from backward()

            dre, dfe = extract(d_real_error)[0], extract(d_fake_error)[0]

        for g_index in range(g_steps):
            # 2. Train G on D's response (but DO NOT train D on these labels)
            G.zero_grad()

            gen_input = torch.Tensor(gi_sampler(minibatch_size, g_input_size))
            g_fake_data = G(gen_input)
            dg_fake_decision = D(preprocess(g_fake_data.t()))
            g_error = criterion(dg_fake_decision, torch.Tensor(torch.ones([1, 1])))  # Train G to pretend it's genuine

            g_error.backward()
            g_optimizer.step()  # Only optimizes G's parameters
            ge = extract(g_error)[0]

        if epoch % print_interval == 0:
            print("Epoch %s: D (%s real_err, %s fake_err) G (%s err); Real Dist (%s), Fake Dist (%s) " %
                  (epoch, dre, dfe, ge, stats(extract(d_real_data)), stats(extract(d_fake_data))))

    print("Plotting the generated distribution...")
    values = extract(g_fake_data)
    mu = np.mean(values)
    sigma = np.std(values)
    print(" Values: %s" % (str(values)))
    n, bins, patches = plt.hist(values, bins=100, density=True)  # 'normed' was removed in newer matplotlib
    y = norm.pdf(bins, mu, sigma)
    plt.plot(bins, y, 'r--')
    plt.xlabel('Value')
    plt.ylabel('probability')
    plt.title(r'Histogram of Generated Distribution: $\mu={}$, $\sigma={}$'.format(round(mu, 2), round(sigma, 2)))
    plt.grid(True)
    plt.show()


train()

Alright, let's look at the final training result.

As we can see, the real data come from the normal distribution with $\mu=4,\sigma=1.25$ that we set in advance; after GAN training, the generator $G$, starting from uniform noise, produces an approximately normal distribution with $\mu=3.9,\sigma=1.24$, which is already very close to the real distribution $p_{data}(x)=N(4, 1.25)$.

That's it for this first post on GAN; the next one will follow once I have studied applying GAN to semi-supervised learning.