This post documents the process of deploying a non-standard model: one with no tensor names defined, no placeholders defined, and no `train` flag defined for its batch normalization.

A fairly good model trained some time ago needs to be deployed in a real application, but TensorFlow has gone through seven or eight releases since then and some interfaces have changed, which makes deployment harder.

The approach here is to export the model and experiment on the old TensorFlow version first, and deploy to the new version only after that succeeds, starting with the most basic meta-graph export/import.

Code to list all names in use in the graph: `print('graph names:{}'.format(graph._names_in_use))`

The final layer of the original model:
```
# Fully connected layer + softmax
with tf.variable_scope('logit'):
    logits = self._fully_connected(x, self.hps.num_classes)
```
`self._fully_connected` calls `tf.nn.xw_plus_b(x, w, b)` as its final step.
An example program to find the default name of the output op:
```
import numpy as np
import tensorflow as tf

graph = tf.get_default_graph()
w = tf.Variable(np.random.randn(5, 5).astype('float32'), name="w")
x = tf.Variable(np.random.randn(5, 5).astype('float32'), name="w2")
b = tf.Variable(np.random.randn(5).astype('float32'), name="b")
tf.nn.xw_plus_b(w, x, b)
print('graph names:{}'.format(graph._names_in_use))
```
The output is:

```
graph names:{'w/read': 1, 'b/assign': 1, 'b': 1, 'w2/read': 1, 'xw_plus_b/matmul': 1, 'w2': 1, 'w2/assign': 1, 'b/initial_value': 1, 'w/assign': 1, 'w/initial_value': 1, 'w': 1, 'b/read': 1, 'xw_plus_b': 1, 'w2/initial_value': 1}
```
We can see that `xw_plus_b` is the underlying op name, so the final output tensor of the original model is `"logit/xw_plus_b:0"`, and the output-fetching code at deployment time is:
```
op_logit = graph.get_tensor_by_name("logit/xw_plus_b:0")
logits = sess.run(op_logit, feed_dict)
predictions = np.argmax(logits, axis=1)
```
An example program to find the default name of the input op:
```
import numpy as np
import tensorflow as tf

graph = tf.get_default_graph()
image = tf.Variable(np.random.randn(5, 60000).astype('float32'), name="image")
label = tf.Variable(np.random.randn(5).astype('int'), name="label")
data_num = tf.Variable(np.random.randn(5).astype('int'), name="data_num")
data_num, images, sparse_labels = tf.train.shuffle_batch(
    [data_num, image, label], batch_size=5, num_threads=2,
    capacity=20, min_after_dequeue=10)
print('graph names:{}'.format(graph._names_in_use))
```
The output is:

```
graph names:{'shuffle_batch/tofloat': 1, 'image/read': 1, 'label': 1, 'image': 1, 'shuffle_batch/sub': 1, 'shuffle_batch/const': 1, 'label/assign': 1, 'label/read': 1, 'shuffle_batch/random_shuffle_queue_close': 2, 'shuffle_batch/maximum': 1, 'shuffle_batch/random_shuffle_queue_enqueue': 1, 'data_num': 1, 'shuffle_batch/mul/y': 1, 'image/initial_value': 1, 'shuffle_batch/random_shuffle_queue': 1, 'shuffle_batch/sub/y': 1, 'data_num/assign': 1, 'shuffle_batch/random_shuffle_queue_close_1': 1, 'label/initial_value': 1, 'data_num/read': 1, 'image/assign': 1, 'shuffle_batch/random_shuffle_queue_size': 1, 'shuffle_batch/n': 1, 'shuffle_batch/fraction_over_10_of_10_full/tags': 1, 'shuffle_batch': 1, 'shuffle_batch/maximum/x': 1, 'data_num/initial_value': 1, 'shuffle_batch/mul': 1, 'shuffle_batch/fraction_over_10_of_10_full': 1}
```
We can see that `shuffle_batch` is the underlying op, so the input tensor of the original model is `"input/shuffle_batch:1"` (output index 1 of the op, under the `input` name scope of the original model), and the input-fetching code at deployment time is:

```
test_data = graph.get_tensor_by_name("input/shuffle_batch:1")
```
An example program to check the effect of force-feeding values through feed_dict, including overriding a Variable:
```
import tensorflow as tf

graph = tf.get_default_graph()
w = tf.placeholder("float", name="w")
w1 = tf.Variable(5.0, name="w1")
x = tf.Variable(2.0, name="x")
feed_dict = {w: 5.0, w1: 4.0}  # w1 is a Variable, but we force-feed it anyway
y = tf.multiply(w, x)
y1 = tf.multiply(w1, x)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
ty, ty1 = sess.run([y, y1], feed_dict=feed_dict)
print(ty)
print(ty1)
```
The output is:

```
10.0
8.0
```
As shown, `y1` evaluates to 4 * 2 = 8, computed from the force-fed value rather than the Variable's initial value of 5.0, so overriding through feed_dict works.
Add the force-feeding feed_dict code to the original model:

```
images_placeholder = np.random.randn(100, 60000).astype('float32')
feed_dict = {test_data: images_placeholder}
```
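Putting the pieces together, here is a minimal sketch of the meta-graph deployment flow described above. The checkpoint paths are placeholders, and the tensor names follow the examples in this post; both must be adjusted to the actual model:

```
import numpy as np
import tensorflow as tf

with tf.Session() as sess:
    # Import the saved graph definition and restore the weights
    # ('/path/to/model.ckpt' is a placeholder path).
    saver = tf.train.import_meta_graph('/path/to/model.ckpt.meta')
    saver.restore(sess, '/path/to/model.ckpt')
    graph = tf.get_default_graph()

    # Look up the input and output tensors by the names found earlier.
    test_data = graph.get_tensor_by_name("input/shuffle_batch:1")
    op_logit = graph.get_tensor_by_name("logit/xw_plus_b:0")

    # Force-feed the input tensor, bypassing the original queue-based pipeline.
    images_placeholder = np.random.randn(100, 60000).astype('float32')
    logits = sess.run(op_logit, feed_dict={test_data: images_placeholder})
    predictions = np.argmax(logits, axis=1)
```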
On using batch normalization in TensorFlow: supporting the batch-norm training parameter would require adding code at training time and retraining the model, so it is left out for now; instead, the issue is worked around by feeding a sufficiently large batch of samples at every prediction.
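For reference, if the batch-norm training flag were added, the usual TF 1.x pattern looks like the sketch below. This is a generic example, not this model's code, and the layer sizes are made up:

```
import tensorflow as tf

is_training = tf.placeholder(tf.bool, name='is_training')
x = tf.placeholder(tf.float32, [None, 64])

# Batch norm must know whether it is training (use batch statistics and
# update the moving averages) or inferring (use the stored moving averages).
h = tf.layers.dense(x, 32)
h = tf.layers.batch_normalization(h, training=is_training)
h = tf.nn.relu(h)

loss = tf.reduce_mean(tf.square(h))  # dummy loss for the sketch
# The moving-average update ops must run together with the train op.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```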
We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
The promise of deep learning is to discover rich, hierarchical models [2] that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. So far, the most striking successes in deep learning have involved discriminative models, usually those that map a high-dimensional, rich sensory input to a class label [14, 22]. These striking successes have primarily been based on the backpropagation and dropout algorithms, using piecewise linear units [19, 9, 10] which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties.
In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.
This framework can yield specific training algorithms for many kinds of model and optimization algorithm. In this article, we explore the special case when the generative model generates samples by passing random noise through a multilayer perceptron, and the discriminative model is also a multilayer perceptron. We refer to this special case as adversarial nets. In this case, we can train both models using only the highly successful backpropagation and dropout algorithms [17] and sample from the generative model using only forward propagation. No approximate inference or Markov chains are necessary.
An alternative to directed graphical models with latent variables are undirected graphical models with latent variables, such as restricted Boltzmann machines (RBMs) [27, 16], deep Boltzmann machines (DBMs) [26] and their numerous variants. The interactions within such models are represented as the product of unnormalized potential functions, normalized by a global summation/integration over all states of the random variables. This quantity (the partition function) and its gradient are intractable for all but the most trivial instances, although they can be estimated by Markov chain Monte Carlo (MCMC) methods. Mixing poses a significant problem for learning algorithms that rely on MCMC [3, 5].
Deep belief networks (DBNs) [16] are hybrid models containing a single undirected layer and several directed layers. While a fast approximate layer-wise training criterion exists, DBNs incur the computational difficulties associated with both undirected and directed models.
Alternative criteria that do not approximate or bound the log-likelihood have also been proposed, such as score matching [18] and noise-contrastive estimation (NCE) [13]. Both of these require the learned probability density to be analytically specified up to a normalization constant. Note that in many interesting generative models with several layers of latent variables (such as DBNs and DBMs), it is not even possible to derive a tractable unnormalized probability density. Some models such as denoising auto-encoders [30] and contractive autoencoders have learning rules very similar to score matching applied to RBMs. In NCE, as in this work, a discriminative training criterion is employed to fit a generative model. However, rather than fitting a separate discriminative model, the generative model itself is used to discriminate generated data from samples from a fixed noise distribution. Because NCE uses a fixed noise distribution, learning slows dramatically after the model has learned even an approximately correct distribution over a small subset of the observed variables.
Finally, some techniques do not involve defining a probability distribution explicitly, but rather train a generative machine to draw samples from the desired distribution. This approach has the advantage that such machines can be designed to be trained by back-propagation. Prominent recent work in this area includes the generative stochastic network (GSN) framework [5], which extends generalized denoising auto-encoders [4]: both can be seen as defining a parameterized Markov chain, i.e., one learns the parameters of a machine that performs one step of a generative Markov chain. Compared to GSNs, the adversarial nets framework does not require a Markov chain for sampling. Because adversarial nets do not require feedback loops during generation, they are better able to leverage piecewise linear units [19, 9, 10], which improve the performance of backpropagation but have problems with unbounded activation when used in a feedback loop. More recent examples of training a generative machine by back-propagating into it include recent work on auto-encoding variational Bayes [20] and stochastic backpropagation [24].
The adversarial modeling framework is most straightforward to apply when the models are both multilayer perceptrons. To learn the generator's distribution pg over data x, we define a prior on input noise variables pz(z), then represent a mapping to data space as G(z; θg), where G is a differentiable function represented by a multilayer perceptron with parameters θg. We also define a second multilayer perceptron D(x; θd) that outputs a single scalar. D(x) represents the probability that x came from the data rather than pg. We train D to maximize the probability of assigning the correct label to both training examples and samples from G. We simultaneously train G to minimize log(1 - D(G(z))):
In other words, D and G play the following two-player minimax game with value function V(G, D):
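The value function (Eq. 1 in the paper) did not survive extraction; reconstructed from the published paper:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$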
In the next section, we present a theoretical analysis of adversarial nets, essentially showing that the training criterion allows one to recover the data generating distribution as G and D are given enough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical explanation of the approach. In practice, we must implement the game using an iterative, numerical approach. Optimizing D to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, so long as G changes slowly enough. This strategy is analogous to the way that SML/PCD [31, 29] training maintains samples from a Markov chain from one learning step to the next in order to avoid burning in a Markov chain as part of the inner loop of learning. The procedure is formally presented in Algorithm 1.
In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, log(1 - D(G(z))) saturates. Rather than training G to minimize log(1 - D(G(z))) we can train G to maximize log D(G(z)). This objective function results in the same fixed point of the dynamics of G and D but provides much stronger gradients early in learning.
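A minimal TensorFlow sketch of the losses discussed here (an illustration, not the paper's code; `d_real` and `d_fake` are stand-ins for the discriminator's outputs D(x) and D(G(z))):

```
import tensorflow as tf

# d_real, d_fake: discriminator probabilities D(x) and D(G(z)), shape [batch, 1].
d_real = tf.placeholder(tf.float32, [None, 1])
d_fake = tf.placeholder(tf.float32, [None, 1])
eps = 1e-8  # numerical safety inside the logs

# Discriminator: maximize log D(x) + log(1 - D(G(z))).
d_loss = -tf.reduce_mean(tf.log(d_real + eps) + tf.log(1. - d_fake + eps))

# Generator, minimax form: minimize log(1 - D(G(z))); saturates when D is confident.
g_loss_minimax = tf.reduce_mean(tf.log(1. - d_fake + eps))

# Generator, non-saturating form: maximize log D(G(z)); stronger early gradients.
g_loss_ns = -tf.reduce_mean(tf.log(d_fake + eps))
```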
Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution (D, blue, dashed line) so that it discriminates between samples from the data generating distribution (black, dotted line) px from those of the generative distribution pg (G) (green, solid line). The lower horizontal line is the domain from which z is sampled, in this case uniformly. The horizontal line above is part of the domain of x. The upward arrows show how the mapping x = G(z) imposes the non-uniform distribution pg on transformed samples. G contracts in regions of high density and expands in regions of low density of pg. (a) Consider an adversarial pair near convergence: pg is similar to pdata and D is a partially accurate classifier. (b) In the inner loop of the algorithm D is trained to discriminate samples from data, converging to D*(x) = pdata(x)/(pdata(x)+pg(x)). (c) After an update to G, gradient of D has guided G(z) to flow to regions that are more likely to be classified as data. (d) After several steps of training, if G and D have enough capacity, they will reach a point at which both cannot improve because pg = pdata. The discriminator is unable to differentiate between the two distributions, i.e. D(x) = 1/2.
The generator G implicitly defines a probability distribution pg as the distribution of the samples G(z) obtained when z ~ pz. Therefore, we would like Algorithm 1 to converge to a good estimator of pdata, if given enough capacity and training time. The results of this section are done in a nonparametric setting, e.g. we represent a model with infinite capacity by studying convergence in the space of probability density functions.
We first consider the optimal discriminator D for any given generator G.
Proposition 1. For G fixed, the optimal discriminator D is:
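The formula (Eq. 2 in the paper), reconstructed from the published paper since the equation image was lost:

$$D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$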
Proof. The training criterion for the discriminator D, given any generator G, is to maximize the quantity V(G, D):
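The maximized quantity (Eq. 3 in the paper), likewise reconstructed:

$$V(G, D) = \int_x p_{\text{data}}(x) \log(D(x))\,dx + \int_z p_z(z) \log(1 - D(g(z)))\,dz = \int_x \left[ p_{\text{data}}(x) \log(D(x)) + p_g(x) \log(1 - D(x)) \right] dx$$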
For any (a, b) ∈ R² \ {(0, 0)}, the function y → a log(y) + b log(1 − y) achieves its maximum in [0, 1] at a/(a + b). The discriminator does not need to be defined outside of Supp(pdata) ∪ Supp(pg), concluding the proof.
Note that the training objective for D can be interpreted as maximizing the log-likelihood for estimating the conditional probability P(Y = y | x), where Y indicates whether x comes from pdata (with y = 1) or from pg (with y = 0). The minimax game in Eq. 1 can now be reformulated as:
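The reformulated criterion (Eq. 4 in the paper), reconstructed:

$$C(G) = \max_D V(G, D) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D^*_G(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D^*_G(x))]$$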
Theorem 1. The global minimum of the virtual training criterion C(G) is achieved if and only if pg = pdata. At that point, C(G) achieves the value -log 4.
Proof. For pg = pdata, D*_G(x) = 1/2 (consider Eq. 2). Hence, by inspecting Eq. 4 at D*_G(x) = 1/2, we find C(G) = log 1/2 + log 1/2 = -log 4. To see that this is the best possible value of C(G), reached only for pg = pdata, observe that
and that by subtracting this expression from C(G) = V(D*_G, G), we obtain:
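The expression (Eq. 5 in the paper), reconstructed since the equation image was lost:

$$C(G) = -\log(4) + KL\left(p_{\text{data}} \,\middle\|\, \frac{p_{\text{data}} + p_g}{2}\right) + KL\left(p_g \,\middle\|\, \frac{p_{\text{data}} + p_g}{2}\right)$$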
where KL is the Kullback–Leibler divergence. We recognize in the previous expression the Jensen–Shannon divergence between the model's distribution and the data generating process:
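And Eq. 6, likewise reconstructed:

$$C(G) = -\log(4) + 2 \cdot \mathrm{JSD}(p_{\text{data}} \,\|\, p_g)$$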
Since the Jensen–Shannon divergence between two distributions is always non-negative and zero only when they are equal, we have shown that C = -log(4) is the global minimum of C(G) and that the only solution is pg = pdata, i.e., the generative model perfectly replicating the data generating process.
Proposition 2. If G and D have enough capacity, and at each step of Algorithm 1, the discriminator is allowed to reach its optimum given G, and pg is updated so as to improve the criterion
then pg converges to pdata.
Proof. Consider V(G, D) = U(pg, D) as a function of pg as done in the above criterion. Note that U(pg, D) is convex in pg. The subderivatives of a supremum of convex functions include the derivative of the function at the point where the maximum is attained. In other words, if f(x) = sup_{α∈A} f_α(x) and f_α(x) is convex in x for every α, then ∂f_β(x) ∈ ∂f if β = arg sup_{α∈A} f_α(x). This is equivalent to computing a gradient descent update for pg at the optimal D given the corresponding G. sup_D U(pg, D) is convex in pg with a unique global optimum as proven in Thm 1, therefore with sufficiently small updates of pg, pg converges to px, concluding the proof.
In practice, adversarial nets represent a limited family of pg distributions via the function G(z; θg), and we optimize θg rather than pg itself. Using a multilayer perceptron to define G introduces multiple critical points in parameter space. However, the excellent performance of multilayer perceptrons in practice suggests that they are a reasonable model to use despite their lack of theoretical guarantees.
We trained adversarial nets on a range of datasets including MNIST [23], the Toronto Face Database (TFD) [28], and CIFAR-10 [21]. The generator nets used a mixture of rectifier linear activations [19, 9] and sigmoid activations, while the discriminator net used maxout [10] activations. Dropout [17] was applied in training the discriminator net. While our theoretical framework permits the use of dropout and other noise at intermediate layers of the generator, we used noise as the input to only the bottommost layer of the generator network.
We estimate probability of the test set data under pg by fitting a Gaussian Parzen window to the samples generated with G and reporting the log-likelihood under this distribution. The σ parameter of the Gaussians was obtained by cross validation on the validation set. This procedure was introduced in Breuleux et al. [8] and used for various generative models for which the exact likelihood is not tractable [25, 3, 5]. Results are reported in Table 1. This method of estimating the likelihood has somewhat high variance and does not perform well in high dimensional spaces but it is the best method available to our knowledge. Advances in generative models that can sample but not estimate likelihood directly motivate further research into how to evaluate such models.
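A small NumPy sketch of the Gaussian Parzen window estimate described here (an illustration of the stated procedure, not the authors' code; `sigma` would be chosen by cross-validation on a validation set):

```
import numpy as np

def parzen_log_likelihood(test_x, samples, sigma):
    """Mean log-likelihood of test points under a Gaussian Parzen window
    fit to generated samples. test_x: [n, d], samples: [m, d]."""
    n, d = test_x.shape
    # Squared distances between every test point and every sample: [n, m].
    diffs = test_x[:, None, :] - samples[None, :, :]
    sq_dist = np.sum(diffs ** 2, axis=2)
    # log p(x) = logsumexp_m(-||x - s_m||^2 / (2 sigma^2))
    #            - log(m) - (d/2) * log(2 pi sigma^2)
    exponents = -sq_dist / (2.0 * sigma ** 2)
    log_p = (np.logaddexp.reduce(exponents, axis=1)
             - np.log(samples.shape[0])
             - d * np.log(np.sqrt(2.0 * np.pi) * sigma))
    return log_p.mean()
```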
Table 1: Parzen window-based log-likelihood estimates. The reported numbers on MNIST are the mean log-likelihood of samples on test set, with the standard error of the mean computed across examples. On TFD, we computed the standard error across folds of the dataset, with a different σ chosen using the validation set of each fold. On TFD, σ was cross validated on each fold and mean log-likelihood on each fold was computed. For MNIST we compare against other models of the real-valued (rather than binary) version of dataset.
In Figures 2 and 3 we show samples drawn from the generator net after training. While we make no claim that these samples are better than samples generated by existing methods, we believe that these samples are at least competitive with the better generative models in the literature and highlight the potential of the adversarial framework.
Figure 2: Visualization of samples from the model. Rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distributions, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and "deconvolutional" generator)
Figure 3: Digits obtained by linearly interpolating between coordinates in z space of the full model.
This new framework comes with advantages and disadvantages relative to previous modeling frameworks. The disadvantages are primarily that there is no explicit representation of pg(x), and that D must be synchronized well with G during training (in particular, G must not be trained too much without updating D, in order to avoid "the Helvetica scenario" in which G collapses too many values of z to the same value of x to have enough diversity to model pdata), much as the negative chains of a Boltzmann machine must be kept up to date between learning steps. The advantages are that Markov chains are never needed, only backprop is used to obtain gradients, no inference is needed during learning, and a wide variety of functions can be incorporated into the model. Table 2 summarizes the comparison of generative adversarial nets with other generative modeling approaches.
| | Deep directed graphical models | Deep undirected graphical models | Generative autoencoders | Adversarial models |
|---|---|---|---|---|
| Training | Inference needed during training. | Inference needed during training. MCMC needed to approximate partition function gradient. | Enforced tradeoff between mixing and power of reconstruction generation | Synchronizing the discriminator with the generator. Helvetica. |
| Inference | Learned approximate inference | Variational inference | MCMC-based inference | Learned approximate inference |
| Sampling | No difficulties | Requires Markov chain | Requires Markov chain | No difficulties |
| Evaluating p(x) | Intractable, may be approximated with AIS | Intractable, may be approximated with AIS | Not explicitly represented, may be approximated with Parzen density estimation | Not explicitly represented, may be approximated with Parzen density estimation |
| Model design | Nearly all models incur extreme difficulty | Careful design needed to ensure multiple properties | Any differentiable function is theoretically permitted | Any differentiable function is theoretically permitted |

Table 2: Challenges in generative modeling: a summary of the difficulties encountered by different approaches to deep generative modeling for each of the major operations involving a model.
The aforementioned advantages are primarily computational. Adversarial models may also gain some statistical advantage from the generator network not being updated directly with data examples, but only with gradients flowing through the discriminator. This means that components of the input are not copied directly into the generator's parameters. Another advantage of adversarial networks is that they can represent very sharp, even degenerate distributions, while methods based on Markov chains require that the distribution be somewhat blurry in order for the chains to be able to mix between modes.
This framework admits many straightforward extensions:
A conditional generative model p(x | c) can be obtained by adding c as input to both G and D.
Learned approximate inference can be performed by training an auxiliary network to predict z given x. This is similar to the inference net trained by the wake-sleep algorithm [15] but with the advantage that the inference net may be trained for a fixed generator net after the generator net has finished training.
One can approximately model all conditionals p(x_S | x_∖S), where S is a subset of the indices of x, by training a family of conditional models that share parameters. Essentially, one can use adversarial nets to implement a stochastic extension of the deterministic MP-DBM [11].
Semi-supervised learning: features from the discriminator or inference net could improve performance of classifiers when limited labeled data is available.
This paper has demonstrated the viability of the adversarial modeling framework, suggesting that these research directions could prove useful.
We would like to acknowledge Patrice Marcotte, Olivier Delalleau, Kyunghyun Cho, Guillaume Alain and Jason Yosinski for helpful discussions. Yann Dauphin shared his Parzen window evaluation code with us. We would like to thank the developers of Pylearn2 [12] and Theano [7, 1], particularly Frédéric Bastien who rushed a Theano feature specifically to benefit this project. Arnaud Bergeron provided much-needed support with LaTeX typesetting. We would also like to thank CIFAR, and Canada Research Chairs for funding, and Compute Canada, and Calcul Québec for providing computational resources. Ian Goodfellow is supported by the 2013 Google Fellowship in Deep Learning. Finally, we would like to thank Les Trois Brasseurs for stimulating our creativity.
This post covers the constant-loss and non-convergence problem I hit when porting a TensorFlow network implemented with the low-level API to the high-level API.

TensorFlow versions move fast; look away for a moment and a more advanced API has already shipped.

The figure above shows the TensorFlow software stack. The network model I originally studied and implemented (on 0.12a) used the low-level API; the current version (1.10) wraps the low-level API into high-level APIs (Estimator, Keras), so I replaced the APIs in the original model.

The port itself went fairly smoothly, and afterwards the code was much shorter than before, but at runtime some strange behavior appeared.

I have two datasets of my own. One (130 values per sample) runs normally on the ported code; the other (60,000 values per sample) shows the following problem on the ported code:
```
epoch:0
Evaluation results: {'true_negatives': 48.0, 'global_step': 0, 'loss': 0.7953733, 'true_positives': 0.0, 'accuracy': 0.48, 'false_negatives': 52.0, 'false_positives': 0.0}
epoch:1
Evaluation results: {'true_negatives': 0.0, 'global_step': 1, 'loss': 0.79326165, 'true_positives': 52.0, 'accuracy': 0.52, 'false_negatives': 0.0, 'false_positives': 48.0}
epoch:2
Evaluation results: {'true_negatives': 0.0, 'global_step': 2, 'loss': 0.79326165, 'true_positives': 52.0, 'accuracy': 0.52, 'false_negatives': 0.0, 'false_positives': 48.0}
epoch:3
Evaluation results: {'true_negatives': 0.0, 'global_step': 3, 'loss': 0.79326165, 'true_positives': 52.0, 'accuracy': 0.52, 'false_negatives': 0.0, 'false_positives': 48.0}
epoch:4
Evaluation results: {'true_negatives': 0.0, 'global_step': 4, 'loss': 0.79326165, 'true_positives': 52.0, 'accuracy': 0.52, 'false_negatives': 0.0, 'false_positives': 48.0}
epoch:5
Evaluation results: {'true_negatives': 0.0, 'global_step': 5, 'loss': 0.79326165, 'true_positives': 52.0, 'accuracy': 0.52, 'false_negatives': 0.0, 'false_positives': 48.0}
```
Since the two datasets behave differently, debugging focused on diffing the two code paths. Call the code that trains normally T, and the code that trains abnormally F.

First, suspect the model code: replace as much of F with T as possible. After the substitution, only the tfrecord reading/parsing and the network input layer differ, yet retraining the updated F code shows the same symptom.
With the network largely ruled out, the only remaining difference is the data, so verify the data: record the input data during F's training to a file, then compare what F reads against the data used to build the tfrecords, checking both the sample indices and the contents. They turn out to be identical.

```
class _LoggerHook(tf.train.SessionRunHook):
    """Logs loss and runtime."""

    def begin(self):
        self._step = -1

    def before_run(self, run_context):
        self._step += 1
        # `features` comes from the surrounding input pipeline.
        return tf.train.SessionRunArgs(features)  # ask for the input features

    def after_run(self, run_context, run_values):
        if self._step == 2:
            logit_value = run_values.results
            print('step ' + str(self._step) + ', features = ' + str(logit_value))
            f1.write(logit_value['data'])  # write the input data to a file
            f1.flush()
            numpy.savetxt(r"/home/zq537/ckpt/ecg_data.txt", logit_value['data'])
            numpy.savetxt(r"/home/zq537/ckpt/index.txt", logit_value['name'])
```
If neither the network nor the data is at fault, debugging becomes harder: the next step has to go inside the model's training process to see which stage keeps the loss and accuracy frozen. I logged every op's outputs and the backpropagated gradients to TensorBoard and found that after a small amount of training, the gradient distributions all cluster around 0. With gradients near 0 the network weights barely update, and with unchanged parameters the accuracy and loss naturally stay constant. So what drives the gradients to 0? Gradients propagate backward layer by layer starting from the loss, and the loss is determined jointly by the predictions and the labels, so I inspected both. I first cut the dataset down to 10 samples, set the batch size to 10, and printed the stable-state (constant-loss) predictions and labels from a training hook. That exposed the problem: in the stable state the predicted two-class scores are mostly exactly 1 for both classes, which clearly points at the criterion handed to the network, i.e. the loss function:

```
labels = [[1 0]
 [1 0]
 [0 1]
 [1 0]
 [1 0]
 [0 1]
 [0 1]
 [0 1]
 [0 1]
 [0 1]]

logits = [[1.0000000e+00 1.0000000e+00]
 [1.0000000e+00 1.0000000e+00]
 [1.0000000e+00 1.0000000e+00]
 [1.0000000e+00 1.4896728e-36]
 [1.0000000e+00 5.1295980e-18]
 [1.0000000e+00 1.0000000e+00]
 [1.0000000e+00 1.0000000e+00]
 [1.0000000e+00 1.0000000e+00]
 [1.0000000e+00 1.0000000e+00]
 [1.0000000e+00 1.0000000e+00]]
```
The loss function used in the new code was `tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)`. I searched for and replaced it with the loss function used in TensorFlow's official models, `tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)`, retrained, and everything behaved normally.
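One plausible reading of the dump above (the post does not dig further, so treat this as an assumption): the values handed to the loss had already been squashed by an activation, and softmax cross entropy computed on top of near-saturated "logits" yields vanishing gradients. A small sketch of that effect:

```
import tensorflow as tf

labels = tf.constant([[1., 0.], [0., 1.]])
raw_logits = tf.constant([[3., -3.], [-3., 3.]])

# Correct usage: feed raw logits; the loss applies softmax internally.
loss_ok = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=raw_logits)

# Buggy usage: feed already-activated outputs as if they were logits.
activated = tf.nn.softmax(raw_logits)
loss_bad = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=activated)

grad_ok, = tf.gradients(loss_ok, [raw_logits])
grad_bad, = tf.gradients(loss_bad, [raw_logits])

with tf.Session() as sess:
    g_ok, g_bad = sess.run([grad_ok, grad_bad])
    print(g_ok)   # healthy gradient
    print(g_bad)  # much smaller: the extra softmax flattens it
```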
The first solution found online was ConcurrentLogHandler, but it has been unmaintained since 2013, and the code would not even run once merged in. So I went looking for other options (and wasted quite a bit of time here); as it turns out, the ConcurrentLogHandler homepage already names the replacement library.
Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.
Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128-bytes per face.
On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result [15] by 30% on both datasets.
We also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible to each other and allow for direct comparison between each other.
In this paper we present a unified system for face verification (is this the same person), recognition (who is this person) and clustering (find common people among these faces). Our method is based on learning a Euclidean embedding per image using a deep convolutional network. The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity: faces of the same person have small distances and faces of distinct people have large distances.
Once this embedding has been produced, then the aforementioned tasks become straight-forward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.
Previous face recognition approaches based on deep networks use a classification layer [15, 17] trained over a set of known face identities and then take an intermediate bottle-neck layer as a representation used to generalize recognition beyond the set of identities used in training. The downsides of this approach are its indirectness and its inefficiency: one has to hope that the bottleneck representation generalizes well to new faces; and by using a bottleneck layer the representation size per face is usually very large (1000s of dimensions). Some recent work [15] has reduced this dimensionality using PCA, but this is a linear transformation that can be easily learnt in one layer of the network.
In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN [19]. Our triplets consist of two matching face thumbnails and a non-matching face thumbnail and the loss aims to separate the positive pair from the negative by a distance margin. The thumbnails are tight crops of the face area; no 2D or 3D alignment, other than scale and translation, is performed.
Choosing which triplets to use turns out to be very important for achieving good performance and, inspired by curriculum learning [1], we present a novel online negative exemplar mining strategy which ensures consistently increasing difficulty of triplets as the network trains. To improve clustering accuracy, we also explore hard-positive mining techniques which encourage spherical clusters for the embeddings of a single person.
Figure 1. Illumination and Pose invariance. Pose and illumination have been a long standing problem in face recognition. This figure shows the output distances of FaceNet between pairs of faces of the same and a different person in different pose and illumination combinations. A distance of 0.0 means the faces are identical, 4.0 corresponds to the opposite spectrum, two different identities. You can see that a threshold of 1.1 would classify every pair correctly.
As an illustration of the incredible variability that our method can handle see Figure 1. Shown are image pairs from PIE [13] that previously were considered to be very difficult for face verification systems.
An overview of the rest of the paper is as follows: in section 2 we review the literature in this area; section 3.1 defines the triplet loss and section 3.2 describes our novel triplet selection and training procedure; in section 3.3 we describe the model architecture used. Finally in section 4 and 5 we present some quantitative results of our embeddings and also qualitatively explore some clustering results.
Similarly to other recent works which employ deep networks [15, 17], our approach is a purely data driven method which learns its representation directly from the pixels of the face. Rather than using engineered features, we use a large dataset of labelled faces to attain the appropriate invariances to pose, illumination, and other variational conditions.
In this paper we explore two different deep network architectures that have been recently used to great success in the computer vision community. Both are deep convolutional networks [8, 11]. The first architecture is based on the Zeiler&Fergus [22] model which consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers. We additionally add several 1×1×d convolution layers inspired by the work of [9]. The second architecture is based on the Inception model of Szegedy et al. which was recently used as the winning approach for ImageNet 2014 [16]. These networks use mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. We have found that these models can reduce the number of parameters by up to 20 times and have the potential to reduce the number of FLOPS required for comparable performance.
There is a vast corpus of face verification and recognition works. Reviewing it is out of the scope of this paper so we will only briefly discuss the most relevant recent work.
The works of [15, 17, 23] all employ a complex system of multiple stages, that combines the output of a deep convolutional network with PCA for dimensionality reduction and an SVM for classification.
Zhenyao et al. [23] employ a deep network to "warp" faces into a canonical frontal view and then learn CNN that classifies each face as belonging to a known identity. For face verification, PCA on the network output in conjunction with an ensemble of SVMs is used.
Taigman et al. [17] propose a multi-stage approach that aligns faces to a general 3D shape model. A multi-class network is trained to perform the face recognition task on over four thousand identities. The authors also experimented with a so called Siamese network where they directly optimize the L1-distance between two face features. Their best performance on LFW (97.35%) stems from an ensemble of three networks using different alignments and color channels. The predicted distances (non-linear SVM predictions based on the χ² kernel) of those networks are combined using a non-linear SVM.
Sun et al. [14, 15] propose a compact and therefore relatively cheap to compute network. They use an ensemble of 25 of these network, each operating on a different face patch. For their final performance on LFW (99.47% [15]) the authors combine 50 responses (regular and flipped). Both PCA and a Joint Bayesian model [2] that effectively correspond to a linear transform in the embedding space are employed. Their method does not require explicit 2D/3D alignment. The networks are trained by using a combination of classification and verification loss. The verification loss is similar to the triplet loss we employ [12, 19], in that it minimizes the L2-distance between faces of the same identity and enforces a margin between the distance of faces of different identities. The main difference is that only pairs of images are compared, whereas the triplet loss encourages a relative distance constraint.
A similar loss to the one used here was explored in Wang et al. [18] for ranking images by semantic and visual similarity.
FaceNet uses a deep convolutional network. We discuss two different core architectures: The Zeiler&Fergus [22] style networks and the recent Inception [16] type networks. The details of these networks are described in section 3.3.
Figure 2. Model structure. Our network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding. This is followed by the triplet loss during training.
Given the model details, and treating it as a black box (see Figure 2), the most important part of our approach lies in the end-to-end learning of the whole system. To this end we employ the triplet loss that directly reflects what we want to achieve in face verification, recognition and clustering. Namely, we strive for an embedding f(x), from an image x into a feature space Rd, such that the squared distance between all faces, independent of imaging conditions, of the same identity is small, whereas the squared distance between a pair of face images from different identities is large.
Although we did not directly compare to other losses, e.g. the one using pairs of positives and negatives, as used in [14] Eq. (2), we believe that the triplet loss is more suitable for face verification. The motivation is that the loss from [14] encourages all faces of one identity to be projected onto a single point in the embedding space. The triplet loss, however, tries to enforce a margin between each pair of faces from one person to all other faces. This allows the faces for one identity to live on a manifold, while still enforcing the distance and thus discriminability to other identities.
The following section describes this triplet loss and how it can be learned efficiently at scale.
Figure 3. The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.
The embedding is represented by f(x) ∈ R^d. It embeds an image x into a d-dimensional Euclidean space. Additionally, we constrain this embedding to live on the d-dimensional hypersphere, i.e. ||f(x)||_2 = 1. This loss is motivated in [19] in the context of nearest-neighbor classification. Here we want to ensure that an image x_i^a (anchor) of a specific person is closer to all other images x_i^p (positive) of the same person than it is to any image x_i^n (negative) of any other person. This is visualized in Figure 3.
where α is a margin that is enforced between positive and negative pairs. T is the set of all possible triplets in the training set and has cardinality N. The loss being minimized is then:
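The triplet constraint (Eq. 1) and the loss (Eq. 2), reconstructed from the published paper since the equation images were lost:

$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2, \quad \forall \left(x_i^a, x_i^p, x_i^n\right) \in \mathcal{T}$$

$$L = \sum_i^N \left[\, \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \,\right]_+$$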
Generating all possible triplets would result in many triplets that are easily satisfied (i.e. fulfill the constraint in Eq. (1)). These triplets would not contribute to the training and result in slower convergence, as they would still be passed through the network. It is crucial to select hard triplets, that are active and can therefore contribute to improving the model. The following section talks about the different approaches we use for the triplet selection.
In order to ensure fast convergence it is crucial to select triplets that violate the triplet constraint in Eq. (1). This means that, given x_i^a, we want to select an x_i^p (hard positive) that maximizes ||f(x_i^a) - f(x_i^p)||_2^2, and similarly an x_i^n (hard negative) that minimizes ||f(x_i^a) - f(x_i^n)||_2^2.
It is infeasible to compute the argmin and argmax across the whole training set. Additionally, it might lead to poor training, as mislabelled and poorly imaged faces would dominate the hard positives and negatives. There are two obvious choices that avoid this issue:
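The two choices (the bullet list did not survive extraction; restored here from the published paper):

- Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.
- Generate triplets online, by selecting the hard positive/negative exemplars from within a mini-batch.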
Here, we focus on the online generation and use large mini-batches in the order of a few thousand exemplars and only compute the argmin and argmax within a mini-batch.
To have a meaningful representation of the anchor-positive distances, it needs to be ensured that a minimal number of exemplars of any one identity is present in each mini-batch. In our experiments we sample the training data such that around 40 faces are selected per identity per mini-batch. Additionally, randomly sampled negative faces are added to each mini-batch.
Instead of picking the hardest positive, we use all anchor-positive pairs in a mini-batch while still selecting the hard negatives. We don't have a side-by-side comparison of hard anchor-positive pairs versus all anchor-positive pairs within a mini-batch, but we found in practice that the all anchor-positive method was more stable and converged slightly faster at the beginning of training.
We also explored the offline generation of triplets in conjunction with the online generation and it may allow the use of smaller batch sizes, but the experiments were inconclusive.
Selecting the hardest negatives can in practice lead to bad local minima early on in training, specifically it can result in a collapsed model (i.e. f(x) = 0). In order to mitigate this, it helps to select x_i^n such that:
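The selection condition, reconstructed from the published paper since the equation image was lost:

$$\|f(x_i^a) - f(x_i^p)\|_2^2 < \|f(x_i^a) - f(x_i^n)\|_2^2$$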
We call these negative exemplars semi-hard, as they are further away from the anchor than the positive exemplar, but still hard because the squared distance is close to the anchor-positive distance. Those negatives lie inside the margin α.
As mentioned before, correct triplet selection is crucial for fast convergence. On the one hand we would like to use small mini-batches as these tend to improve convergence during Stochastic Gradient Descent (SGD) [20]. On the other hand, implementation details make batches of tens to hundreds of exemplars more efficient. The main constraint with regards to the batch size, however, is the way we select hard relevant triplets from within the mini-batches. In most experiments we use a batch size of around 1,800 exemplars.
In all our experiments we train the CNN using Stochastic Gradient Descent (SGD) with standard backprop [8, 11] and AdaGrad [5]. In most experiments we start with a learning rate of 0.05 which we lower to finalize the model. The models are initialized from random, similar to [16], and trained on a CPU cluster for 1,000 to 2,000 hours. The decrease in the loss (and increase in accuracy) slows down drastically after 500h of training, but additional training can still significantly improve performance. The margin α is set to 0.2.
We used two types of architectures and explore their trade-offs in more detail in the experimental section. Their practical differences lie in the difference of parameters and FLOPS. The best model may be different depending on the application. E.g. a model running in a datacenter can have many parameters and require a large number of FLOPS, whereas a model running on a mobile phone needs to have few parameters, so that it can fit into memory. All our models use rectified linear units as the non-linear activation function.
Table 1. NN1. This table shows the structure of our Zeiler&Fergus [22] based model with 1×1 convolutions inspired by [9]. The input and output sizes are described in rows×cols×#filters. The kernel is specified as rows×cols, stride, and the maxout [6] pooling size as p = 2.
The first category, shown in Table 1, adds 1×1×d convolutional layers, as suggested in [9], between the standard convolutional layers of the Zeiler&Fergus [22] architecture and results in a model 22 layers deep. It has a total of 140 million parameters and requires around 1.6 billion FLOPS per image.
Table 2. NN2. Details of the NN2 Inception incarnation. This model is almost identical to the one described in [16]. The two major differences are the use of L2 pooling instead of max pooling (m), where specified. I.e. instead of taking the spatial max the L2 norm is computed. The pooling is always 3×3 (aside from the final average pooling) and in parallel to the convolutional modules inside each Inception module. If there is a dimensionality reduction after the pooling it is denoted with p. 1×1, 3×3, and 5×5 pooling outputs are then concatenated to get the final output.
Figure 4. FLOPS vs. Accuracy trade-off. Shown is the trade-off between FLOPS and accuracy for a wide range of different model sizes and architectures. Highlighted are the four models that we focus on in our experiments.
The second category we use is based on GoogLeNet style Inception models [16]. These models have 20× fewer parameters (around 6.6M-7.5M) and up to 5× fewer FLOPS (between 500M-1.6B). Some of these models are dramatically reduced in size (both depth and number of filters), so that they can be run on a mobile phone. One, NNS1, has 26M parameters and only requires 220M FLOPS per image. The other, NNS2, has 4.3M parameters and 20M FLOPS. Table 2 describes NN2, our largest network, in detail. NN3 is identical in architecture but has a reduced input size of 160x160. NN4 has an input size of only 96x96, thereby drastically reducing the CPU requirements (285M FLOPS vs 1.6B for NN2). In addition to the reduced input size it does not use 5x5 convolutions in the higher layers as the receptive field is already too small by then. Generally we found that the 5x5 convolutions can be removed throughout with only a minor drop in accuracy. Figure 4 compares all our models.
We evaluate our method on four datasets and with the exception of Labelled Faces in the Wild and YouTube Faces we evaluate our method on the face verification task. I.e. given a pair of two face images, a squared L2 distance threshold D(xi, xj) is used to determine the classification of same and different. All face pairs (i, j) of the same identity are denoted with Psame, whereas all pairs of different identities are denoted with Pdiff. We define the set of all true accepts and the set of all false accepts at threshold d as

TA(d) = { (i, j) ∈ Psame, with D(xi, xj) ≤ d }

FA(d) = { (i, j) ∈ Pdiff, with D(xi, xj) ≤ d }
The validation rate VAL(d) and the false accept rate FAR(d) for a given face distance d are then defined as

VAL(d) = |TA(d)| / |Psame|,    FAR(d) = |FA(d)| / |Pdiff|
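These definitions translate directly into a few lines of numpy; the function and toy data below are illustrative, not from the paper:

```python
import numpy as np

def val_far(distances, is_same, d):
    """Compute VAL(d) and FAR(d) from squared L2 distances.

    distances: array of D(xi, xj) for each evaluated pair
    is_same:   boolean array, True if the pair shares an identity (Psame)
    d:         distance threshold
    """
    accept = distances <= d            # pairs classified as "same"
    ta = np.sum(accept & is_same)      # |TA(d)|
    fa = np.sum(accept & ~is_same)     # |FA(d)|
    val = ta / np.sum(is_same)         # VAL(d) = |TA(d)| / |Psame|
    far = fa / np.sum(~is_same)        # FAR(d) = |FA(d)| / |Pdiff|
    return val, far

# toy example
dists = np.array([0.5, 1.3, 0.9, 2.0])
same = np.array([True, True, False, False])
print(val_far(dists, same, d=1.242))
```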
We keep a hold out set of around one million images, that has the same distribution as our training set, but disjoint identities. For evaluation we split it into five disjoint sets of 200k images each. The FAR and VAL rate are then computed on 100k*100k image pairs. Standard error is reported across the five splits.

This is a test set with similar distribution to our training set, but has been manually verified to have very clean labels. It consists of three personal photo collections with a total of around 12k images. We compute the FAR and VAL rate across all 12k squared pairs of images.

Labeled Faces in the Wild (LFW) is the de-facto academic test set for face verification [7]. We follow the standard protocol for unrestricted, labeled outside data and report the mean classification accuracy as well as the standard error of the mean.

Youtube Faces DB [21] is a new dataset that has gained popularity in the face recognition community [17, 15]. The setup is similar to LFW, but instead of verifying pairs of images, pairs of videos are used.

If not mentioned otherwise we use between 100M-200M training face thumbnails consisting of about 8M different identities. A face detector is run on each image and a tight bounding box around each face is generated. These face thumbnails are resized to the input size of the respective network. Input sizes range from 96x96 pixels to 224x224 pixels in our experiments.
Before diving into the details of more specific experiments we will discuss the trade-off of accuracy versus number of FLOPS that a particular model requires. Figure 4 shows the FLOPS on the x-axis and the accuracy at 0.001 false accept rate (FAR) on our user labelled test-data set from section 4.2. It is interesting to see the strong correlation between the computation a model requires and the accuracy it achieves. The figure highlights the five models (NN1, NN2, NN3, NNS1, NNS2) that we discuss in more detail in our experiments.

We also looked into the accuracy trade-off with regards to the number of model parameters. However, the picture is not as clear in that case. For example, the Inception based model NN2 achieves a comparable performance to NN1, but only has a 20th of the parameters. The number of FLOPS is comparable, though. Obviously at some point the performance is expected to decrease, if the number of parameters is reduced further. Other model architectures may allow further reductions without loss of accuracy, just like Inception [16] did in this case.
We now discuss the performance of our four selected models in more detail. On the one hand we have our traditional Zeiler&Fergus based architecture with 1*1 convolutions [22, 9] (see Table 1). On the other hand we have Inception [16] based models that dramatically reduce the model size. Overall, in the final performance the top models of both architectures perform comparably. However, some of our Inception based models, such as NN3, still achieve good performance while significantly reducing both the FLOPS and the model size.
Table 3. Network Architectures. This table compares the performance of our model architectures on the hold out test set (see section 4.1). Reported is the mean validation rate VAL at 10E-3 false accept rate. Also shown is the standard error of the mean across the five test splits.
Figure 5. Network Architectures. This plot shows the complete ROC for the four different models on our personal photos test set from section 4.2. The sharp drop at 10E-4 FAR can be explained by noise in the groundtruth labels. The models in order of performance are: NN2: 224*224 input Inception based model; NN1: Zeiler&Fergus based network with 1*1 convolutions; NNS1: small Inception style model with only 220M FLOPS; NNS2: tiny Inception model with only 20M FLOPS.

The detailed evaluation on our personal photos test set is shown in Figure 5. While the largest model achieves a dramatic improvement in accuracy compared to the tiny NNS2, the latter can be run at 30ms / image on a mobile phone and is still accurate enough to be used in face clustering. The sharp drop in the ROC for FAR < 10E-4 indicates noisy labels in the test data groundtruth. At extremely low false accept rates a single mislabeled image can have a significant impact on the curve.
Table 4. Image Quality. The table on the left shows the effect on the validation rate at 10E-3 precision with varying JPEG quality. The one on the right shows how the image size in pixels affects the validation rate at 10E-3 precision. This experiment was done with NN1 on the first split of our test hold-out dataset.
Table 4 shows the robustness of our model across a wide range of image sizes. The network is surprisingly robust with respect to JPEG compression and performs very well down to a JPEG quality of 20. The performance drop is very small for face thumbnails down to a size of 120x120 pixels and even at 80x80 pixels it shows acceptable performance. This is notable, because the network was trained on 220x220 input images. Training with lower resolution faces could improve this range further.

Table 5. Embedding Dimensionality. This table compares the effect of the embedding dimensionality of our model NN1 on our hold-out set from section 4.1. In addition to the VAL at 10E-3 we also show the standard error of the mean computed across five splits.
We explored various embedding dimensionalities and selected 128 for all experiments other than the comparison reported in Table 5. One would expect the larger embeddings to perform at least as well as the smaller ones; however, it is possible that they require more training to achieve the same accuracy. That said, the differences in the performance reported in Table 5 are statistically insignificant.
It should be noted, that during training a 128 dimensional float vector is used, but it can be quantized to 128 bytes without loss of accuracy. Thus each face is compactly represented by a 128 dimensional byte vector, which is ideal for large scale clustering and recognition. Smaller embeddings are possible at a minor loss of accuracy and could be employed on mobile devices.
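The paper does not spell out the quantization scheme; one plausible sketch, assuming each component of the L2-normalized embedding lies in [-1, 1] and using one byte per dimension, is:

```python
import numpy as np

def quantize_embedding(e):
    """Map a float32 embedding with components in [-1, 1] to 128 bytes."""
    return np.clip((e + 1.0) * 127.5, 0, 255).astype(np.uint8)

def dequantize_embedding(q):
    return q.astype(np.float32) / 127.5 - 1.0

e = np.random.randn(128).astype(np.float32)
e /= np.linalg.norm(e)                 # FaceNet embeddings are L2-normalized
q = quantize_embedding(e)              # 128 bytes per face
print(np.max(np.abs(dequantize_embedding(q) - e)))   # small reconstruction error
```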
Table 6. Training Data Size. This table compares the performance after 700h of training for a smaller model with 96x96 pixel inputs. The model architecture is similar to NN2, but without the 5x5 convolutions in the Inception modules.

Table 6 shows the impact of large amounts of training data. Due to time constraints this evaluation was run on a smaller model; the effect may be even larger on larger models. It is clear that using tens of millions of exemplars results in a clear boost of accuracy on our personal photo test set from section 4.2. Compared to only millions of images the relative reduction in error is 60%. Using another order of magnitude more images (hundreds of millions) still gives a small boost, but the improvement tapers off.

We evaluate our model on LFW using the standard protocol for unrestricted, labeled outside data. Nine training splits are used to select the L2-distance threshold. Classification (same or different) is then performed on the tenth test split. The selected optimal threshold is 1.242 for all test splits except split eighth (1.256).
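As a rough numpy sketch of this threshold-selection protocol (toy data; all names are hypothetical):

```python
import numpy as np

def best_threshold(distances, is_same, candidates=np.arange(0.5, 2.5, 0.001)):
    """Pick the squared-L2 threshold with the highest accuracy on the training splits."""
    accuracies = [np.mean((distances <= t) == is_same) for t in candidates]
    return candidates[int(np.argmax(accuracies))]

# concatenate the nine training splits, select the threshold there,
# then classify the held-out tenth split with it
rng = np.random.default_rng(0)
train_d = np.concatenate([rng.uniform(0.0, 1.3, 300), rng.uniform(1.2, 2.5, 300)])
train_same = np.array([True] * 300 + [False] * 300)
print(best_threshold(train_d, train_same))   # lands near the class boundary (~1.2-1.3)
```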
Our model is evaluated in two modes:

1. Fixed center crop of the LFW provided thumbnail.
2. A proprietary face detector (similar to Picasa [3]) is run on the provided LFW thumbnails. If it fails to align the face (this happens for two images), the LFW alignment is used.
Figure 6. LFW errors. This shows all pairs of images that were incorrectly classified on LFW. Only eight of the 13 false rejects shown here are actual errors; the other five are mislabeled in LFW.

Figure 6 gives an overview of all failure cases. It shows false accepts on the top as well as false rejects at the bottom. We achieve a classification accuracy of 98.87% ±0.15 when using the fixed center crop described in (1) and the record-breaking 99.63% ±0.09 standard error of the mean when using the extra face alignment (2). This reduces the error reported for DeepFace in [17] by more than a factor of 7 and the previous state-of-the-art reported for DeepId2+ in [15] by 30%. This is the performance of model NN1, but even the much smaller NN3 achieves performance that is not statistically significantly different.

We use the average similarity of all pairs of the first one hundred frames that our face detector detects in each video. This gives us a classification accuracy of 95.12% ±0.39. Using the first one thousand frames results in 95.18%. Compared to [17] 91.4% who also evaluate one hundred frames per video we reduce the error rate by almost half. DeepId2+ [15] achieved 93.2% and our method reduces this error by 30%, comparable to our improvement on LFW.
Figure 7. Face Clustering. Shown is an exemplar cluster for one user. All these images in the user's personal photo collection were clustered together.
Our compact embedding lends itself to be used in order to cluster a user's personal photos into groups of people with the same identity. The constraints in assignment imposed by clustering faces, compared to the pure verification task, lead to truly amazing results. Figure 7 shows one cluster in a user's personal photo collection, generated using agglomerative clustering. It is a clear showcase of the incredible invariance to occlusion, lighting, pose and even age.
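The clustering details are not given in the paper; a minimal sketch with scikit-learn's agglomerative clustering on toy embeddings (the distance threshold is a hypothetical value in the spirit of the verification threshold) could look like this:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# embeddings: one 128-D L2-normalized vector per face (toy data here)
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((50, 128)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# merge faces while the euclidean distance stays below a verification-style threshold
clusterer = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.1, linkage="average")
labels = clusterer.fit_predict(embeddings)
print(labels)   # one integer identity id per face
```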
We provide a method to directly learn an embedding into a Euclidean space for face verification. This sets it apart from other methods [15, 17] which use the CNN bottleneck layer, or require additional post-processing such as concatenation of multiple models and PCA, as well as SVM classification. Our end-to-end training both simplifies the setup and shows that directly optimizing a loss relevant to the task at hand improves performance.

Another strength of our model is that it only requires minimal alignment (a tight crop around the face area). [17], for example, performs a complex 3D alignment. We also experimented with a similarity transform alignment and notice that this can actually improve performance slightly. It is not clear if it is worth the extra complexity.
Future work will focus on better understanding of the error cases, further improving the model, and also reducing model size and reducing CPU requirements. We will also look into ways of improving the currently extremely long training times, e.g. variations of our curriculum learning with smaller batch sizes and offline as well as online positive and negative mining.

In this section we introduce the concept of harmonic embeddings. By this we denote a set of embeddings that are generated by different models v1 and v2 but are compatible in the sense that they can be compared to each other.

Figure 8. Harmonic Embedding Compatibility. These ROCs show the compatibility of the harmonic embeddings of NN2 to the embeddings of NN1. NN2 is an improved model that performs much better than NN1. When comparing embeddings generated by NN1 to the harmonic ones generated by NN2 we can see the compatibility between the two. In fact, the mixed mode performance is still better than NN1 by itself.
This compatibility greatly simplifies upgrade paths. E.g., in a scenario where embedding v1 was computed across a large set of images and a new embedding model v2 is being rolled out, this compatibility ensures a smooth transition without the need to worry about version incompatibilities. Figure 8 shows results on our 3G dataset. It can be seen that the improved model NN2 significantly outperforms NN1, while the comparison of NN2 embeddings to NN1 embeddings performs at an intermediate level.
Figure 9. Learning the Harmonic Embedding. In order to learn a harmonic embedding, we generate triplets that mix the v1 embeddings with the v2 embeddings that are being trained. The semi-hard negatives are selected from the whole set of both v1 and v2 embeddings.
In order to learn the harmonic embedding we mix embeddings of v1 together with the v2 embeddings that are being learned. This is done inside the triplet loss and results in additionally generated triplets that encourage the compatibility between the different embedding versions. Figure 9 visualizes the different combinations of triplets that contribute to the triplet loss.
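The exact mixing scheme is described only at the level of Figure 9; as a rough numpy illustration, the standard triplet loss is simply fed embeddings drawn from both versions (toy data, names hypothetical):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge-style triplet loss on squared L2 distances."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_pos - d_neg + alpha, 0.0)

# Harmonic mixing (sketch): the anchor comes from the v2 model being trained,
# while positives/negatives may be drawn from v1 or v2 embeddings, so the loss
# also penalizes incompatibility between the two embedding versions.
rng = np.random.default_rng(0)
a_v2 = rng.standard_normal((8, 128))
p_v1 = rng.standard_normal((8, 128))
n_v1 = rng.standard_normal((8, 128))
print(triplet_loss(a_v2, p_v1, n_v1).mean())
```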
We initialized the v2 embedding from an independently trained NN2 and retrained the last layer (embedding layer) from random initialization with the compatibility encouraging triplet loss. First only the last layer is retrained, then we continue training the whole v2 network with the harmonic loss.

Figure 10. Harmonic Embedding Space. This visualisation sketches a possible interpretation of how harmonic embeddings are able to improve verification accuracy while maintaining compatibility to less accurate embeddings. In this scenario there is one misclassified face, whose embedding is perturbed to the "correct" location in v2.

Figure 10 shows a possible interpretation of how this compatibility may work in practice. The vast majority of v2 embeddings may be embedded near the corresponding v1 embedding, however, incorrectly placed v1 embeddings can be perturbed slightly such that their new location in embedding space improves verification accuracy.

These are very interesting findings and it is somewhat surprising that it works so well. Future work can explore how far this idea can be extended. Presumably there is a limit as to how much the v2 embedding can improve over v1, while still being compatible. Additionally it would be interesting to train small networks that can run on a mobile phone and are compatible to a larger server side model.

We would like to thank Johannes Steffens for his discussions and great insights on face recognition and Christian Szegedy for providing new network architectures like [16] and discussing network design choices. Also we are indebted to the DistBelief [4] team for their support especially to Rajat Monga for help in setting up efficient training schemes.

Also our work would not have been possible without the support of Chuck Rosenberg, Hartwig Adam, and Simon Han.
This section covers Python mixed-language programming: calling dynamic libraries.
Python is widely used for its readability and rich extension libraries, but it has two drawbacks: 1. the source code is visible; 2. execution is slow. In practice, the performance-critical core code is therefore often written in C/C++ while the business logic stays in Python. That requires mixed-language programming; this section records the method of, and caveats for, calling dynamic libraries from Python.
The Python standard library ships a package for calling dynamic libraries: ctypes. A library can be located by name with ctypes.util.find_library.
Depending on how the dynamic library is invoked, there are two calling conventions, cdecl and stdcall; their main differences are listed in the table below. The examples that follow use the cdecl convention; stdcall works analogously. A short ctypes sketch showing how the convention is selected follows the table.
Calling convention | Stack cleaned up by | Exported name decoration |
---|---|---|
cdecl | Caller | Underscore prefix |
stdcall | Callee | Underscore prefix, plus "@" and the number of parameter bytes appended |
There are two ways to load a library: instantiate the CDLL class, libc = CDLL("libtestlib.dll"), or use the loader, libc = cdll.LoadLibrary("libtestlib.dll"). Both return a handle to the dynamic library for later use. The correspondence between common C parameter types, ctypes types, and Python types is given in the following table:
C type | ctypes type | Python type |
---|---|---|
_Bool | c_bool | bool |
char | c_char | 1-character bytes object |
wchar_t | c_wchar | 1-character string |
char | c_byte | int |
unsigned char | c_ubyte | int |
short | c_short | int |
unsigned short | c_ushort | int |
int | c_int | int |
unsigned int | c_uint | int |
long | c_long | int |
unsigned long | c_ulong | int |
__int64 or long long | c_longlong | int |
unsigned __int64 or unsigned long long | c_ulonglong | int |
size_t | c_size_t | int |
ssize_t or Py_ssize_t | c_ssize_t | int |
float | c_float | float |
double | c_double | float |
long double | c_longdouble | float |
char * (NUL terminated) | c_char_p | bytes object or None |
wchar_t * (NUL terminated) | c_wchar_p | string or None |
void * | c_void_p | int or None |
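Combining the loading step with the type table above, a call to a hypothetical C function int add(int, int) in libtestlib.dll might look like this (library and function names are assumptions, not from the original text):

```python
from ctypes import CDLL, c_int

lib = CDLL("libtestlib.dll")          # handle to the dynamic library

# hypothetical C signature: int add(int a, int b);
lib.add.argtypes = [c_int, c_int]     # arguments are checked at call time
lib.add.restype = c_int               # return value is converted to a Python int

print(lib.add(2, 3))                  # plain Python ints are converted via c_int
```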
- To pass a parameter by reference, use byref(); this is faster than pointer(), which constructs an actual pointer object.
- The memory referenced by c_char_p, c_wchar_p, and c_void_p must not be modified; if a C function needs to modify the buffer, create mutable memory with create_string_buffer().
- The value attribute retrieves the content a pointer points to.
- Structures and unions are defined by subclassing Structure/Union, grouping the members as a list of (name, type) tuples in the class's _fields_ attribute. For other structure features (alignment, nested pointers, bit fields, and so on), refer to the official documentation. For example:
from ctypes import Structure, c_int

class POINT(Structure):
    _fields_ = [("x", c_int),
                ("y", c_int)]

class RECT(Structure):                    # structures can nest
    _fields_ = [("upperleft", POINT),
                ("lowerright", POINT)]
Arrays are defined directly as element type * number of elements, e.g. array_type = c_int * 10.
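For example, a ten-element int array type and an instance of it:

```python
from ctypes import c_int

IntArray10 = c_int * 10          # an array *type*: element type * length
arr = IntArray10(*range(10))     # an instance initialized with 0..9
arr[3] = 42                      # elements support indexing
print(list(arr), len(arr))
```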
Python callback functions that C code can invoke are declared with CFUNCTYPE(restype, *argtypes). It can be used in two ways: as a decorator, or by defining a callback type explicitly and instantiating it.
from ctypes import CFUNCTYPE, c_int

# used as a decorator
@CFUNCTYPE(c_int, c_int)
def py_cb_func(a):
    print("py_cb_func", str(a))
    return a + 1
# or: define the ctypes callback type explicitly
PY_CB_FUNC = CFUNCTYPE(c_int, c_int)

def py_cb_func(a):
    print("py_cb_func", str(a))
    return a + 1

cb_func = PY_CB_FUNC(py_cb_func)
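As a complete, runnable callback example, a comparison function can be passed to the C library's qsort, mirroring the example in the ctypes documentation (libc loading shown for Linux/macOS):

```python
from ctypes import CDLL, CFUNCTYPE, POINTER, c_int, sizeof
from ctypes.util import find_library

libc = CDLL(find_library("c"))     # the standard C library

# C signature of the callback: int (*cmp)(const int *, const int *)
CMPFUNC = CFUNCTYPE(c_int, POINTER(c_int), POINTER(c_int))

@CMPFUNC
def py_cmp(a, b):
    return a[0] - b[0]             # dereference the pointers with [0]

values = (c_int * 5)(5, 1, 7, 33, 99)
libc.qsort.restype = None
libc.qsort(values, len(values), sizeof(c_int), py_cmp)
print(list(values))                # [1, 5, 7, 33, 99]
```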
Declaring the function signature lets ctypes validate calls: libtest.parm_int.argtypes = [c_int] makes ctypes check the argument format when Python calls the function, and libtest.parm_int.restype = c_int declares the return type so the result is converted to the matching Python type. The function is then called as ret = libtest.parm_int(c_int(1)).
I wrote a dedicated example of calling a dynamic library with ctypes; the example consists of two parts.
This section covers how to log to a single file from multiple processes with the logging module.

Python's standard library has included the logging module since Python 2.3. It is the built-in standard module for emitting runtime logs and supports configurable log levels, log file paths, log file rotation, and more. In practice with flask, however, multiple processes writing to the same log file clashed. This section records the solution to that problem.
The logging module lets you configure log levels, output paths, file rotation, and so on; for detailed usage, refer to the reference blog post.
Flask is a lightweight web application framework written in Python that lets us quickly build a website or web service; for usage, refer to the Chinese documentation.

Since version 0.3 Flask has shipped a logger; it is a thin wrapper around Python's standard logging module, so we can record logs with it directly.

Our logging needs to record every network request the server receives, including the request IP, the request payload, and so on, with log files automatically split at 50 MB. After finishing the configuration according to tutorials found online, this error appeared:
```
--- Logging error ---
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\logging\handlers.py", line 72, in emit
    self.doRollover()
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\logging\handlers.py", line 173, in doRollover
    self.rotate(self.baseFilename, dfn)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\logging\handlers.py", line 113, in rotate
    os.rename(source, dest)
PermissionError: [WinError 32] 另一个程序正在使用此文件,进程无法访问。: 'E:\\ml_service_backup\\ml_db\\python\\server\\my.log' -> 'E:\\ml_service_backup\\ml_db\\python\\server\\my.log.1'
```
The WinError 32 message says the log file is being used by another process. In a typical deployment, flask serves network requests from multiple worker processes, so several processes can be handling logs at the same time. Rotating a log file requires copying, renaming, and deleting files, and the logging module itself is not safe when multiple processes write to the same file. As the Python logging documentation puts it:
Because there is no standard way to serialize access to a single file across multiple processes in Python. If you need to log to a single file from multiple processes, one way of doing this is to have all the processes log to a SocketHandler, and have a separate process which implements a socket server which reads from the socket and logs to file. (If you prefer, you can dedicate one thread in one of the existing processes to perform this function.)
Hence the error above.

We need to solve the problem of multiple processes writing to the same log file through the logging module.
concurrent-log-handler is a wrapper written specifically to address the process-unsafety of the logging module. The steps are as follows:
pip install concurrent-log-handler
Create the configuration file logging.ini:
```
##############################################
[loggers]
keys=root

[logger_root]
level=DEBUG
handlers=handler01

##############################################
[handlers]
keys=handler01

[handler_handler01]
;class=handlers.RotatingFileHandler
class=handlers.ConcurrentRotatingFileHandler
level=DEBUG
formatter=form01
args=("my.log", "a", 512, 5)

##############################################
[formatters]
keys=form01

[formatter_form01]
format=%(asctime)s - %(levelname)s - %(message)s
;datefmt=[%Y-%m-%d %H:%M:%S]
##############################################
```
In the code, import the handler before loading the configuration, so that handlers.ConcurrentRotatingFileHandler in logging.ini can be resolved:

from concurrent_log_handler import ConcurrentRotatingFileHandler
logging.config.fileConfig('logging.ini')
app.logger.debug('%s:%s@%s' % (sys._getframe().f_code.co_name, request.remote_addr, json.dumps(request.form)))
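For reference, a minimal programmatic setup without logging.ini might look like the sketch below; the file name, rotation size, and route are illustrative assumptions, not from the original configuration:

```python
import logging
from concurrent_log_handler import ConcurrentRotatingFileHandler
from flask import Flask, request

app = Flask(__name__)

# rotate at ~50 MB, keep 5 backups; the handler serializes writes across processes
handler = ConcurrentRotatingFileHandler("my.log", "a", 50 * 1024 * 1024, 5)
handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
handler.setLevel(logging.DEBUG)
app.logger.addHandler(handler)
app.logger.setLevel(logging.DEBUG)

@app.route("/", methods=["GET", "POST"])
def index():
    # log the caller address and form payload for every request
    app.logger.debug("%s@%s", request.remote_addr, dict(request.form))
    return "ok"
```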
The first solution I found online was ConcurrentLogHandler, but it has been unmaintained since 2013, and the code would not run even after being merged in. I then went looking for other solutions (and wasted quite a bit of time here); in fact, the ConcurrentLogHandler homepage already points to the replacement library.