After taking theory courses, many people are eager to try building their own project. This article starts from the first step and explains how to solve the various problems encountered during project development.
The article consists of six major parts covering the entire process of a deep learning (DL) project. We use an automatic comic-coloring project to illustrate the design, debugging, and parameter-tuning process of deep learning.
The topic of this article is "How to start a deep learning project?", and it is divided into the following six parts:
Part 1: Start a deep learning project
Part 2: Create a deep learning dataset
Part 3: Design a deep learning model
Part 4: Visualize deep network models and metrics
Part 5: Debugging a deep learning network
Part 6: Improve deep learning model performance and tune network parameters
Part 1: Start a deep learning project
What kind of project should be selected?
Many artificial intelligence projects are not that serious and are still very fun to do. In early 2017, I set out to start a project to color Japanese comics as part of my research on Generative Adversarial Networks (GANs). The problem is hard to solve, but very attractive, especially for someone like me who can't draw! When looking for a project, don't limit yourself to incremental improvements: build a marketable product, or create a new model that learns faster with higher quality.
Debugging a deep network (DN) is very tricky
Training a deep learning model involves millions of iterations, so finding bugs is difficult and the process is prone to failure. Start from something simple and make changes step by step. Model optimizations such as regularization can always wait until after the code is debugged. In addition, visualize predictions and model metrics frequently, and make the model run first so there is a baseline to fall back on. Better not to get stuck in a big model trying to get every module right at once.
Measurements and learning
Grand project plans can fail tragically. The first version of most personal projects lasts two to four months, which is very short given that research, debugging, and experimentation take a lot of time. We usually schedule complex experiments to run overnight so that by early next morning we have enough information to take the next step. In the early stages, keeping each experiment under 12 hours is a good rule of thumb. To achieve this, we narrowed the comic-coloring project down to coloring a single animated character. We also needed to design many tests, which we use to analyze the model's shortcomings in our experiments. These tests should not be planned too far ahead: we need to measure, learn quickly, and provide enough feedback for the next step of the design.
Research and Products
When we started discussing the comic-coloring project in spring 2017, Kevin Frans already had a Deepcolor project that used a GAN to add color hints to comics.
When setting the goal, you will spend a lot of effort making sure the project remains meaningful after completion. GAN models are quite complex and, in early 2017, had not reached the quality level required for commercial products. However, if you narrow the application scope down to what the product can handle gracefully, you can raise the quality to a commercial standard. To do this, for any DL project, strike the right balance between model generalization, capacity, and accuracy.
Cost
You must use GPUs to train real models; they are 20 to 100 times faster than CPUs. The lowest-priced Amazon GPU p2.xlarge spot instance asks for about $7.5 per day, while an 8-GPU instance runs about $75 per day. In our comic-coloring project, some experiments took more than two days, so the average cost was at least $150 per week. As for faster AWS instances, the cost can reach $1,500 per week. Instead of cloud computing, we can buy a standalone computer: in February 2018, a desktop with an Nvidia GeForce GTX 1080 Ti cost about $2,200, and it trains a refined VGG model about 5 times faster than a P2 instance.
Timeline
We divide the development into four stages, and the last three stages are carried out in multiple iterations.
Project research
Model design
Implementation and debugging
Experiments and parameter tuning
Project research
We start by studying existing products and exploring their weaknesses. Many GAN-based solutions use spatial color hints; their patterns are sometimes unclear and the colors sometimes bleed. We set a two-month timeframe for our project, with two priorities: generating colors without hints and improving color fidelity. Our goal was:
Color grayscale comics of a single animated character without using spatial color hints.
Standing on the shoulders of giants
Next, we need to understand related research and open-source projects; many people read at least a few dozen papers and projects before starting to practice. For example, when we dug into GANs, we found more than a dozen new GAN models: DRAGAN, cGAN, LSGAN, and so on. Reading research papers can be painful, but it pays off.
Deep learning (DL) code is concise, but defects are hard to spot, and many research papers omit implementation details. Many projects start from an open-source implementation that solves a similar problem, so search for open-source projects. We looked at the code of different GAN variants on GitHub and tested them several times.
Part 2: Create a deep learning dataset
The success of the deep learning project depends on the quality of the dataset. In Part 2 of this article, we will explore the core issues of creating quality training datasets.
Public and Academic Data Sets
For research projects, search the established public datasets first. These datasets provide clean samples and baseline model performance. If multiple public datasets are available, select the one with the most relevant, highest-quality samples for your problem.
Custom dataset
For practical problems, we need samples from the problem domain; first try to find public datasets. Research on creating high-quality custom datasets is still scarce. If nothing is available, search for places where you can crawl data. The candidates are usually numerous, but data quality is often low and sorting it out takes a lot of effort. Before crawling samples, take the time to evaluate all the options and select the most relevant one.
A high-quality dataset should have the following characteristics:
balanced categories;
sufficient data;
high-quality information in data and labels;
very few data and label errors;
relevance to your problem.
Don't crawl all the data at once. We often crawl website samples by tags and categories to obtain data relevant to our problem. The best way is to crawl, then train and test a small set of samples in your model, and improve the crawling based on the lessons learned.
It is very important to clean up the data you crawl; otherwise even the best model design cannot reach human-level performance. Danbooru and Safebooru are two very popular sources of anime characters, but some deep learning applications prefer Getchu for its higher-quality drawings. We downloaded images from Safebooru using a set of tags, visually inspected samples, and ran tests to analyze errors (under-performing samples).
Both model training and visual evaluation provide further information to refine our tag selection. As the iterations continue, we learn more and gradually accumulate samples. We also use a classifier to filter out samples irrelevant to the problem, for example clearing all images in which the character is too small. Compared with academic datasets, a small project collects few samples, so transfer learning can be applied where appropriate. The left picture below is from PaintsChainer, and the right picture was colored by our final model:
We decided to test the algorithm on some of our training samples. The result was unsurprising: the colors applied were sparse and the style was not right.
Because the model had been trained for a while, we knew which kinds of drawings it performs poorly on. As expected, drawings with intricate structures are harder to color.
This shows how important it is to select samples well. As a product, PaintsChainer wisely focuses on the types of line art it is good at. This time I used clean line art picked from the internet, and the result was again a pleasant surprise.
Here are some lessons: there is no good or bad data in itself, only data that cannot meet your needs. Furthermore, as the number of sample categories grows, training and maintaining output quality becomes harder, and deleting irrelevant data can yield a better model.
In the early stages of development, we realized that some drawings have too many intricate structures. Without significantly increasing the model's capacity, these drawings add little value to training, so it is best to leave them out; otherwise they only hurt training efficiency.
Recap
Use public datasets as much as possible;
find the best websites for obtaining high-quality, diverse samples;
analyze errors and filter out samples unrelated to your actual problem;
create your samples iteratively;
balance the number of samples in each category;
organize the samples before training;
collect enough samples, and apply transfer learning if samples are insufficient.
Part 3: Deep Learning Design
Part 3 introduces some high-level deep learning strategies; then we cover the most common design choices in detail. This part may require some basic DL background.
Simple and flexible
Keep the design simple and compact at the beginning. During the learning stage, your mind fills with lots of cool ideas, and you tend to want to code all the details in one go. But that is unrealistic; hoping to beat the state of the art with the first version is not a realistic goal. Start with fewer network layers and less customization, and do the necessary hyperparameter tuning afterwards. All of this needs verification that the loss function is decreasing, so don't waste time on a larger model from the start.
After brief debugging, our model produced simple results after 5,000 iterations. But at least the colors began to stay within fixed areas, and skin tone started to show.
These results tell us whether the model has started coloring at all, which is valuable feedback. So don't start with a big model, or you will spend a lot of time debugging and training it.
Priority and incremental design
First, in order to create a simple design, we need to choose priorities. Decompose a complex problem into smaller problems and solve them step by step. The right strategy in deep learning is to act quickly on what you learn. Before jumping to a no-hint model, we first used a model with spatial color hints; don't jump to the "no hint" design in one step. For example, when we removed the spatial information from the hints, the color quality dropped sharply, so we changed our priorities and refined the model before taking the next step. You will encounter many surprises while designing a model; rather than making a long-term plan that keeps changing, be priority-driven. Use shorter, smaller design iterations to keep the project manageable.
Avoid random improvements
First analyze the weaknesses of your own model instead of making random improvements, such as swapping in bidirectional LSTMs or PReLU. Identify the model's problems from visualized errors (scenarios where performance is extremely poor) and from the performance metrics. Improvements made at random backfire: they increase the training cost proportionally for extremely small returns.
Constraints
We apply constraints to the network design to make training more efficient. Building deep learning models is not simply stacking network layers; adding good constraints can make learning more efficient or smarter. For example, applying an attention mechanism lets the network know where to focus. In a variational autoencoder, we train the hidden factors to follow a normal distribution. In our design, we applied denoising: we zero out a large fraction of the spatial color hints. Ironically, this makes the model learn and generalize better.
Design Details
In the rest of this part, we discuss some common design choices you will encounter in a deep learning project.
Deep Learning Software Framework
Since Google released TensorFlow in November 2015, it has become the most popular deep learning framework in just 6 months. Although it seems unlikely to have competitors in the short term, Facebook released PyTorch a year later, and it has attracted great attention from the research community. By 2018, there were a large number of deep learning platforms available, including TensorFlow, PyTorch, Caffe, Caffe2, MXNet, CNTK, etc.
There is one main reason why some researchers have switched to PyTorch: PyTorch's design puts the end user first, with a simple and intuitive API. Its error messages are easy to understand, and its API documentation is complete. Features such as pre-trained models, data preprocessing, and loaders for common datasets are very popular in PyTorch.
TensorFlow is also great, but so far it has taken a bottom-up approach that makes it extremely complicated. TensorFlow's API is verbose and also harder to debug, and it has about a dozen API models for building deep networks.
As of February 2018, TensorFlow is still the leader, and its developer community is the largest, which is a very important factor. If you want to train a model across multiple machines, or deploy an inference engine to a mobile phone, TensorFlow is the only choice. However, if other platforms focus more on the end user, we can foresee more shifts among small and intermediate projects.
With TensorFlow's development, there are many APIs to choose from when building deep networks. The highest-level API is the Estimator, which provides implicit integration, while TensorBoard provides performance evaluation. The lowest-level API is very verbose and spread across many modules; it is now consolidated with wrapper APIs in the tf.layers, tf.metrics, and tf.losses modules, which make building deep network layers easier.
For researchers who want a more intuitive API, there are also Keras, TFLearn, TF-Slim, and others that run on top of TensorFlow. I suggest choosing a framework that has the pre-trained models and tools (for downloading datasets) you need. In academia, prototyping with the Keras API is quite popular.
Transfer learning
Don't duplicate work. Many deep learning software platforms ship pre-trained models such as VGG19, ResNet, and Inception v3. Training from scratch is time-consuming; as the 2014 VGG paper mentions, "the VGG model was trained with 4 Nvidia Titan Black GPUs, and training a single network took 2-3 weeks depending on the architecture."
Many pre-trained models can be applied to deep learning problems. For example, we can use a pre-trained VGG model to extract image features and feed those features into an LSTM to generate descriptions. Many pre-trained models are trained on the ImageNet dataset. If your target data does not differ much from ImageNet, fix most of the model's parameters and retrain only the last few fully connected layers. Otherwise, retrain the whole network end to end on your training dataset. In both cases, since the model is already pre-trained, retraining needs far fewer iterations. And because training is short, overfitting can be avoided even when the training dataset is not large. This kind of transfer learning is effective across disciplines, for example bootstrapping a Chinese model from a pre-trained English model.
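To make this concrete, here is a minimal transfer-learning sketch in tf.keras, assuming a hypothetical 10-class target task and internet access for downloading the pretrained ImageNet weights; it freezes the VGG19 feature extractor and retrains only a small new head:

```python
import tensorflow as tf

# Load VGG19 without its classification head; weights come from ImageNet.
base = tf.keras.applications.VGG19(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False  # fix the pretrained feature-extractor parameters

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # hypothetical 10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# If the target data differs a lot from ImageNet, set base.trainable = True
# instead and retrain end to end with a small learning rate.
```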
However, such transfer learning only suits problems that need a complex model to extract features. In our project, the samples differ from ImageNet, so we needed to retrain the model end to end. Moreover, we only needed relatively simple latent factors (color), for which the complexity of VGG19 is too high. So we decided to build a new, simpler CNN feature-extraction model.
Cost function
Not all cost functions are equal; the choice affects how hard the model is to train. Some cost functions are fairly standard, but some problem domains require careful thought.
Classification problems: cross entropy, hinge loss (SVM)
Regression: mean squared error (MSE)
Object detection or segmentation: Intersection over Union (IoU)
Policy optimization: KL divergence
Word embedding: Noise Contrastive Estimation (NCE)
Word vectors: cosine similarity
Cost functions that look good in theoretical analysis may not work well in practice. For example, the cost function for a GAN's discriminator network adopts a more practical, experimental approach rather than what looks best in theory. In some problem domains the cost function is part guesswork, part experiment, and it may also combine several cost functions. Our project started with the standard GAN cost function, then added a reconstruction cost using MSE plus other regularization costs. Finding a better cost function remains one of the unsolved problems in our project; we believe it would significantly improve color fidelity.
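As an illustration, here is a minimal sketch of such a composite generator cost in TensorFlow: a standard adversarial term plus an MSE reconstruction term. The weight `recon_weight` is a hypothetical hyperparameter, not a value from our project:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
mse = tf.keras.losses.MeanSquaredError()

def generator_loss(disc_fake_logits, generated, target, recon_weight=100.0):
    # Adversarial term: the generator wants the discriminator to say "real".
    adversarial = bce(tf.ones_like(disc_fake_logits), disc_fake_logits)
    # Reconstruction term: pull the generated image toward the target.
    reconstruction = mse(target, generated)
    return adversarial + recon_weight * reconstruction
```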
Metrics
Good metrics help compare and tune models. For specialized problems, look at the Kaggle platform, which hosts many DL competitions and provides detailed metrics. Unfortunately, in our project it is hard to define an exact formula to measure the accuracy of artistic rendering.
Regularization
L1 and L2 regularization are both common, but L2 regularization is more popular in deep learning.
What are the advantages of L1 regularization? L1 regularization produces sparser parameters, which helps disentangle the underlying representation. Since each non-zero parameter adds a penalty to the cost, L1 prefers zero parameters over the many tiny parameters that L2 regularization favors. L1 regularization makes filters cleaner and easier to interpret, so it is a good choice for feature selection. L1 is also less vulnerable to outliers and works better when the data is not too clean. However, L2 regularization remains more popular because its solutions may be more stable.
Gradient Descent
Always monitor closely whether gradients vanish or explode. Gradient problems have many possible causes that are hard to pin down. Don't jump to learning-rate adjustments or model-design changes too quickly; small gradients may simply be caused by programming bugs, such as input data that is not scaled correctly or weights that are all initialized to zero.
If other possible causes have been ruled out, apply gradient clipping when gradients explode (especially in NLP). Skip connections are a common technique to alleviate the vanishing gradient problem; in ResNet, the residual module allows the input to bypass the current layer to the next layer, which effectively increases the depth of the network.
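For reference, a minimal gradient-clipping sketch in tf.keras; `clipnorm` and `tf.clip_by_global_norm` are standard TensorFlow facilities, and the threshold 1.0 is only an illustrative value:

```python
import tensorflow as tf

# Simplest form: clip each gradient's norm inside the optimizer.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)

# Equivalent manual form inside a custom training step:
#   grads = tape.gradient(loss, model.trainable_variables)
#   grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
#   optimizer.apply_gradients(zip(grads, model.trainable_variables))
```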
Scaling
Scale the input features. We usually scale features to have zero mean within a specific range, such as [-1, 1]. Improper feature scaling is one of the most common causes of exploding or vanishing gradients. Sometimes we compute the mean and variance from the training data to bring the data closer to a normal distribution. When scaling validation or test data, reuse the mean and variance of the training data.
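A minimal NumPy sketch of this rule: the statistics are computed from the training data only and then reused for the validation (or test) data. The arrays here are placeholder dummy data:

```python
import numpy as np

def fit_scaler(x_train):
    # Statistics must come from the training data only.
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0) + 1e-8  # avoid division by zero
    return mean, std

def apply_scaler(x, mean, std):
    # Reuse the training statistics for validation and test data.
    return (x - mean) / std

rng = np.random.default_rng(0)
x_train = rng.normal(5.0, 3.0, size=(1000, 16))  # dummy training features
x_val = rng.normal(5.0, 3.0, size=(200, 16))     # dummy validation features
mean, std = fit_scaler(x_train)
x_val_scaled = apply_scaler(x_val, mean, std)
```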
Batch normalization and layer normalization
Imbalanced node outputs before each layer's activation function are another major source of gradient problems; apply batch normalization (BN) to CNN layers where necessary. A DN learns faster and better when the input data is properly normalized (scaled). In BN, we compute the mean and variance of each spatial position from each batch of training data. For example, with a batch size of 16 and a feature map of spatial dimension 10×10, we compute 100 means and 100 variances (one per position). The mean at each position is the average over the corresponding position of the 16 samples, and we use these means and variances to re-normalize the node output at each position. BN improves accuracy while shortening training time.
However, BN is not effective for RNNs; we use layer normalization instead. In an RNN, the mean and variance from BN are not suitable for re-normalizing the output of RNN cells, likely because of the recurrent nature of the RNN and its shared parameters. In layer normalization, the output is re-normalized by the current sample's own layer output: a layer with 100 elements is re-normalized using only one mean and one variance computed from the current input.
Dropout
We can apply dropout to layers to regularize the model. Dropout's popularity declined after batch normalization rose in 2015. Batch normalization rescales node outputs using the mean and standard deviation, which acts like noise and forces a layer to learn to be more robust to variations in its input. Since batch normalization also helps with the vanishing gradient problem, it has gradually replaced dropout.
The benefit of combining dropout with L2 regularization is domain-specific. Usually, we can test dropout during fine-tuning and collect empirical data to prove its benefit.
Activation function
In DL, ReLU is the most commonly used nonlinear activation function. If the learning rate is too high, many nodes' activations may get stuck at zero. If changing the learning rate doesn't help, we can try leaky ReLU or PReLU. In leaky ReLU, when x < 0 the output is not 0 but follows a small predefined negative slope (such as 0.01, or set by a hyperparameter). Parametric ReLU (PReLU) goes one step further: each node has a trainable slope.
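A small tf.keras sketch contrasting the two (the layer sizes are arbitrary): LeakyReLU takes a fixed slope for x < 0, while PReLU learns the slope during training:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(128,))
h = tf.keras.layers.Dense(64)(inputs)
h = tf.keras.layers.LeakyReLU(0.01)(h)  # fixed, predefined slope for x < 0
h = tf.keras.layers.Dense(64)(h)
h = tf.keras.layers.PReLU()(h)          # the negative slope is trainable
model = tf.keras.Model(inputs, h)
```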
Split the dataset
To test real-world performance, we divide the data into three parts: 70% for training, 20% for validation, and 10% for testing. Make sure samples are fully shuffled within each dataset and within each batch of training samples. During training, we use the training dataset to build models with different hyperparameters, run those models on the validation dataset, and select the most accurate one. But to be safe, we use the 10% test data for a final sanity check. If your test results differ greatly from the validation results, shuffle the data more thoroughly or collect more data.
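A minimal NumPy sketch of the 70/20/10 split with full shuffling; the fixed seed is only there to make the split reproducible:

```python
import numpy as np

def split_dataset(x, y, seed=42):
    """Shuffle fully, then split 70/20/10 into train/validation/test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    x, y = x[idx], y[idx]
    n_train = int(0.7 * len(x))
    n_val = int(0.2 * len(x))
    train = (x[:n_train], y[:n_train])
    val = (x[n_train:n_train + n_val], y[n_train:n_train + n_val])
    test = (x[n_train + n_val:], y[n_train + n_val:])
    return train, val, test
```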
Baseline
Setting a baseline helps us compare models and debug; for example, we can use a VGG19 model as the baseline for a classification problem. Alternatively, we can first adapt an established simple model to our problem. This helps us understand the problem better and establishes a performance baseline for comparison. In our project, we modified an established GAN implementation and redesigned its generative network as the baseline.
Checkpoint
We regularly save the model's outputs and metrics for comparison. Sometimes we want to reproduce a model's results, or reload a model to train it further. Checkpoints let us save models for reloading later. However, if the model design has changed, old checkpoints can no longer be loaded. We use Git tags to track multiple models, so we can reload the model matching a specific checkpoint. Each checkpoint in our design takes 4 GB, so when working in a cloud environment, configure enough storage accordingly. We frequently start and terminate Amazon cloud instances, so we store all files on Amazon EBS for easy reconnection.
Custom layers
Built-in layers in deep learning packages are better tested and optimized. Nevertheless, if you want to implement a custom layer, you need to:
unit-test the forward-propagation and backpropagation code with non-random data;
compare the backpropagation results against a naive gradient check (as in the sketch after this list);
add a small ϵ to denominators, or use logarithmic computation, to avoid NaN values.
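Here is a minimal sketch of such a naive gradient check in NumPy: a central finite-difference gradient computed on deterministic data, which can be compared against what your backpropagation code produces:

```python
import numpy as np

def numeric_gradient(f, x, eps=1e-5):
    """Central finite differences: (f(x+eps) - f(x-eps)) / (2*eps) per entry."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"], op_flags=["readwrite"])
    while not it.finished:
        i = it.multi_index
        orig = x[i]
        x[i] = orig + eps; f_plus = f(x)
        x[i] = orig - eps; f_minus = f(x)
        x[i] = orig  # restore the original value
        grad[i] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad

# Check d/dx of sum(x**2) == 2x on fixed, non-random data.
x = np.array([[1.0, 2.0], [3.0, -4.0]])
assert np.allclose(numeric_gradient(lambda v: np.sum(v ** 2), x), 2 * x, atol=1e-6)
```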
Reproducibility
One of the major challenges in deep learning is reproducibility. During debugging, it is hard to make progress if the initial model parameters keep changing between sessions. So we explicitly set the seeds of all random generators; in our project we set seeds for Python, NumPy, and TensorFlow. During fine-tuning we turn seed initialization off, so each run produces a different model. To reproduce a model's results, we checkpoint it and reload it later.
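A minimal seed-initialization sketch (the API names assume Python 3 with NumPy and TensorFlow 2; in TensorFlow 1.x the last call is tf.set_random_seed):

```python
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42
os.environ["PYTHONHASHSEED"] = str(SEED)  # hashing inside Python itself
random.seed(SEED)         # Python's built-in random generator
np.random.seed(SEED)      # NumPy
tf.random.set_seed(SEED)  # TensorFlow 2 (TF 1.x: tf.set_random_seed)
```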
Optimizer
The Adam optimizer is one of the most popular optimizers in deep learning. It suits many problems, including models with sparse or noisy gradients. It is easy to fine-tune and achieves good results quickly; in fact, the default parameter configuration usually works well. The Adam optimizer combines the advantages of AdaGrad and RMSProp: it maintains a learning rate for each parameter and adapts it separately as learning progresses. Adam is momentum-based and uses the gradient's history, so gradient descent runs more smoothly, and the parameter-oscillation problem caused by large gradients and large learning rates is suppressed.
Tuning the Adam optimizer
Adam has 4 configurable parameters:
learning rate (default 0.001);
β1: the exponential decay rate for the first moment estimate (default 0.9);
β2: the exponential decay rate for the second-moment estimate (default 0.999); set it close to 1 for sparse-gradient problems;
ϵ (default 1e-8): a small value used to avoid division by zero.
β (momentum) smooths the gradient descent by accumulating the gradient's history. For early development, the default settings usually work well; otherwise, the parameter most likely to need changing is the learning rate.
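The four parameters map directly onto the optimizer's constructor; a tf.keras sketch with the defaults spelled out:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,  # the parameter most likely to need tuning
    beta_1=0.9,           # decay rate for the first-moment estimate
    beta_2=0.999,         # second moment; keep close to 1 for sparse gradients
    epsilon=1e-8,         # small constant that avoids division by zero
)
```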
Summary
The following is a brief summary of the main steps of a deep learning project:
• Define the task (object detection, colorization of line art)
• Collect the dataset (MS COCO, public websites)
◦ Search for academic datasets and baselines
◦ Build your own (from Twitter, news, websites, …)
• Define the metrics
◦ Search for established metrics
• Clean and preprocess the data
◦ Select features and transform data
◦ One-hot vectors, bag of words, spectrograms, etc.
◦ Bucketize, logarithmic scale
◦ Remove noise or outliers
◦ Remove invalid and duplicate data
◦ Scale or whiten data
• Split the dataset for training, validation and testing
◦ Visualize data
◦ Validate the dataset
• Establish a baseline
◦ Compute metrics for the baseline
◦ Analyze errors for areas of improvement
• Select the network structure
◦ CNN, LSTM, …
• Implement a deep network
◦ Code debugging and validation
◦ Parameter initialization
◦ Compute loss and metrics
◦ Choose hyper-parameters
◦ Visualize, validate and summarize results
◦ Analyze errors
◦ Add layers and nodes
◦ Optimization
• Hyper-parameter fine-tuning
• Try out model variants
Part 4: Visualize deep neural network models and metrics
When troubleshooting deep neural networks, people tend to jump to conclusions too quickly and too early. Before knowing how to troubleshoot, we must first think about what to look for, otherwise we may spend hours chasing faults blindly. In this part we discuss how to visualize deep learning models and performance metrics.
TensorBoard
It is important to track every action and check the results at each step. With pre-built packages such as TensorBoard, visualizing models and performance metrics is simple, and the reward is almost immediate.
Data visualization (input, output)
Verify the model's inputs and outputs. Before feeding data to the model, save some training and validation samples for visual verification. Undo the data preprocessing and rescale the pixel values back to [0, 255]. Check several batches to make sure we are not repeating the same batch of data. The lower-left image shows some training samples; the lower-right shows validation samples.
Sometimes it is useful to check the histogram of the input data. Ideally it should be centered at 0 and range from -1 to 1. If features are on different scales, gradients will either vanish or explode (depending on the learning rate).
Periodically save the outputs of the model for verification and error analysis; for example, verify whether the colors in the output are slightly too light.
Metrics (loss & accuracy)
We regularly log the loss and accuracy, and plot them to analyze long-term trends. The figure below shows the accuracy and the cross-entropy loss displayed in TensorBoard.
Plotting the loss helps us tune the learning rate: any prolonged rise in the loss indicates that the learning rate is too high, while a rate that is too low makes learning slow.
Here is another real example where the learning rate was too high: we can see a sudden rise in the loss (probably caused by a sudden jump in the gradient).
We use the accuracy plot to tune the regularization factor. If there is a large gap between validation and training accuracy, the model is overfitting. To alleviate overfitting, increase the regularization factor.
Summary
Weights & biases: we monitor weights and biases closely. The figure below shows the layer-1 weights and biases at different training iterations. Large (positive or negative) weights are abnormal; a normally distributed weight histogram suggests training is going smoothly (though not necessarily).
Activations: for gradient descent to perform best, node outputs before the activation function should be normally distributed. If they are not, we can apply batch normalization to convolutional layers or layer normalization to RNN layers. We also monitor the number of dead nodes (zero activations) after the activation function.
Gradients: we monitor the gradients of each layer to identify one of the most serious deep learning problems: vanishing or exploding gradients. If gradients drop rapidly from the rightmost layer to the leftmost, the network has a vanishing gradient problem.
This one is less common: we can visualize the CNN filters, which reveal the types of features the model extracts. As in the figure below, the first two convolutional layers detect edges and colors.
For a CNN, we can also look at what the feature maps learn. The figure below shows the 9 images (right) with the highest activations for a particular feature map; it also uses a deconvolutional network to reconstruct spatial images from the feature map (left).
Visualizing and Understanding Convolutional Networks, Matthew D Zeiler et al.
This kind of image reconstruction is rarely done. But in generative models, we often vary one latent factor while holding the others constant, to verify whether the model is learning something intelligent.
Dynamic Routing Between Capsules, Sara Sabour, Nicholas Frosst, Geoffrey E Hinton
Part 5: Debugging a deep learning network
Problem solving steps for deep learning
In the early development stage we confront multiple problems at once, and as mentioned earlier, deep learning training consists of millions of iterations in which finding bugs is hard and failure is easy. Start simple and make changes gradually; model optimizations such as regularization can wait until the code is debugged. Check the model in a functionality-first way:
set the regularization factors to 0;
turn off other regularization (including dropout);
use the default Adam optimizer;
use ReLU;
do not use data augmentation;
use fewer deep network layers;
scale the input data, but skip unnecessary preprocessing;
do not waste time on long training iterations or large batch sizes.
Overfitting the model on a small amount of training data is the best way to debug deep learning. If the loss does not drop within a few thousand iterations, debug the code further. When the accuracy beats blind guessing, you have reached the first milestone. Then make subsequent modifications to the model: add network layers and customization; start training with the full training data; and control overfitting by increasing regularization while monitoring the accuracy gap between the training and validation datasets.
If it gets stuck, remove everything and start with smaller problems.
Initialize hyperparameters
Many hyperparameters are more relevant to model optimization. Turn them off or use their default values. Use the Adam optimizer: it is fast, efficient, and comes with a good default learning rate. Early problems come mainly from bugs rather than from model design or fine-tuning. Before fine-tuning, go through the checklist below; its items cover common problems and are easy to check. If the loss still does not drop, adjust the learning rate: if the loss drops too slowly, increase the learning rate by a factor of 10; if the loss rises or the gradient explodes, decrease it by a factor of 10. Repeat until the loss gradually decreases. Typical learning rates range from 1 to 1e-7.
Checklist
Data:
visualize and check the input data (after preprocessing, just before feeding the model);
check the accuracy of the input labels (after data shuffling, if applicable);
do not feed the same batch of data over and over;
scale the input data appropriately (generally to the interval (-1, 1), with zero mean);
check the output ranges (for example, within the interval (-1, 1));
always rescale the validation/test sets with the training set's mean/variance;
make sure all the model's input data has the same dimensions;
assess the overall quality of the dataset (whether there are too many outliers or bad samples).
Model:
initialize the model parameters properly; do not set the weights to 0;
debug the layers in which gradients vanish or explode (checking from the rightmost layer to the leftmost);
debug the layers in which most weights are 0 or too large;
check and test the loss function;
for pre-trained models, make sure the input data range matches the range used by the model;
always turn dropout off for inference and testing.
Weight initialization
Initializing all weights to 0 is the most common mistake; a deep network with all-zero weights learns nothing. Weights should be initialized according to a Gaussian distribution.
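For illustration, a tf.keras sketch using a scaled-Gaussian initializer (He initialization, a common default for ReLU layers); note that zero biases are fine, but zero weights are not:

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(
    64,
    kernel_initializer=tf.keras.initializers.HeNormal(),  # scaled Gaussian
    bias_initializer="zeros",  # zero biases are fine; zero weights are not
)
```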
Scaling and normalization
Scaling and normalization are well understood, yet they remain among the most underestimated problems. The model is much easier to train if both the input features and the node outputs are normalized. If this is done incorrectly, the loss will not drop no matter the learning rate. We should monitor the histograms of the input features and of each layer's node outputs. Scale the inputs appropriately. For node outputs, the perfect shape is zero mean with values that are not too large (positive or negative). If not, and you encounter gradient problems in a layer, apply batch normalization to convolutional layers and layer normalization to RNN cells.
Loss function
Check and test the correctness of your loss function. A model's loss must be lower than that of random guessing; for example, in a 10-class classification problem, the cross entropy of random guessing is -ln(1/10).
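A two-line sanity check of that bound, assuming uniform random guessing over 10 classes:

```python
import numpy as np

n_classes = 10
random_guess_loss = -np.log(1.0 / n_classes)
print(random_guess_loss)  # ~2.302; a trained model's loss must fall below this
```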
Analyze errors
Check areas where performance is poor (the errors), improve them, and visualize the errors. In our project, the model performed poorly on images with highly entangled structures. For example, we could add more convolutional layers with smaller filters to disentangle small features. If necessary, augment the data or collect more similar samples to train the model better. In some scenarios you may instead want to remove such samples and restrict yourself to a more focused model.
Regularization fine-tuning
Turn off regularization (making the model overfit) until the model makes reasonable predictions.
Once the model code works, the next parameters to tune are the regularization factors. We increase the volume of training data, then increase regularization to narrow the gap between training and validation accuracy. Don't overdo it: we still want the model to overfit slightly. Monitor the data loss and the regularization loss closely; in the long run, the regularization loss should not dominate the data loss. If large regularization cannot narrow the gap between the two accuracies, debug the regularization code or method first.
As with the learning rate, we change test values on a logarithmic scale, for example by a factor of 10 at the start. Note that each regularization factor may be on a completely different order of magnitude, and we may adjust these parameters back and forth.
Multiple loss functions
In your first implementation, avoid using multiple data loss functions. The weight of each loss function may be of a different order of magnitude and takes effort to tune. With only one loss function, we only need to care about the learning rate.
Frozen variables
When using pre-trained models, we can freeze the model parameters of specific layers to speed up computation. Double-check that no variables are frozen by mistake.
Unit testing
Though rarely mentioned, we should unit-test the core modules so the implementation stays robust as the code changes. Checking a layer's output is not easy when its parameters are initialized with a randomizer, but we can mock the input data and check the outputs. For each module (layer), we can check:
the shapes of the training and inference outputs;
the number of trainable variables (not the number of parameters).
Dimension mismatches
Keep track of tensor (matrix) shapes and document them in the code. For a tensor of shape [N, channel, W, H], if W (width) and H (height) have the same value, code that swaps the two raises no error. We should therefore unit-test the code with asymmetric shapes, for example using a [4, 3] tensor instead of [4, 4].
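A minimal sketch of such a test; `transpose_hw` is a hypothetical op under test that swaps W and H in a [N, channel, W, H] tensor. With W equal to H a mix-up would slip through; with W=4, H=3 the shape assertion catches it:

```python
import numpy as np

def transpose_hw(t):
    """Hypothetical op under test: swap the W and H axes of [N, C, W, H]."""
    return np.transpose(t, (0, 1, 3, 2))

t = np.zeros((2, 3, 4, 3))                    # asymmetric: W=4, H=3
assert transpose_hw(t).shape == (2, 3, 3, 4)  # a W/H mix-up would fail here
```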
Part 6: Improve deep learning model performance and tune network parameters
Improve model capacity
To increase model capacity, we can gradually add layers and nodes to the deep network (DN). Deeper layers produce more complex models. We can also reduce the filter size: smaller filters (3×3 or 5×5) usually perform better than larger ones.
The tuning process is guided by practice rather than theory. We gradually add layers and nodes to overfit the model, since we can then tone it down with regularization. We repeat the iteration until the accuracy improvement no longer justifies the drop in training and computation performance.
However, the GPU's memory is limited. As of early 2018, the high-end graphics card NVIDIA GeForce GTX 1080 TI had 11GB of memory. The maximum number of hidden nodes between two affine layers is limited by the memory size.
For very deep networks, the vanishing gradient problem is serious. We can add skip connections (similar to the residual connections in ResNet) to alleviate it.
Model & dataset design changes
The following is a checklist for improving performance:
analyze errors (poor predictions) in the validation dataset;
monitor the activations; consider batch normalization or layer normalization when activations are not zero-centered or not normally distributed;
monitor the proportion of dead nodes;
apply gradient clipping (especially in NLP tasks) to control exploding gradients;
shuffle the dataset (manually or programmatically);
balance the dataset (each category with a similar number of samples).
We should closely monitor the activation histograms before the activation functions. If their scales differ greatly, gradient descent will be ineffective; apply normalization. If the deep network has a large number of dead nodes, track the issue further: it may be caused by bugs, weight initialization, or vanishing gradients. If it is none of those, experiment with more advanced ReLU variants such as leaky ReLU.
Dataset collection & cleaning
If you want to build your own dataset, the best advice is to study carefully how to collect samples: find the best sources, filter out all data irrelevant to your problem, and analyze the errors. In our project, images with highly entangled structures performed very poorly. We could change the model by adding convolutional layers with small filters, but the model was already hard to train. We could add more entangled samples for further training, but there were already a lot... Another way: we can refine the project's scope and narrow the sample scope.
Data augmentation
Collecting labeled data is expensive. For images, we can use data augmentation such as rotation, random cropping, and shifting to modify existing data and generate more of it. Color distortions include hue, saturation, and exposure shifts.
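A minimal augmentation sketch using Keras preprocessing layers (the layer names assume tf.keras 2.6 or later; in earlier versions they live under tf.keras.layers.experimental.preprocessing). The factors are illustrative only:

```python
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.05),         # small random rotation
    tf.keras.layers.RandomTranslation(0.1, 0.1),  # random shift
    tf.keras.layers.RandomZoom(0.1),              # zoom acts like a random crop
    tf.keras.layers.RandomContrast(0.2),          # a simple color distortion
])
# images: a float batch of shape [N, H, W, C]
# augmented = augment(images, training=True)
```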
Semi-supervised learning
We can also supplement the training data with unlabeled data: use the model to classify it, then add the samples with high-confidence predictions to the training set, with their predicted labels.
Tuning
Learning rate tuning
Let's first briefly review how to tune the learning rate. In the early stages of development, we turn off any non-critical hyperparameters or set them to 0, including regularization. With the Adam optimizer, the default learning rate usually performs well. If we are confident in the code but the loss does not drop, we tune the learning rate. Typical learning rates are between 1 and 1e-7. Lower the rate by a factor of 10 each time and test it over short iterations, monitoring the loss closely. If the loss consistently rises, the learning rate is too high; if it does not drop, the rate is too low. Adjust the learning rate until the loss decreases smoothly.
Here is a real example showing a learning rate that was too high, causing a sudden rise in cost:
In a practice that is not used often, people monitor the ratio between the weight updates and the weights W themselves (a small monitoring sketch follows this list):
if the ratio is > 1e-3, consider lowering the learning rate;
if the ratio is < 1e-3, consider raising the learning rate.
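A minimal NumPy sketch of this monitoring, comparing the weights before and after one update:

```python
import numpy as np

def update_ratio(w_before, w_after):
    """Norm of the parameter update divided by the norm of the parameters.
    Rule of thumb: this ratio should be around 1e-3."""
    update = np.linalg.norm(w_after - w_before)
    return update / (np.linalg.norm(w_before) + 1e-12)
```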
Hyperparameter adjustment
After the model design is stable, we can also further adjust the model. The most frequently adjusted hyperparameters are:
mini-batch size;
learning rate;
regularization factor;
specific layer hyperparameters (such as dropout).
Mini-batch Size
The usual batch size is 8, 16, 32 or 64. If the batch size is too small, gradient descent is not smooth, the model learns slowly, and the loss may oscillate. If the batch size is too large, one training iteration (one round of updates) takes too long with relatively small gains. In our project, we lowered the batch size because each training iteration took too long. We closely monitored the overall learning speed and the loss; if the loss oscillates violently, we know the batch size has been reduced too much. Batch size interacts with hyperparameters such as the regularization factors, so once we determine the batch size, we usually lock the value.
Learning rate & regularization factor
We can further tune the learning rate and regularization factor with the method above: monitor the loss to tune the learning rate, and monitor the gap between validation and training accuracy to tune the regularization factor. Instead of changing the learning rate by a factor of 10, we change it by a factor of 3 (or even less during fine-tuning).
Tuning is not a linear process. Hyperparameters are related, and we tune them back and forth. Learning rate and regularization factor are highly correlated and sometimes need to be tuned together. Don't fine-tune too early; it may be a waste of time, and if the design changes, the effort is wasted.
The dropout rate is usually between 20% and 50%. Start with 20%; if the model overfits, increase the value.
Other tunings:
sparsity;
activation functions.
Sparsity in model parameters makes computation optimization simpler and reduces energy consumption (crucial for mobile devices). If needed, we can replace L2 regularization with L1 regularization. ReLU is the most popular activation function; for some deep learning competitions, people use more advanced ReLU variants to improve accuracy. In some scenarios they can also reduce the number of dead nodes.
Advanced tuning
Some more advanced fine-tuning methods:
Learning rate decay scheduling
Momentum
Early stopping
Instead of a fixed learning rate, we lower the learning rate on a schedule. The hyperparameters are the frequency and magnitude of the drop; for example, reduce the learning rate by a factor of 0.95 every 100,000 iterations. To tune these parameters, monitor the cost and verify that it drops faster without flattening out prematurely.
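This maps directly onto a staircase exponential-decay schedule in tf.keras; the numbers below mirror the example above:

```python
import tensorflow as tf

# Multiply the learning rate by 0.95 every 100,000 iterations.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001,
    decay_steps=100_000,
    decay_rate=0.95,
    staircase=True,  # drop in steps rather than continuously
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```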
Advanced optimizers use momentum to smooth the gradient descent. The Adam optimizer has two momentum settings, controlling the first-order (default 0.9) and second-order (default 0.999) moments. For problem domains with steep gradients, such as NLP, we can slightly increase the momentum value.
When the validation error keeps rising, we can alleviate overfitting by stopping training.
However, this is just a conceptual visualization: the real validation error may rise temporarily and then drop again. We can checkpoint the model regularly, record the corresponding validation errors, and select the best model afterwards.
Grid search
Some hyperparameters are highly correlated; we should tune them together, using a grid of possible values on a logarithmic scale. For example, for two hyperparameters λ and γ, we start from their initial values and shrink each by a factor of 10 at every step:
λ: (1e-1, 1e-2, … , 1e-8);
γ: (1e-3, 1e-4, … , 1e-6).
The corresponding grid is [(1e-1, 1e-3), (1e-1, 1e-4), … , (1e-8, 1e-5), (1e-8, 1e-6)].
Instead of using the exact intersection points, we shift the points slightly at random; this randomness may reveal hidden properties. If the best point lies on the boundary of the grid (the blue point), we retest in the boundary region.
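A minimal sketch of building such a jittered log-scale grid in Python; the jitter range of ±0.3 decades is an arbitrary illustrative choice:

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
lambdas = [10.0 ** -k for k in range(1, 9)]  # 1e-1 ... 1e-8
gammas = [10.0 ** -k for k in range(3, 7)]   # 1e-3 ... 1e-6

grid = []
for lam, gam in itertools.product(lambdas, gammas):
    # Jitter each grid point slightly in log space instead of using
    # the exact intersections; the randomness may expose hidden structure.
    lam_j = lam * 10.0 ** rng.uniform(-0.3, 0.3)
    gam_j = gam * 10.0 ** rng.uniform(-0.3, 0.3)
    grid.append((lam_j, gam_j))
```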
Grid search is computationally intensive; for smaller projects it is used sparingly. We start tuning coarse-grained parameters with fewer iterations, and in the later fine-tuning phase we use longer iterations and change the values by a factor of 3 (or less).
Model ensembles
In machine learning, we can take votes from multiple decision trees to make a prediction. This works because mistakes are often local: the chance of two models making the same mistake is small. In deep learning, each training run starts from random guesses (if the random seed is not explicitly set), and the optimized models are not unique either. We can use the validation dataset to pick the best-performing model over multiple runs, or let multiple models vote and output the final prediction. This requires multiple sessions and is certainly resource-hungry. Alternatively, we can train once, checkpoint multiple models, and select the best of them. With ensembled models, predictions can be based on:
one "vote" per model;
weighted votes based on each prediction's confidence.
Model ensembling is very effective at improving prediction accuracy for some problems, and it is often adopted by teams in deep learning competitions.
Model improvement
Beyond fine-tuning, we can also try different model variants to improve performance. For example, we might consider using a color generator to partially or completely replace the standard LSTM. The concept is not unfamiliar: we can draw pictures in steps.
Intuitively, introducing a time-series method into image generation tasks is advantageous, as demonstrated in DRAW: A Recurrent Neural Network For Image Generation.
Significant performance improvements usually come from changes in model design. That said, fine-tuning a model can sometimes also improve its performance. The final judgment may depend on your benchmark results for the corresponding task.
Kaggle
During development you will have simple questions such as: do I need to use leaky ReLU? Sometimes the question is simple, yet you can never find the answer anywhere. Some papers show the superiority of leaky ReLU, while experience from other projects shows no performance improvement. There are too many projects with too many variables, and validated results measuring the many possibilities are scarce. Kaggle is an open platform for data science competitions, in which deep learning plays an important part. By studying the methods of top competitors in depth, you may find the most common performance indicators. Moreover, some competition teams open-source their code (called kernels). Explore carefully, and Kaggle can be a great source of information.
Experimental Framework
Deep learning development requires a great deal of experimentation, and tuning hyperparameters is tedious. Creating an experimental framework can speed up the process; for example, some people develop code that externalizes model definitions into strings for tuning. However, such efforts usually do not pay off for small teams: in my experience, the loss of code simplicity and traceability outweighs the benefit, making simple code modifications difficult. Easy-to-read code must be concise and flexible. By contrast, many AI cloud offerings have begun to provide automatic hyperparameter tuning. Although this technology is still in its infancy, frameworks that remove the need for hand-written tuning code should be the general trend; keep an eye on it.
Conclusion
Now you have a tuned model that can be formally deployed. I hope this series of tutorials helps you. Deep learning can help us solve many problems, and its scope is beyond your imagination. Want to use deep learning to replace front-end design? Try pix2code!
Original links:
https://medium.com/@jonathan_hui/how-to-start-a-deep-learning-project-d9e1db90fa72
https://medium.com/@jonathan_hui/build-a-deep-learning-dataset-part-2-a6837ffa2d9e
https://medium.com/@jonathan_hui/deep-learning-designs-part-3-e0b15ef09ccc
https://medium.com/@jonathan_hui/visualize-deep-network-models-and-metrics-part-4-9500fe06e3d0
https://medium.com/@jonathan_hui/debug-a-deep-learning-network-part-5-1123c20f960d
https://medium.com/@jonathan_hui/improve-deep-learning-models-performance-network-tuning-part-6-29bf90df6d2d