Selected from Medium
Author: Florian Ernst
Compiled by Heart of the Machine
Edited by Xiaozhou, Chen Ping
Training got slower after switching to Lightning. Have you run into this situation?
PyTorch Lightning is a tool for refactoring PyTorch code: it factors out the complex and repetitive parts, making AI research scalable and fast to iterate on. Recently, however, a blogger named Florian Ernst discovered a bug in PyTorch Lightning that made training, which should have been accelerated, slower instead.
Florian Ernst, the author of this article
Ernst wrote a detailed blog post describing how he discovered this bug. What follows is the original text of the blog.
Two weeks ago, I refactored some deep learning code with PyTorch Lightning, expecting a speedup of roughly 1.5x. Instead, training, evaluation, and testing slowed down to a quarter of their original speed. The refactored network would have taken several days to produce results, so I wanted to find out why and reduce the training time as much as possible.
Here is what happened. I was using some open-source deep learning code that showcases the latest architecture for a certain machine learning task. The code itself was neither clean nor optimized. I noticed a few places that could be sped up and refactored it into cleaner PyTorch code, which made training about 3 times faster.
But I thought there was still room for improvement. PyTorch Lightning is a very good tool: it removes a lot of boilerplate code and comes with some built-in optimizations, so I decided to refactor the code with Lightning.
I had hoped for a speedup of about 1.5x, but when I finished the refactoring I was surprised to find that the iteration time had gone from 4 seconds to 15 seconds, nearly 3 times more training time. What was going on?
I first ran Lightning's profiler to find out where the problem was.
The basic profiler gave me a starting point: most of the time was spent running an epoch; the advanced profiler did not give me any more information.
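For reference, here is a minimal, self-contained sketch of how the simple and advanced profilers are enabled through the Trainer; the tiny model and random data are made up purely for illustration.

import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

train_loader = DataLoader(
    TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,))),
    batch_size=64,
)

# profiler="simple" reports per-hook timings (including run_training_epoch);
# profiler="advanced" wraps cProfile for function-level detail.
trainer = pl.Trainer(max_epochs=2, profiler="simple")
trainer.fit(TinyModel(), train_loader)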
I wondered whether I had misconfigured some hyperparameters of the neural network. I fiddled with a few of them, and the training speed did not change at all.
Then I tweaked the data loader and found that changing the number of jobs, n_jobs, did affect the total training time. However, instead of speeding the computation up, it slowed it down.
Time spent on 100 epochs as the number of jobs varies.
Disabling multiprocessing entirely with n_jobs=0 made my iterations almost 2 times faster than using 6 cores. By default, PyTorch kills its worker processes between two epochs and recreates them, which means the dataset has to be reloaded.
In my case, loading the dataset was very slow. I set the persistent_workers parameter of the DataLoader to True so that the workers are not killed and the data does not have to be reloaded.
# My DataLoader parameters
DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=n_workers,
    persistent_workers=True,
    pin_memory=True,
)
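To make the cost concrete, here is a hypothetical micro-benchmark (the dataset and all sizes are invented for illustration) that compares epoch times with and without persistent workers; when persistent_workers=False, the worker startup and dataset-loading cost is paid again at every epoch.

import time
import torch
from torch.utils.data import DataLoader, TensorDataset

def time_epochs(persistent: bool, n_epochs: int = 3) -> float:
    dataset = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 2, (10_000,)))
    loader = DataLoader(
        dataset,
        batch_size=64,
        shuffle=True,
        num_workers=4,
        persistent_workers=persistent,
    )
    start = time.perf_counter()
    for _ in range(n_epochs):  # workers are restarted every epoch unless persistent
        for _batch in loader:
            pass
    return time.perf_counter() - start

if __name__ == "__main__":  # guard required because the loader spawns worker processes
    for persistent in (False, True):
        print(f"persistent_workers={persistent}: {time_epochs(persistent):.2f}s")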
So there were two possibilities:
- PyTorch Lightning kills the workers and ignores the persistent_workers parameter;
- the problem lies elsewhere.
I created an issue on GitHub to make the Lightning team aware of the problem, and then set out to find the root cause myself.
GitHub Address: https://github.com/PyTorchLightning/pytorch-lightning/issues/10389
Looking for the root of the problem
The Lightning profiler runs with a context manager and measures the time spent in a given block. It makes it easy to search for a specific profiler action, such as "run_training_epoch".
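As an illustration of that mechanism (assuming pytorch-lightning 1.5.x, where SimpleProfiler lives in pytorch_lightning.profiler; the action name and timed block below are arbitrary):

import time
from pytorch_lightning.profiler import SimpleProfiler

profiler = SimpleProfiler()

# Time an arbitrary block under a named action, the same mechanism Lightning
# uses internally for actions such as "run_training_epoch".
with profiler.profile("my_action"):
    time.sleep(0.1)

print(profiler.summary())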
I started exploring the Lightning source code to check the instructions that were slowing the loop down, and I found some problems: Loop.run calls Loop.on_run_start, and Loop.on_run_start reloads the dataloader, as shown in the figures below:
Loop.run calls Loop.on_run_start…
Loop.on_run_start reloads the dataloader
The problem did seem to come from reloading the DataLoader at every epoch. Looking at the DataLoader's source code, I found the following:
When iterating over a DataLoader with persistent_workers enabled (and num_workers > 0), the entire dataset is reloaded via _get_iterator() only if _iterator is None; otherwise the existing iterator and its workers are reused. What was certain is that PyTorch Lightning was wrongly resetting _iterator, and that was causing the problem.
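Rather than copying the source here, the caching behavior can be seen with a small demonstration that inspects the private _iterator attribute (private, so this is for illustration only and may change between PyTorch versions):

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    dataset = TensorDataset(torch.arange(8).float())
    loader = DataLoader(dataset, batch_size=2, num_workers=2, persistent_workers=True)

    print(loader._iterator)             # None before the first epoch
    for _ in loader:                    # first epoch: workers spawned, dataset loaded
        pass
    first = loader._iterator
    print(first is not None)            # True: iterator and its workers are kept alive

    for _ in loader:                    # second epoch: the cached iterator is reused
        pass
    print(loader._iterator is first)    # True, unless something resets _iterator to None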
To confirm this finding, I replaced the DataLoader with a custom one that overrides the __iter__ method:
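The author's exact replacement is shown in the original post; a rough reconstruction of such a subclass (the class name and print statements are mine) could look like this:

from torch.utils.data import DataLoader

class VerboseDataLoader(DataLoader):
    """DataLoader that reports the state of the private _iterator attribute."""

    def __iter__(self):
        print(f"_iterator before __iter__: {self._iterator}")
        iterator = super().__iter__()
        print(f"_iterator after __iter__:  {self._iterator}")
        return iterator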
As expected, the _iterator attribute was set correctly after iterating, but it was reset to None before the start of the next epoch.
n_jobs=1, persistent_workers=True
Now I just needed to know when the attribute was set to None so I could find the root of the problem. I tried using a debugger, but the program crashed because of the multiprocessing and CUDA involved, so I fell back on Python's getter & setter mechanism instead:
Printing a stack trace whenever DataLoader._iterator is set to None
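A rough reconstruction of that kind of patch (the helper names are mine, and the property is installed on the class purely for debugging):

import traceback
from torch.utils.data import DataLoader

def _get_iterator_attr(self):
    return self.__dict__.get("_iterator_value")

def _set_iterator_attr(self, value):
    if value is None:
        traceback.print_stack()  # show who is resetting the iterator
    self.__dict__["_iterator_value"] = value

# Debug-only monkey patch: route every access to DataLoader._iterator through
# the getter/setter above (note: the setter also fires once in DataLoader.__init__).
DataLoader._iterator = property(_get_iterator_attr, _set_iterator_attr)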
This worked very well and produced the following output:
File "trainer\trainer.py", line 1314, in _run_trainself.fit_loop.run()...File "loops\fit_loop.py", line 234, in advance.epoch_loop.run(data_fetcher)File "loops\base.py", line 139, in rund self.on_run_start(*args, **kwargs)File "loops\epoch\training_epoch_loop.py", line 142, in on_run_startself._dataloader_iter = _update_dataloader_iter(...)File "loops\utilities.py", line 121, in _update_dataloader_iterdataloader_iter = enumerate(data_fetcher, batch_idx)File "utilities\fetching.py", line 198, in __iter__self.reset()File "utilities\fetching.py", line 212, in resetself.dataloader.reset()...File "trainer\supporters.py", line 498, in _shutdown_workers_and_reset_iteratordataloader._iterator = None
The trace shows that DataLoader.reset is called at the start of every run. Digging into the code, I found that the DataFetcher is reset at every iteration, which in turn resets the DataLoader. There was no condition in the code to avoid this: the DataLoader was reset at every single epoch.
This was the root cause of the slow iterations I had been seeing.
Fixing the bug
Now that the bug had been found, it needed fixing. The fix was very simple: I removed the self.reset line from the DataFetcher's __iter__ method:
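In outline, the change has roughly the following shape (a simplified sketch, not the exact upstream code in utilities/fetching.py):

# Simplified sketch of the shape of the fix: the reset that used to run at the
# start of every __iter__ is gone, so the DataLoader and its persistent workers
# are no longer torn down at each epoch.
class AbstractDataFetcherSketch:
    def __init__(self, dataloader):
        self.dataloader = dataloader
        self.dataloader_iter = None

    def __iter__(self):
        # self.reset()   # <- the line removed by the fix
        self.dataloader_iter = iter(self.dataloader)
        return self

    def __next__(self):
        return next(self.dataloader_iter)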
After the modification, I trained again. One iteration now takes only 1.5 seconds, compared with 15 seconds before the fix and 3 seconds with vanilla PyTorch. That is a substantial improvement.
I reported the bug to the Lightning team, who fixed the issue and pushed a patch the next day. I then updated the library and confirmed that their fix works. I believe more people will benefit from it, and the training and testing time of their Lightning models will improve. If you haven't updated your dependencies recently, try installing pytorch-lightning 1.5.1 or later!
Original link: https://medium.com/@florian-ernst/finding-why-pytorch-lightning-made-my-training-4x-slower-ae64a4720bd1