Pytorch Lightning Complete Walkthrough


author | Takanashi@zhihu (authorized)

source | https://zhuanlan.zhihu.com/p/353985363

editor | pole city platform

This article is published with the author's authorization. Please contact the original author before reprinting.

Foreword

I have "discovered" Pytorch-Lightning twice. The first time, it felt heavy and hard to learn, and I didn't think I would ever use it. But as my projects started to have slightly higher-level requirements, I noticed I was spending a lot of time on similar engineering code, and most of my debugging time went there too. A contradiction gradually emerged: if you want more and better features such as TensorBoard support, early stopping, LR schedulers, distributed training and quick testing, the code inevitably grows longer and messier, and the core training logic gets buried under all that engineering code. Is there a better solution, ideally one that solves all of these problems at once?

So I discovered Pytorch-Lightning for the second time.

And it really is as good as people say.

But a problem remains: the framework is not any easier to learn just because it is good. The official tutorials are rich, and you can tell the developers put in real effort. However, many related knowledge points are scattered across different sections, and some crucial points are not emphasized but only mentioned in passing. That made me want to write an all-in-one tutorial covering everything I consider important in the learning process: the useful parameters, points to watch out for, pitfalls, a large number of example code snippets, and a focused explanation of some core issues.

Finally, the third part provides a template that I have put together which is easy to use for large projects, easy to migrate, and easy to reuse. If you are interested, you can try it on GitHub: https://github.com/miracleyoo/pytorch-lightning-template.

core

  • A great feature of Pytorch-Lightning is that it separates the model from the system. The model is a pure model such as ResNet18 or an RNN, while the system defines how a group of models interact with each other, such as a GAN (generator and discriminator), Seq2Seq (encoder and decoder) or BERT. Sometimes the problem involves only one model, in which case the system can be a generic system describing how the model is used, reusable in many other projects.
  • The core design philosophy of Pytorch-Lightning is "self-sufficiency": each network also includes how it is trained, how it is tested, and its optimizer definitions.


Recommended method

This part is placed at the front because the article is long, and this essential part would be easy to miss if it came later.

Pytorch-Lightning is a good library, or rather an abstraction and wrapper over pytorch. Its advantages are strong reusability, easy maintenance and clear logic. The disadvantage is equally obvious: there is a lot to learn and understand in this package; in other words, it is heavy. If you write code directly from the official template, a small project is fine. But for a large project, with multiple models and datasets to debug and verify, it is not so easy to handle and can even become more troublesome. After a few days of exploration and debugging, I arrived at the following set of templates, which can also be seen as a further abstraction on top of Pytorch-Lightning.

I encourage you to try this code style. Once you are used to it, it is quite easy to reuse and hard to give up.

    root-
        |-data
            |-__init__.py
            |-data_interface.py
            |-xxxdataset1.py
            |-xxxdataset2.py
            |-...
        |-model
            |-__init__.py
            |-model_interface.py
            |-xxxmodel1.py
            |-xxxmodel2.py
            |-...
        |-main.py

If you convert every model directly into a pl.LightningModule, migrating existing projects or other people's code becomes quite time-consuming. On top of that, you would have to add similar code to every model, such as training_step and validation_step. Obviously that is not what we want: it is hard to maintain and may end up even messier. Likewise, if every dataset class is converted directly into a pl DataModule, you face the same problem. With this in mind, I recommend the architecture above:

  • Put only a single main.py file in the root directory.
  • Put an __init__.py in each of the data and model folders. This makes imports easy. The two init files contain, respectively: from .data_interface import DInterface and from .model_interface import MInterface.
  • In data_interface, create a class DInterface(pl.LightningDataModule) that serves as the interface for all dataset files. Import the corresponding Dataset class in __init__(), instantiate it in setup(), and dutifully add the train_dataloader, val_dataloader and test_dataloader functions. These functions tend to be nearly identical, with a few input arguments controlling the differences.
  • Similarly, create a class MInterface(pl.LightningModule) in model_interface as the intermediate interface for models. Import the corresponding model class in __init__(), then dutifully add configure_optimizers, training_step, validation_step and the other functions, so that a single interface class controls all model behaviour. Different models are selected via input arguments (see the sketch after this list).
  • main.py is only responsible for: defining the parser and adding parse items; choosing the required callbacks; and instantiating MInterface, DInterface and Trainer.
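A minimal sketch of what model_interface.py could look like under this layout. The class name MInterface comes from the description above, but the model-selection mechanism, the placeholder model classes and the loss used here are illustrative assumptions rather than the template repo's exact code:

    # model/model_interface.py -- illustrative sketch, not the template repo's exact code
    import torch.nn.functional as F
    import torch.optim as optim
    import pytorch_lightning as pl

    from .xxxmodel1 import XxxModel1   # assumed placeholder models living in model/
    from .xxxmodel2 import XxxModel2

    MODELS = {'xxxmodel1': XxxModel1, 'xxxmodel2': XxxModel2}


    class MInterface(pl.LightningModule):
        def __init__(self, model_name='xxxmodel1', lr=1e-3, **model_kwargs):
            super().__init__()
            self.save_hyperparameters()
            # one interface class; the concrete model is chosen by an input argument
            self.model = MODELS[model_name](**model_kwargs)

        def forward(self, x):
            return self.model(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = F.cross_entropy(self(x), y)
            self.log('train_loss', loss)
            return loss

        def validation_step(self, batch, batch_idx):
            x, y = batch
            self.log('val_loss', F.cross_entropy(self(x), y))

        def configure_optimizers(self):
            return optim.Adam(self.parameters(), lr=self.hparams.lr)

DInterface follows the same pattern on the pl.LightningDataModule side.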

Done.

The full template can be found on GitHub: https://github.com/miracleyoo/pytorch-lightning-template.

Lightning Module

Introduction

Homepage: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html

Three core components:

  • Model
  • Optimizer
  • Train/Val/Test step
  • Data flow pseudocode:

    outs = []
    for batch in data:
        out = training_step(batch)
        outs.append(out)
    training_epoch_end(outs)

    Equivalent Lightning code:

    def training_step(self, batch, batch_idx):
        prediction = ...
        return prediction

    def training_epoch_end(self, training_step_outputs):
        for prediction in training_step_outputs:
            # do something with these
            ...

    Just like filling in the blanks, fill in these functions.

    components and functions

    API page: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html%23lightningmodule-api

The parts that a Pytorch-Lightning module must contain are:

    • __init__: initialization, including the definition of the model and the system.
    • training_step(self, batch, batch_idx): the processing function for each batch.

    parameters:
    batch (Tensor | (Tensor, ...) | [Tensor, ...]) – The output of your DataLoader. A tensor, tuple or list.
    batch_idx (int) – Integer displaying index of this batch.
    optimizer_idx (int) – When using multiple optimizers, this argument will also be present.
    hiddens (Tensor) – Passed in if truncated_bptt_steps > 0.

    Return value: any of

    • Tensor – The loss tensor
    • dict – A dictionary. Can include any keys, but must include the key 'loss'
    • None – Training will skip to the next batch

In other words, the return value must carry a loss: if a dictionary is returned, the 'loss' key is required, and if None is returned, that batch is skipped. Example:

    def training_step(self, batch, batch_idx):
        x, y, z = batch
        out = self.encoder(x)
        loss = self.loss(out, x)
        return loss

    # Multiple optimizers (e.g.: GANs)
    def training_step(self, batch, batch_idx, optimizer_idx):
        if optimizer_idx == 0:
            # do training_step with encoder
            ...
        if optimizer_idx == 1:
            # do training_step with decoder
            ...

    # Truncated back-propagation through time
    def training_step(self, batch, batch_idx, hiddens):
        # hiddens are the hidden states from the previous truncated backprop step
        ...
        out, hiddens = self.lstm(data, hiddens)
        ...
        return {'loss': loss, 'hiddens': hiddens}

configure_optimizers: defines the optimizers. It returns one optimizer, several optimizers, or two lists (optimizers, schedulers). For example:

    # most cases
    def configure_optimizers(self):
        opt = Adam(self.parameters(), lr=1e-3)
        return opt

    # multiple optimizer case (e.g.: GAN)
    def configure_optimizers(self):
        generator_opt = Adam(self.model_gen.parameters(), lr=0.01)
        discriminator_opt = Adam(self.model_disc.parameters(), lr=0.02)
        return generator_opt, discriminator_opt

    # example with learning rate schedulers
    def configure_optimizers(self):
        generator_opt = Adam(self.model_gen.parameters(), lr=0.01)
        discriminator_opt = Adam(self.model_disc.parameters(), lr=0.02)
        discriminator_sched = CosineAnnealing(discriminator_opt, T_max=10)
        return [generator_opt, discriminator_opt], [discriminator_sched]

    # example with step-based learning rate schedulers
    def configure_optimizers(self):
        gen_opt = Adam(self.model_gen.parameters(), lr=0.01)
        dis_opt = Adam(self.model_disc.parameters(), lr=0.02)
        gen_sched = {'scheduler': ExponentialLR(gen_opt, 0.99),
                     'interval': 'step'}  # called after each training step
        dis_sched = CosineAnnealing(dis_opt, T_max=10)  # called every epoch
        return [gen_opt, dis_opt], [gen_sched, dis_sched]

    # example with optimizer frequencies
    # see training procedure in `Improved Training of Wasserstein GANs`, Algorithm 1
    # https://arxiv.org/abs/1704.00028
    def configure_optimizers(self):
        gen_opt = Adam(self.model_gen.parameters(), lr=0.01)
        dis_opt = Adam(self.model_disc.parameters(), lr=0.02)
        n_critic = 5
        return (
            {'optimizer': dis_opt, 'frequency': n_critic},
            {'optimizer': gen_opt, 'frequency': 1}
        )

    The components that can be specified are:

    • forward: the same as in a normal nn.Module, used for inference. Internally it is called as y = self(batch).
    • training_step_end: only needed when training on multiple nodes/GPUs and the result involves operations such as softmax that must be computed jointly over all outputs. Similarly there are validation_step_end / test_step_end.
    • training_epoch_end: called at the end of a training epoch; its input is a list whose elements are the values returned by each call to training_step(); it returns None.
    • validation_step(self, batch, batch_idx) / test_step(self, batch, batch_idx): no restriction on the return value; it does not have to output a val_loss.
    • validation_epoch_end / test_epoch_end

    The utility functions are:

    • freeze: freezes all weights for use at prediction time. Only used once training is finished and the model is only being tested afterwards.
    • print: the built-in print function can also be used, but if the program runs on a distributed system it will print multiple times, whereas self.print() prints only once.
    • log: logs to loggers such as TensorBoard. Every logged scalar gets an x-coordinate, which may be a batch index or an epoch index. on_step means the x-coordinate of the logged quantity is the current batch, while on_epoch means the quantity is accumulated over the whole epoch and logged with the current epoch as the x-coordinate.

    The defaults of these flags depend on the LightningModule hook from which self.log is called: in training_step, on_step defaults to True and on_epoch to False; in validation_step/test_step and the epoch-end hooks, on_step defaults to False and on_epoch to True; prog_bar defaults to False and logger to True (* the validation defaults also apply to the test loop).

    parameters:
    name (str) – key name
    value (Any) – the value to log
    prog_bar (bool) – if True, logs to the progress bar
    logger (bool) – if True, logs to the logger
    on_step (Optional[bool]) – if True, logs at this step. None auto-logs for training_step but not validation/test_step
    on_epoch (Optional[bool]) – if True, logs epoch-accumulated metrics. None auto-logs for the val/test step but not training_step
    reduce_fx (Callable) – reduction function over step values at the end of the epoch. torch.mean by default
    tbptt_reduce_fx (Callable) – function to reduce on truncated back-propagation
    tbptt_pad_token (int) – token to use for padding
    enable_graph (bool) – if True, will not auto-detach the graph
    sync_dist (bool) – if True, reduces the metric across GPUs/TPUs
    sync_dist_op (Union[Any, str]) – the op to sync across GPUs/TPUs
    sync_dist_group (Optional[Any]) – the ddp group

    • log_dict: the only difference from the log function is that the name and value arguments are replaced by a dictionary, so several values are logged at once. For example: values = {'loss': loss, 'acc': acc, ..., 'metric_n': metric_n}; self.log_dict(values)
    • save_hyperparameters: saves all hyperparameters passed to __init__. They can later be accessed via self.hparams.argX, and the hyperparameter table is also saved to file (see the example after this list).
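A small sketch showing how save_hyperparameters and log_dict might be used together. The network, loss and accuracy metric here are placeholders chosen for illustration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import pytorch_lightning as pl


    class LitMLP(pl.LightningModule):
        def __init__(self, hidden_dim=128, lr=1e-3):
            super().__init__()
            # hidden_dim and lr become available as self.hparams.* and are stored in checkpoints
            self.save_hyperparameters()
            self.net = nn.Sequential(
                nn.Flatten(),
                nn.Linear(28 * 28, self.hparams.hidden_dim),
                nn.ReLU(),
                nn.Linear(self.hparams.hidden_dim, 10),
            )

        def training_step(self, batch, batch_idx):
            x, y = batch
            logits = self.net(x)
            loss = F.cross_entropy(logits, y)
            acc = (logits.argmax(dim=-1) == y).float().mean()
            # log several scalars in one call
            self.log_dict({'train_loss': loss, 'train_acc': acc})
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)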

    Built-in variables:

    • device: you can use self.device to build device-agnostic tensors. For example: z = torch.rand(2, 3, device=self.device).
    • hparams : Contains all previously saved input hyperparameters.
    • precision : precision. 32 and 16 are common.

    points

If you plan to use DataParallel, then inside training_step you need to call forward through the module itself, i.e. z = self(x).
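For example (a minimal sketch; the loss and batch structure are placeholders):

    def training_step(self, batch, batch_idx):
        x, y = batch
        # call the module itself (self(x)) rather than a submodule's forward directly,
        # so that DataParallel can wrap and scatter the call correctly
        z = self(x)
        loss = F.cross_entropy(z, y)
        return loss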

    template

    class LitModel(pl.LightningModule):
        def __init__(...):
        def forward(...):
        def training_step(...)
        def training_step_end(...)
        def training_epoch_end(...)
        def validation_step(...)
        def validation_step_end(...)
        def validation_epoch_end(...)
        def test_step(...)
        def test_step_end(...)
        def test_epoch_end(...)
        def configure_optimizers(...)
        def any_extra_hook(...)

    Trainer

    Basic use

    model = MyLightningModule()
    trainer = Trainer()
    trainer.fit(model, train_dataloader, val_dataloader)

If validation_step is not defined, the val_dataloader can simply be omitted.

    pseudocode and hooks

    Hooks page: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html%23hooks

    def fit(...):
        on_fit_start()

        if global_rank == 0:
            # prepare data is called on GLOBAL_ZERO only
            prepare_data()

        for gpu/tpu in gpu/tpus:
            train_on_device(model.copy())

        on_fit_end()

    def train_on_device(model):
        # setup is called PER DEVICE
        setup()
        configure_optimizers()
        on_pretrain_routine_start()

        for epoch in epochs:
            train_loop()

        teardown()

    def train_loop():
        on_train_epoch_start()
        train_outs = []
        for train_batch in train_dataloader():
            on_train_batch_start()

            # ----- train_step methods -------
            out = training_step(batch)
            train_outs.append(out)

            loss = out.loss

            backward()
            on_after_backward()
            optimizer_step()
            on_before_zero_grad()
            optimizer_zero_grad()

            on_train_batch_end(out)

            if should_check_val:
                val_loop()

        # end training epoch
        logs = training_epoch_end(outs)

    def val_loop():
        model.eval()
        torch.set_grad_enabled(False)

        on_validation_epoch_start()
        val_outs = []
        for val_batch in val_dataloader():
            on_validation_batch_start()

            # -------- val step methods -------
            out = validation_step(val_batch)
            val_outs.append(out)

            on_validation_batch_end(out)

        validation_epoch_end(val_outs)
        on_validation_epoch_end()

        # set up for train
        model.train()
        torch.set_grad_enabled(True)

    recommended parameters

    parameter introduction (with video) — https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html%23trainer -flags

    class definition and default parameters — https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html%23trainer-class-api

default_root_dir: the default storage path. All experiment artifacts and weights are stored in this folder. It is recommended to use a separate folder for each model. Each retraining creates a new version_x subfolder.

    max_epochs : Maximum number of training epochs. trainer = Trainer(max_epochs=1000)

    min_epochs : Minimum number of training epochs. Used when there is an Early Stop.

    auto_scale_batch_size : Automatically select an appropriate batch size before doing any training.

    # default used by the Trainer (no scaling of batch size)
    trainer = Trainer(auto_scale_batch_size=None)

    # run batch size scaling, result overrides hparams.batch_size
    trainer = Trainer(auto_scale_batch_size='binsearch')

    # call tune to find the batch size
    trainer.tune(model)

auto_select_gpus: automatically select suitable GPUs. Especially useful when GPUs are in exclusive mode.

    auto_lr_find : Automatically find a suitable initial learning rate. Techniques from the https://arxiv.org/abs/1506.01186 paper were used. Works if and only if the trainer.tune(model) code is executed.

    # run learning rate finder, results override hparams.learning_rate
    trainer = Trainer(auto_lr_find=True)

    # run learning rate finder, results override hparams.my_lr_arg
    trainer = Trainer(auto_lr_find='my_lr_arg')

    # call tune to find the lr
    trainer.tune(model)

precision: numerical precision. The normal value is 32. Using 16 reduces memory consumption and allows larger batches.

    # default used by the Trainer
    trainer = Trainer(precision=32)

    # 16-bit precision
    trainer = Trainer(precision=16, gpus=1)

val_check_interval: how often to run validation. The normal value is 1.0 (once per training epoch); 0.25 means validating 4 times per epoch, and 1000 means validating every 1000 batches.

Use a float to check within a training epoch: the value is then a fraction of an epoch, and validation runs once per that fraction. Use an int to check every n steps (batches): validation runs every n training batches.

    # default used by the Trainer
    trainer = Trainer(val_check_interval=1.0)

    # check validation set 4 times during a training epoch
    trainer = Trainer(val_check_interval=0.25)

    # check validation set every 1000 training batches
    # use this when using iterableDataset and your dataset has no length
    # (ie: production cases with streaming data)
    trainer = Trainer(val_check_interval=1000)

    gpus : Controls the number of GPUs used. When set to None, the cpu is used.

    # default used by the Trainer (ie: train on CPU)
    trainer = Trainer(gpus=None)

    # equivalent
    trainer = Trainer(gpus=0)

    # int: train on 2 gpus
    trainer = Trainer(gpus=2)

    # list: train on GPUs 1, 4 (by bus ordering)
    trainer = Trainer(gpus=[1, 4])
    trainer = Trainer(gpus='1, 4')  # equivalent

    # -1: train on all gpus
    trainer = Trainer(gpus=-1)
    trainer = Trainer(gpus='-1')  # equivalent

    # combine with num_nodes to train on multiple GPUs across nodes
    # uses 8 gpus in total
    trainer = Trainer(gpus=2, num_nodes=4)

    # train only on GPUs 1 and 4 across nodes
    trainer = Trainer(gpus=[1, 4], num_nodes=4)

limit_train_batches: the fraction of the training data to use. Use this if you have too much data or are debugging. The value ranges from 0 to 1. Similarly, there are limit_test_batches and limit_val_batches.

    # default used by the Trainer
    trainer = Trainer(limit_train_batches=1.0)

    # run through only 25% of the training set each epoch
    trainer = Trainer(limit_train_batches=0.25)

    # run through only 10 batches of the training set each epoch
    trainer = Trainer(limit_train_batches=10)

fast_dev_run: a bool. If set to True, only one batch each of train, val and test is executed, and then the run ends. For debugging only.

Setting this argument will disable the tuner, checkpoint callbacks, early stopping callbacks, loggers and logger callbacks like LearningRateLogger, and will run for only 1 epoch.

    # default used by the Trainer
    trainer = Trainer(fast_dev_run=False)

    # runs 1 train, val, test batch and program ends
    trainer = Trainer(fast_dev_run=True)

    # runs 7 train, val, test batches and program ends
    trainer = Trainer(fast_dev_run=7)

    .fit() Function

Trainer.fit(model, train_dataloader=None, val_dataloaders=None, datamodule=None): the first argument must be the model. It can then be followed by either a LightningDataModule or a normal train DataLoader, plus a val DataLoader if a validation step is defined.

    parameters:
    datamodule (Optional[LightningDataModule]) – An instance of LightningDataModule.
    model (LightningModule) – Model to fit.
    train_dataloader (Optional[DataLoader]) – A Pytorch DataLoader with training samples. If the model has a predefined train_dataloader method, this will be skipped.
    val_dataloaders (Union[DataLoader, List[DataLoader], None]) – Either a single Pytorch DataLoader or a list of them, specifying validation samples. If the model has a predefined val_dataloaders method, this will be skipped.

    Other points

    • .test() is not run automatically; it only runs when called directly: trainer.test().
    • .test() automatically loads the best model.
    • model.eval() and torch.no_grad() are called automatically during testing.
    • By default, Trainer() runs on the CPU.
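For example, a minimal sketch reusing the LitClassifier and MNISTDataModule defined elsewhere in this article (argument values are illustrative):

    dm = MNISTDataModule(batch_size=64)
    model = LitClassifier(...)
    trainer = Trainer(gpus=1, max_epochs=10)
    trainer.fit(model, datamodule=dm)

    # .test() must be called explicitly; it loads the best checkpoint saved during
    # fit and runs under model.eval() / torch.no_grad() automatically
    trainer.test(datamodule=dm)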

    Example of use

    1. Manually add command line parameters:

    from argparse import ArgumentParser

    def main(hparams):
        model = LightningModule()
        trainer = Trainer(gpus=hparams.gpus)
        trainer.fit(model)

    if __name__ == '__main__':
        parser = ArgumentParser()
        parser.add_argument('--gpus', default=None)
        args = parser.parse_args()
        main(args)

2. Automatically add all the command-line arguments that the Trainer will use:

    from argparse import ArgumentParser

    def main(args):
        model = LightningModule()
        trainer = Trainer.from_argparse_args(args)
        trainer.fit(model)

    if __name__ == '__main__':
        parser = ArgumentParser()
        parser = Trainer.add_argparse_args(
            # group the Trainer arguments together
            parser.add_argument_group(title="pl.Trainer args")
        )
        args = parser.parse_args()
        main(args)

3. Hybrid: use both the Trainer-related arguments and some custom arguments, such as model hyperparameters:

    from argparse import ArgumentParser
    import pytorch_lightning as pl
    from pytorch_lightning import LightningModule, Trainer

    def main(args):
        model = LightningModule()
        trainer = Trainer.from_argparse_args(args)
        trainer.fit(model)

    if __name__ == '__main__':
        parser = ArgumentParser()
        parser.add_argument('--batch_size', default=32, type=int)
        parser.add_argument('--hidden_dim', type=int, default=128)
        parser = Trainer.add_argparse_args(
            # group the Trainer arguments together
            parser.add_argument_group(title="pl.Trainer args")
        )
        args = parser.parse_args()
        main(args)

    all parameters

    Trainer.__init__(logger=True, checkpoint_callback=True, callbacks=None, default_root_dir=None,
        gradient_clip_val=0, process_position=0, num_nodes=1, num_processes=1, gpus=None,
        auto_select_gpus=False, tpu_cores=None, log_gpu_memory=None, progress_bar_refresh_rate=None,
        overfit_batches=0.0, track_grad_norm=-1, check_val_every_n_epoch=1, fast_dev_run=False,
        accumulate_grad_batches=1, max_epochs=None, min_epochs=None, max_steps=None, min_steps=None,
        limit_train_batches=1.0, limit_val_batches=1.0, limit_test_batches=1.0, limit_predict_batches=1.0,
        val_check_interval=1.0, flush_logs_every_n_steps=100, log_every_n_steps=50, accelerator=None,
        sync_batchnorm=False, precision=32, weights_summary='top', weights_save_path=None,
        num_sanity_val_steps=2, truncated_bptt_steps=None, resume_from_checkpoint=None, profiler=None,
        benchmark=False, deterministic=False, reload_dataloaders_every_epoch=False, auto_lr_find=False,
        replace_sampler_ddp=True, terminate_on_nan=False, auto_scale_batch_size=False,
        prepare_data_per_node=True, plugins=None, amp_backend='native', amp_level='O2',
        distributed_backend=None, move_metrics_to_cpu=False, multiple_trainloader_mode='max_size_cycle',
        stochastic_weight_avg=False)

What logging and the returned loss actually do

    To add a training loop use the training_step method.

    class LitClassifier(pl.LightningModule):
        def __init__(self, model):
            super().__init__()
            self.model = model

        def training_step(self, batch, batch_idx):
            x, y = batch
            y_hat = self.model(x)
            loss = F.cross_entropy(y_hat, y)
            return loss

In training_step, validation_step and test_step alike, the return value is typically a loss, and the returned values are collected into a list.

    Under the hood, Lightning does the following (pseudocode):

    # put model in train mode
    model.train()
    torch.set_grad_enabled(True)

    losses = []
    for batch in train_dataloader:
        # forward
        loss = training_step(batch)
        losses.append(loss.detach())

        # backward
        loss.backward()

        # apply and clear grads
        optimizer.step()
        optimizer.zero_grad()

    Training epoch-level metrics

    If you want to calculate epoch-level metrics and log them, use the .log method.

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = F.cross_entropy(y_hat, y)

        # logs metrics for each training_step,
        # and the average across the epoch, to the progress bar and logger
        self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss

If the .log() function is used in training_step, the quantity is recorded step by step. Every logged variable is recorded: each step produces a dict, and over an epoch these dicts are collected into a list of dicts.

    The .log object automatically reduces the requested metrics across the full epoch. Here's the pseudocode of what it does under the hood:

    outs = []
    for batch in train_dataloader:
        # forward
        out = training_step(batch)
        outs.append(out)

        # backward
        loss.backward()

        # apply and clear grads
        optimizer.step()
        optimizer.zero_grad()

    epoch_metric = torch.mean(torch.stack([x['train_loss'] for x in outs]))

    Train epoch-level operations

    If you need to do something with all the outputs of each training_step, override training_epoch_end yourself.

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = F.cross_entropy(y_hat, y)
        preds = ...
        return {'loss': loss, 'other_stuff': preds}

    def training_epoch_end(self, training_step_outputs):
        for pred in training_step_outputs:
            # do something with pred
            ...

    The matching pseudocode is:

    outs = []
    for batch in train_dataloader:
        # forward
        out = training_step(batch)
        outs.append(out)

        # backward
        loss.backward()

        # apply and clear grads
        optimizer.step()
        optimizer.zero_grad()

    training_epoch_end(outs)

    DataModule

    Homepage: https://pytorch-lightning.readthedocs.io/en/latest/extensions/datamodules.html

    intro

First, this DataModule does not conflict at all with the Dataset classes you wrote before. The former is a wrapper for the latter, and the wrapper can be applied to multiple torch Datasets. In my opinion, its biggest benefit is that it lets you reuse, through a single wrapper class, the repetitive code for train/val/test splits and DataLoader initialization.

Its specific responsibilities are:

    • Download: how to download the data
    • Process: how to process the data
    • Split: how to split the data
    • Train dataloader: the training set DataLoader
    • Val dataloader(s): the validation set DataLoader(s)
    • Test dataloader(s): the test set DataLoader(s)

The functions to implement include:

    prepare_data(self):

    • One-off operations such as downloading and tokenizing.
    • This is the place to prepare the data once and for all.
    • Since it is called on a single process only, do not perform state assignments such as self.x = y in this function.
    • If you only use the data yourself rather than distributing the module to others, this function may not be needed at all, because the data can be preprocessed in advance.

    setup(self, stage=None)

    • Instantiates the datasets (Dataset) and performs related operations, such as counting the number of classes and splitting the train/val/test sets.
    • The stage parameter indicates whether we are in the training period (fit) or the test period (test); the fit stage requires building both the train and val datasets.
    • The setup function does not need a return value; the initialized train/val/test sets can be assigned directly to self.

    train_dataloader/val_dataloader/test_dataloader :

    • Initializes the DataLoaders.
    • Returns a DataLoader object.

    Example

    class MNISTDataModule(pl.LightningDataModule):
        def __init__(self, data_dir: str = './', batch_size: int = 64, num_workers: int = 8):
            super().__init__()
            self.data_dir = data_dir
            self.batch_size = batch_size
            self.num_workers = num_workers
            self.transform = transforms.Compose([
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,))
            ])

            # self.dims is returned when you call dm.size()
            # Setting default dims here because we know them.
            # Could optionally be assigned dynamically in dm.setup()
            self.dims = (1, 28, 28)
            self.num_classes = 10

        def prepare_data(self):
            # download
            MNIST(self.data_dir, train=True, download=True)
            MNIST(self.data_dir, train=False, download=True)

        def setup(self, stage=None):
            # Assign train/val datasets for use in dataloaders
            if stage == 'fit' or stage is None:
                mnist_full = MNIST(self.data_dir, train=True, transform=self.transform)
                self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])

            # Assign test dataset for use in dataloader(s)
            if stage == 'test' or stage is None:
                self.mnist_test = MNIST(self.data_dir, train=False, transform=self.transform)

        def train_dataloader(self):
            return DataLoader(self.mnist_train, batch_size=self.batch_size, num_workers=self.num_workers)

        def val_dataloader(self):
            return DataLoader(self.mnist_val, batch_size=self.batch_size, num_workers=self.num_workers)

        def test_dataloader(self):
            return DataLoader(self.mnist_test, batch_size=self.batch_size, num_workers=self.num_workers)

    gist

If a self.dims variable is defined in the DataModule, you can later retrieve it by calling dm.size().
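For example, with the MNISTDataModule defined above:

    dm = MNISTDataModule()
    dm.size()        # returns self.dims, i.e. (1, 28, 28)
    dm.num_classes   # ordinary attributes can of course also be read directly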

    Saving and Loading

    Homepage: https://pytorch-lightning.readthedocs.io/en/latest/common/weights_loading.html

    Saving

    ModelCheckpoint Address: https://pytorch-lightning. readthedocs.io/en/latest/extensions/generated/pytorch_lightning.callbacks.ModelCheckpoint.html%23pytorch_lightning.callbacks.ModelCheckpoint

ModelCheckpoint: the callback module for automatic saving. By default, only the latest model and related parameters are saved during training, but users can customize this through the module: for example, monitor a quantity such as val_loss, keep the top 3 models under that metric, and also save the model from the last epoch, and so on. Example:

    from pytorch_lightning.callbacks import ModelCheckpoint

    # saves a file like: my/path/sample-mnist-epoch=02-val_loss=0.32.ckpt
    checkpoint_callback = ModelCheckpoint(
        monitor='val_loss',
        filename='sample-mnist-{epoch:02d}-{val_loss:.2f}',
        save_top_k=3,
        mode='min',
        save_last=True
    )

    trainer = pl.Trainer(gpus=1, max_epochs=3, progress_bar_refresh_rate=20, callbacks=[checkpoint_callback])
    • You can also store a checkpoint manually: trainer.save_checkpoint("example.ckpt").
    • For the ModelCheckpoint callback, if save_weights_only=True, only the weights are saved (equivalent to model.save_weights(filepath)); otherwise the whole training state is saved (equivalent to model.save(filepath)).
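For example, a sketch of the two options mentioned above (the monitor and save_top_k values are illustrative):

    # keep the full training state in each checkpoint (the default behaviour)
    full_ckpt = ModelCheckpoint(monitor='val_loss', save_top_k=3)

    # keep only the model weights, analogous to model.save_weights(filepath)
    weights_ckpt = ModelCheckpoint(monitor='val_loss', save_top_k=3, save_weights_only=True)

    trainer = pl.Trainer(callbacks=[weights_ckpt])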

    Loading

    Load a model, including its weights, biases and hyperparameters:

    model = MyLightningModule.load_from_checkpoint(PATH)

    print(model.learning_rate)
    # prints the learning_rate you used in this checkpoint

    model.eval()
    y_hat = model(x)

    Replace some hyperparameters when loading the model:

    class LitModel(LightningModule):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.save_hyperparameters()
            self.l1 = nn.Linear(self.hparams.in_dim, self.hparams.out_dim)

    # if you train and save the model like this it will use these values when loading
    # the weights. But you can overwrite this
    LitModel(in_dim=32, out_dim=10)

    # uses in_dim=32, out_dim=10
    model = LitModel.load_from_checkpoint(PATH)

    # uses in_dim=128, out_dim=10
    model = LitModel.load_from_checkpoint(PATH, in_dim=128, out_dim=10)

Fully restoring the training state: loading then includes everything about the model plus all training-related state, such as the epoch, step, LR schedulers, apex state, and so on.

    model = LitModel()
    trainer = Trainer(resume_from_checkpoint='some/path/to/my_checkpoint.ckpt')

    # automatically restores model, epoch, step, LR schedulers, apex, etc...
    trainer.fit(model)

    Callbacks

A Callback is a self-contained program that can be interleaved with the training process without polluting the main research logic.

    Callback is not only called at the end of epoch. pytorch-lightning provides dozens of hooks (interfaces, call locations) to choose from, and you can also customize callbacks to implement any module you want to implement.

The recommended usage: operations that change with the problem and project should be written into the LightningModule itself, while relatively independent, auxiliary functionality that needs to be reused is better defined as a separate Callback module, so it can be plugged in and out conveniently later.
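A minimal sketch of a custom Callback. The hook names are real Lightning hooks; the timing logic itself is just an illustration of a small, pluggable, reusable module:

    import time

    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import Callback


    class EpochTimer(Callback):
        """Reports how long each training epoch takes."""

        def on_train_epoch_start(self, trainer, pl_module):
            self._t0 = time.time()

        def on_train_epoch_end(self, trainer, pl_module, *args):
            pl_module.print(f"epoch {trainer.current_epoch} took {time.time() - self._t0:.1f}s")


    trainer = pl.Trainer(callbacks=[EpochTimer()])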

    Callbacks Recommended

    Built-in Callbacks: https://pytorch-lightning.readthedocs.io/en/latest/extensions/callbacks.html%23built-in-callbacks

EarlyStopping(monitor='early_stop_on', min_delta=0.0, patience=3, verbose=False, mode='min', strict=True): monitors a quantity and stops training early if it shows no improvement for several epochs.

    Parameters:
    monitor (str) – quantity to be monitored. Default: 'early_stop_on'.
    min_delta (float) – minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than min_delta will count as no improvement. Default: 0.0.
    patience (int) – number of validation epochs with no improvement after which training will be stopped. Default: 3.
    verbose (bool) – verbosity mode. Default: False.
    mode (str) – one of 'min', 'max'. In 'min' mode, training will stop when the quantity monitored has stopped decreasing; in 'max' mode it will stop when the quantity monitored has stopped increasing.
    strict (bool) – whether to crash the training if monitor is not found in the validation metrics. Default: True.

Example:

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import EarlyStopping

    early_stopping = EarlyStopping('val_loss')
    trainer = Trainer(callbacks=[early_stopping])

ModelCheckpoint: see Saving and Loading above. PrintTableMetricsCallback: prints a summary table of results after each epoch.

    from pl_bolts.callbacks import PrintTableMetricsCallback

    callback = PrintTableMetricsCallback()
    trainer = pl.Trainer(callbacks=[callback])
    trainer.fit(...)

    # ------------------------------
    # at the end of every epoch it will print
    # ------------------------------
    # loss│train_loss│val_loss│epoch
    # ──────────────────────────────
    # 2.2541470527648926│2.2541470527648926│2.2158432006835938│0

    Logging

Logging: the logger is TensorBoard by default, but the mainstream logger frameworks can be specified instead, such as Comet.ml, MLflow, Neptune, or plain CSV files. Multiple loggers can be used at the same time.

    from pytorch_lightning import loggers as pl_loggers

    # Default
    tb_logger = pl_loggers.TensorBoardLogger(
        save_dir=os.getcwd(),
        version=None,
        name='lightning_logs'
    )
    trainer = Trainer(logger=tb_logger)

    # Or use the same format as others
    tb_logger = pl_loggers.TensorBoardLogger('logs/')

    # One Logger
    comet_logger = pl_loggers.CometLogger(save_dir='logs/')
    trainer = Trainer(logger=comet_logger)

    # Save code snapshot
    logger = pl_loggers.TestTubeLogger('logs/', create_git_tag=True)

    # Multiple Loggers
    tb_logger = pl_loggers.TensorBoardLogger('logs/')
    comet_logger = pl_loggers.CometLogger(save_dir='logs/')
    trainer = Trainer(logger=[tb_logger, comet_logger])

By default, logging happens once every 50 training batches; this frequency can be adjusted via a Trainer argument.
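For example, using the Trainer's log_every_n_steps argument (50 is its default, as shown in the full parameter list above):

    # default: write logs once every 50 training batches
    trainer = Trainer(log_every_n_steps=50)

    # log more frequently, e.g. every 10 batches
    trainer = Trainer(log_every_n_steps=10)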

If you want to log non-scalar content, such as images, text or histograms, you can directly call self.logger.experiment.add_xxx() to do what you need.

    def training_step(...):
        ...
        # the logger you used (in this case tensorboard)
        tensorboard = self.logger.experiment
        tensorboard.add_image()
        tensorboard.add_histogram(...)
        tensorboard.add_figure(...)

Viewing the logs: for TensorBoard, run tensorboard --logdir ./lightning_logs. In a Jupyter Notebook, you can use:

    # Start tensorboard.
    %load_ext tensorboard
    %tensorboard --logdir lightning_logs/

    to open TensorBoard inline.

    • Tip: if TensorBoard is running on a machine in the LAN, add the --bind_all flag so it can be accessed by hostname:

      tensorboard --logdir lightning_logs --bind_all
      # then open http://SERVER-NAME:6006/

    Transfer Learning

    Homepage: https://pytorch-lightning.readthedocs.io/en/latest/starter/introduction_guide.html%23transfer-learning

    import torchvision.models as models

    class ImagenetTransferLearning(LightningModule):
        def __init__(self):
            super().__init__()

            # init a pretrained resnet
            backbone = models.resnet50(pretrained=True)
            num_filters = backbone.fc.in_features
            layers = list(backbone.children())[:-1]
            self.feature_extractor = nn.Sequential(*layers)

            # use the pretrained model to classify cifar-10 (10 image classes)
            num_target_classes = 10
            self.classifier = nn.Linear(num_filters, num_target_classes)

        def forward(self, x):
            self.feature_extractor.eval()
            with torch.no_grad():
                representations = self.feature_extractor(x).flatten(1)
            x = self.classifier(representations)
            ...

    About device operation

    LightningModules know what device they are on! Construct tensors on the device directly to avoid CPU->Device transfer.

    # bad
    t = torch.rand(2, 2).cuda()

    # good (self is LightningModule)
    t = torch.rand(2, 2, device=self.device)

    For tensors that need to be model attributes, it is best practice to register them as buffers in the modules' __init__ method:

    # bad
    self.t = torch.rand(2, 2, device=self.device)

    # good
    self.register_buffer("t", torch.rand(2, 2))

The two snippets above come straight from the official tutorial. However, there is a hidden pitfall:

If your pl.LightningModule instantiates an ordinary nn.Module inside it, and that model needs to generate some tensors internally, such as the per-channel mean and std of an image, then passing self.device down from the pl.LightningModule does not help: at the beginning, this self.device is always cpu. So if you initialize those tensors in the nn.Module's __init__() with to(device), or with no device at all, they will stay on the cpu.

However, experiments show that although self.device is still cpu during pl.LightningModule's __init__() stage, it quickly becomes cuda once training_step() is entered. So, for submodules, the best solution is to take a tensor passed into forward, such as x, as a reference, and use the type_as function to place every tensor generated inside the model on the same device as that reference tensor.

    class RDNFuse(nn.Module):
        ...
        def init_norm_func(self, ref):
            self.mean = torch.tensor(np.array(self.mean_sen), dtype=torch.float32).type_as(ref)

        def forward(self, x):
            if not hasattr(self, 'mean'):
                self.init_norm_func(x)

    Points

    pl.seed_everything(1234): fixes the seed for all relevant sources of randomness.

    When using an LR scheduler, you do not need to call .step() yourself; the Trainer handles it automatically.

    related interface: https://pytorch-lightning.readthedocs.io/en/latest/common/optimizers.html%3Fhighlight%3Dscheduler%23

    # Single optimizer
    for epoch in epochs:
        for batch in data:
            loss = model.training_step(batch, batch_idx, ...)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        for scheduler in schedulers:
            scheduler.step()

    # Multiple optimizers
    for epoch in epochs:
        for batch in data:
            for opt in optimizers:
                disable_grads_for_other_optimizers()
                train_step(opt)
                opt.step()

        for scheduler in schedulers:
            scheduler.step()

How to split the train and val sets: this is not PL-specific, but it is very commonly used. Two examples. The first: random_split(range(10), [3, 7], generator=torch.Generator().manual_seed(42))

The second, in context:

    from torch.utils.data import DataLoader, random_split
    from torchvision.datasets import MNIST

    mnist_full = MNIST(self.data_dir, train=True, transform=self.transform)
    self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])

    Parameters:
    dataset (https://pytorch.org/docs/stable/data.html%23torch.utils.data.Dataset) – Dataset to be split
    lengths – lengths of splits to be produced
    generator (https://pytorch.org/docs/stable/generated/torch.Generator.html%23torch.Generator) – Generator used for the random permutation.

