All kinds of pytorch backward
Background in pytorch
Tensor and torch.autograd.Function are interconnected and build up an acyclic graph that encodes the complete history of computation. In PyTorch, this computation graph is connected through the grad_fn attribute of each tensor.
https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
torch.nn.functional
The typical functionals can be found at https://pytorch.org/docs/stable/nn.functional.html.
The functional versions are stateless and called directly, e.g. softmax, which has no internal state:
x = torch.nn.functional.softmax(x, dim=-1)
There are also functional versions of the various stateful network modules. In that case, you have to pass in the state (the parameters) yourself. Conceptually, for Linear, it would be something like:
x = torch.nn.functional.linear(x, weights_tensor)
The difference between torch.nn and torch.nn.functional is largely a matter of convenience and taste; torch.nn is more convenient for modules that have learnable parameters.
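To make the module/functional relationship concrete, the sketch below (my own example, not from the original post) reproduces an nn.Linear forward pass with torch.nn.functional.linear by passing in the module's own parameters:

import torch

x = torch.randn(4, 2)
layer = torch.nn.Linear(2, 3)    # stateful module: owns weight and bias

y_module = layer(x)
y_functional = torch.nn.functional.linear(x, layer.weight, layer.bias)

print(torch.allclose(y_module, y_functional))    # True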
All modules you use in PyTorch, if we trace them back at the Python level, actually invoke different kinds of torch.nn.functional methods. You can check a demo here:
https://pytorch.org/tutorials/beginner/examples_autograd/two_layer_net_custom_function.html
In this example, the operations involved are MyReLU, mm, pow, and sum.
Typical invocation order:
- torch.nn -> torch.nn.functional -> torch.Tensor ops -> torch._C._nn ops
- torch.nn -> torch.Tensor ops -> torch._C._nn ops
Typical implementation (invocation) patterns in torch.nn.functional (a quick sanity check follows this list):
- Some directly invoke Tensor methods, such as torch.nn.functional.sigmoid (which calls input.sigmoid())
- Some invoke C++ code, such as torch.nn.functional.conv1d
- Some invoke torch-level ops, such as torch.nn.functional.relu (which calls torch.relu)
- Some invoke torch._C._nn ops, such as torch.nn.functional.glu
- Some are implemented directly in Python inside torch.nn.functional, such as torch.nn.functional.gumbel_softmax
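Whichever path is taken, the Python-level entry points compute the same result; a quick sanity check (my own sketch):

import torch
import torch.nn.functional as F

x = torch.randn(5)

# The functional, the torch-level op, and the Tensor method all agree.
print(torch.allclose(F.relu(x), torch.relu(x)))   # True
print(torch.allclose(F.relu(x), x.relu()))        # True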
torch.autograd.Function
https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function
https://pytorch.org/tutorials/beginner/examples_autograd/two_layer_net_custom_function.html
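The linked tutorial defines a custom op by subclassing torch.autograd.Function and implementing both forward and backward; a condensed version of its MyReLU example:

import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)      # keep the input for the backward pass
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0         # gradient is zero where the input was negative
        return grad_input

x = torch.randn(3, requires_grad=True)
MyReLU.apply(x).sum().backward()
print(x.grad)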
Atomic Operation
Typical grad_fn values are AddBackward, PowBackward, AddmmBackward, DivBackward, etc.
For example:
x = torch.randn(2, 2, requires_grad=True)
z = x + 1
print(z.grad_fn)
# AddBackward
y = torch.nn.Linear(2, 2)(x)
print(y, y.grad_fn)
# AddmmBackward
Here grad_fn is AddmmBackward, because addmm is the last atomic operation inside Linear. A similar case is:
y = torch.nn.functional.cosine_similarity(x,x)
print(y.grad_fn)
#DivBackward0
Why is backward sometimes unnecessary in a function implementation?
When we write an atomic operation, we need a backward function so that autograd works. Sometimes a function does not need a backward implementation, because:
- The function is a high-level (non-atomic) function that can be expressed in terms of other differentiable atomic operations, such as the typical Conv1d and Linear operations (see the sketch below).
- The function has no differentiable input (no input requires a gradient).
- The function has no differentiable output.
torch.nn.CosineSimilarity in PyTorch is differentiable; see the mathematical proof. Note that PyTorch is only responsible for computing gradients, which is different from optimization: it cannot guarantee that your model will be optimized as you wish.
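For example (a minimal sketch of mine), a function written purely in terms of differentiable atomic operations needs no hand-written backward; autograd chains the existing gradients:

import torch

def my_cosine_similarity(a, b, eps=1e-8):
    # Composed of mul, sum, norm, clamp and div: all atomic ops with known backwards.
    return (a * b).sum(dim=-1) / (a.norm(dim=-1) * b.norm(dim=-1)).clamp(min=eps)

a = torch.randn(4, 3, requires_grad=True)
b = torch.randn(4, 3)
my_cosine_similarity(a, b).sum().backward()   # works without any custom backward
print(a.grad.shape)                           # torch.Size([4, 3])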
inplace=True
inplace=True means the operation modifies the input directly, without allocating any additional output (no extra node/vertex in the computation graph). It can slightly decrease memory usage, but it is not always a valid operation (because the original input is destroyed). However, if you don't see an error, your use case is valid.
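A minimal sketch of an invalid case (my own example): exp saves its output for the backward pass, so modifying that output in place makes backward raise a runtime error.

import torch

x = torch.randn(3, requires_grad=True)
y = torch.exp(x)        # autograd saves y, since d(exp)/dx = exp(x) = y
y.relu_()               # in-place op bumps y's version counter
try:
    y.sum().backward()
except RuntimeError as err:
    print(err)          # "... has been modified by an inplace operation ..."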
DDP
The difference between using and not using DistributedSampler in DDP
https://www.codenong.com/cs106162971/
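For reference, the usual pattern looks like the sketch below (a minimal example of mine, assuming torch.distributed has already been initialized by the DDP launcher); the sampler gives each rank a distinct shard of the dataset, and set_epoch reshuffles it every epoch:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Assumes torch.distributed.init_process_group(...) has already been called.
dataset = TensorDataset(torch.randn(100, 2), torch.randn(100))
sampler = DistributedSampler(dataset, shuffle=True)   # partitions indices by rank
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)        # different shuffle each epoch
    for inputs, targets in loader:
        pass                        # forward/backward/step of the DDP model here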
Different backward functions in PyTorch
A useful tool to create visualizations of PyTorch execution graphs and traces: pytorchviz
torch.nn.functional.softmax
This is a mapping from \(B \times C\) to \(B \times C\). The computation is independent across the batch dimension, so we can consider a single row, \(y_{i}=\frac{exp(x_{i})}{\sum_{j \in C} exp(x_{j})}\). The diagonal entries of the Jacobian are
\[\frac{\partial{y_{i}}}{\partial{x_{i}}} = y_{i}(1-y_{i}) = y_{i} - y_{i}^{2}\]
which has the same form as the derivative of a sigmoid, \(\sigma(x)(1-\sigma(x))\). The off-diagonal entries are \(\frac{\partial{y_{i}}}{\partial{x_{j}}} = -y_{i}y_{j}\) for \(i \neq j\), so the full Jacobian in matrix form is
\[\frac{\partial{Y}}{\partial{X}} = diag(Y) - YY^{T}\]
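A quick numerical check of this Jacobian (my own sketch, not from the original post):

import torch

x = torch.randn(4, dtype=torch.double)
y = torch.softmax(x, dim=0)

# Analytical Jacobian: diag(Y) - Y Y^T
analytic = torch.diag(y) - torch.outer(y, y)

# Jacobian computed by autograd
auto = torch.autograd.functional.jacobian(lambda t: torch.softmax(t, dim=0), x)

print(torch.allclose(analytic, auto))   # True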
torch.log
\[y_{i}=log_{e}(x_{i})\]
The gradient is
\[\frac{\partial{y_{i}}}{\partial{x_{i}}} = \frac{1}{x_{i}}\]
nn.LogSoftmax
\[y_{i}=log(softmax(x_{i})) = log(\frac{exp(x_{i})}{\sum_{j} exp(x_{j})})\]
This is the composition of torch.log and torch.nn.functional.softmax; applying the chain rule to the two gradients above gives the result.
torch.nn.NLLLoss
nn.NLLLoss + nn.LogSoftmax = nn.CrossEntropyLoss
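A quick numerical check of that identity (my own sketch):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 5)
targets = torch.randint(0, 5, (4,))

ce = F.cross_entropy(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(ce, nll))   # True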
torch.nn.functional.cross_entropy
\[L= - \sum_{j}^{C} y_{j} log(softmax(x_{j})) = - \sum_{j}^{C} y_{j} log(\frac{exp(x_{j})}{\sum_{m} exp(x_{m})})\]
Let \(z_{j}= \frac{exp(x_{j})}{\sum_{m} exp(x_{m})}\). Then, for a one-hot label \(y\),
\(\frac{\partial{L}}{\partial{x_{j}}} = z_{j}-1\) if \(y_{j}=1\), else \(\frac{\partial{L}}{\partial{x_{j}}} = z_{j}\)
check here
torch.cat
Concatenation can be performed along different dimensions. Since it only reorganizes the tensors without changing any values, the gradient is simply the upstream gradient split back to the inputs (reverse-organized, with values unchanged).
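A small demonstration (my own sketch) that the gradient of torch.cat is just the upstream gradient split back to the inputs:

import torch

a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, 3, requires_grad=True)

out = torch.cat([a, b], dim=0)       # shape (4, 3)
upstream = torch.randn(4, 3)
out.backward(upstream)

print(torch.equal(a.grad, upstream[:2]))   # True: first slice goes to a
print(torch.equal(b.grad, upstream[2:]))   # True: second slice goes to b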
How do gradients backpropagate when torch.no_grad() is used in the middle of a network?
https://arxiv.org/pdf/2006.09882.pdf
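A minimal sketch of the behavior: a tensor produced inside torch.no_grad() has no grad_fn, so it acts as a constant and gradients stop at that branch.

import torch

x = torch.randn(3, requires_grad=True)

with torch.no_grad():
    c = x * 2                        # history is not recorded here

print(c.requires_grad, c.grad_fn)    # False None

(x + c).sum().backward()             # c is treated as a constant
print(x.grad)                        # tensor([1., 1., 1.]): only the direct x path contributes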
torch.mean
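For \(y = \frac{1}{N}\sum_{i} x_{i}\), every element receives the same gradient \(\frac{\partial{y}}{\partial{x_{i}}} = \frac{1}{N}\); a quick check (my own sketch):

import torch

x = torch.randn(4, requires_grad=True)
x.mean().backward()
print(x.grad)    # tensor([0.2500, 0.2500, 0.2500, 0.2500])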
Convolution is equivalent to Unfold + matrix multiplication + Fold (or a view to the output shape).
https://colab.research.google.com/drive/10aVzydfnBYWJKXe5SCTxGx0t529kufRd
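A sketch of that equivalence for a Conv2d, following the example in the nn.Unfold documentation:

import torch
import torch.nn.functional as F

inp = torch.randn(1, 3, 10, 12)
w = torch.randn(2, 3, 4, 5)                      # (out_channels, in_channels, kH, kW)

inp_unf = F.unfold(inp, (4, 5))                  # extract sliding local blocks
out_unf = inp_unf.transpose(1, 2).matmul(w.view(w.size(0), -1).t()).transpose(1, 2)
out = out_unf.view(1, 2, 7, 8)                   # fold back (here just a view)

print((F.conv2d(inp, w) - out).abs().max())      # ~0, up to floating point error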
Conv1d and nn.Linear
1x1 Conv2d and Linear
Can Fully Connected Layers be Replaced by Convolutional Layers?
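One concrete check (my own sketch, not from the original post): a 1x1 Conv2d carrying the same weights as a Linear layer produces the same values at every spatial position.

import torch

linear = torch.nn.Linear(8, 16)
conv = torch.nn.Conv2d(8, 16, kernel_size=1)

# Copy the Linear parameters into the 1x1 convolution.
with torch.no_grad():
    conv.weight.copy_(linear.weight.view(16, 8, 1, 1))
    conv.bias.copy_(linear.bias)

x = torch.randn(2, 8, 5, 5)                                # (B, C, H, W)
y_conv = conv(x)                                           # (B, 16, H, W)
y_lin = linear(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # apply Linear per pixel

print(torch.allclose(y_conv, y_lin, atol=1e-6))            # True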
softmax implementation numerical stability
Imagine computing exp(1e20) + exp(1e10): both terms overflow in floating point. Subtracting the maximum value before exponentiating keeps the exponents non-positive and leaves the softmax result unchanged:
import numpy as np

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    # Subtracting the max makes every exponent <= 0, avoiding overflow.
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

scores = [3.0, 1.0, 0.2]
print(softmax(scores))