Invertible network
https://zhuanlan.zhihu.com/p/73426787
Related materials
check Normalizing Flows for Probabilistic Modeling and Inference
Normalizing Flows: An Introduction and Review of Current Methods
Invertible Models and Normalizing Flows: a retrospective (ICLR 2020 keynote slides)
Density estimation, variational inference, and sampling (generation) are essentially different but correlated tasks.
Prehistory: distribution estimation
Gaussianization brief read
Independent Component Analysis
NADE: Neural Autoregressive Distribution Estimation,JMLR16
Recent Advances
Auto-regressive model
NADE: Neural Autoregressive Distribution Estimation,JMLR16
Pixel Recurrent Neural Networks,ICML16
Furthermore, in contrast to previous approaches that model the pixels as continuous values (e.g., Theis & Bethge (2015); Gregor et al.(2014)), we model the pixels as discrete values using a multinomial distribution implemented with a simple softmax layer. Each channel variable xi,∗ simply takes one of 256 distinct values.
We have four types of networks: the PixelRNN based on Row LSTM, the one based on Diagonal BiLSTM, the fully convolutional one and the MultiScale one.
Have a detailed discussion about dequantizing the image data.
In the literature it is currently best practice to add real-valued noise to the pixel values to dequantize the data when using density functions (Uria et al., 2013). When uniform noise is added (with values in the interval [0, 1]), then the log-likelihoods of continuous and discrete models are directly comparable (Theis et al., 2015).
Evaluation details: For MNIST we report the negative log-likelihood in nats as it is common practice in literature. For CIFAR-10 and ImageNet we report negative log-likelihoods in bits per dimension. The total discrete log-likelihood is normalized by the dimensionality of the images (e.g., 32 × 32 × 3 = 3072 for CIFAR-10). These numbers are interpretable as the number of bits that a compression scheme based on this model would need to compress every RGB color value (van den Oord & Schrauwen, 2014b; Theis et al., 2015); in practice there is also a small overhead due to arithmetic coding.
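A quick worked example of this nats-to-bits-per-dimension conversion (the CIFAR-10 dimensionality is from the quote above; the NLL value is a made-up placeholder):

```python
import numpy as np

def nats_to_bits_per_dim(nll_nats, dims):
    """Convert a per-image negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (dims * np.log(2.0))

# CIFAR-10: 32 x 32 x 3 = 3072 dimensions per image.
dims = 32 * 32 * 3
nll_nats = 7000.0  # placeholder per-image NLL in nats
print(nats_to_bits_per_dim(nll_nats, dims))  # ~3.29 bits/dim
```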
PixelCNN:Conditional Image Generation with PixelCNN Decoders
MADE:Masked Autoencoder for Distribution Estimation,ICML15
MAF: Masked Autoregressive Flow for Density Estimation,NeurIPS17
Based on MADE
Differences from previous methods:
An early example is Gaussianization [4], which is based on successive application of independent component analysis. Enforcing invertibility with nonsingular weight matrices has been proposed [3, 29], however in such approaches calculating the determinant of the Jacobian scales cubicly with data dimensionality in general. Planar/radial flows [27] and Inverse Autoregressive Flow (IAF) [16] are models whose Jacobian is tractable by design. However, they were developed primarily for variational inference and are not well-suited for density estimation, as they can only efficiently calculate the density of their own samples and not of externally provided datapoints. The Non-linear Independent Components Estimator (NICE) [5] and its successor Real NVP [6] have a tractable Jacobian and are also suitable for density estimation.
Check the section “Relationship with Inverse Autoregressive Flow” for details.
The advantage of Real NVP compared to MAF and IAF is that it can both generate data and estimate densities with one forward pass only, whereas MAF would need D passes to generate data and IAF would need D passes to estimate densities.
Why? Because MAF defines the density through its forward pass (parallel, given x), while inverting it to sample is sequential over the D dimensions; IAF is the mirror image, and Real NVP's coupling layers are single-pass in both directions.
Have a detailed comparison between MADE, IAF, MAF (see the toy sketch below).
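To make the pass-count argument concrete, here is a minimal toy sketch; the `conditioner` below is a hypothetical stand-in for a MADE-style masked network, not code from any of these papers:

```python
import torch

D = 4
torch.manual_seed(0)

def conditioner(x):
    """Stand-in for a MADE-style masked conditioner: mu_i and log_sigma_i
    depend only on x_{<i} (here via a cumulative sum that excludes x_i)."""
    prefix = torch.cumsum(x, dim=-1) - x
    return prefix, 0.1 * prefix          # (mu, log_sigma)

# MAF density evaluation: one parallel pass, because the conditioner is fed
# the observed x directly: u = (x - mu(x)) * exp(-log_sigma(x)).
x = torch.randn(D)
mu, log_sigma = conditioner(x)
u = (x - mu) * torch.exp(-log_sigma)

# MAF sampling (equivalently, IAF density evaluation of an external point):
# D sequential passes, because x_i needs the already-generated x_{<i}.
u_sample = torch.randn(D)
x_sample = torch.zeros(D)
for i in range(D):
    mu, log_sigma = conditioner(x_sample)
    x_sample[i] = u_sample[i] * torch.exp(log_sigma[i]) + mu[i]
```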
(IAF)Improved Variational Inference with Inverse Autoregressive Flow,NIPS16
Preliminaries: PixelCNN, PixelRNN, MADE
The paper is able to exploit the recent advances in autoregressive models, particularly efficient inference through parallel computing, while avoiding the cumbersome sampling/inversion procedure of autoregressive models, which is quite ingenious.
\(z_{t} = \sigma_{t} \odot z_{t-1} + (1-\sigma_{t}) \odot m_{t}\) is parallelized across dimensions; this is the main difference from autoregressive models.
Perhaps the simplest special version of IAF is one with a simple step(T=1), and a linear autoregressive model. This transforms a Gaussian variable with diagonal covariance, to one with linear dependencies, i.e. a Gaussian distribution with full covariance. See appendix A for an explanation.
We found that results improved when reversing the ordering of the variables after each step in the IAF chain.
Why is sampling so much faster than PixelCNN? TODO
Fig. 5 in the supplement, TODO.
Temperature is important in generation; this point also inspires Glow:
Across all of the presented experiments, we use categorical sampling during decoding with a tempered softmax (Dahl et al., 2017). We adjust the concentration of the distribution we sample from with a temperature τ > 0 by which we divide the logits for the channel intensities.
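A minimal sketch of tempered categorical sampling as described in the quote (the temperature value is just an example):

```python
import torch

def sample_with_temperature(logits, tau=0.8):
    """Divide the logits by a temperature tau > 0, then sample categorically.
    tau < 1 concentrates the distribution; tau -> 0 approaches argmax."""
    probs = torch.softmax(logits / tau, dim=-1)
    return torch.multinomial(probs, num_samples=1)

logits = torch.randn(256)          # e.g. 256 channel-intensity logits
idx = sample_with_temperature(logits, tau=0.8)
```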
Block Neural Autoregressive Flow,UAI19
Axial Attention in Multidimensional Transformers
Normalizing Flow
Planar Flow: \(g(x) = x + u h(w^{T}x +b)\)
An extension based on Planar flow is Sylvester flow: \(g(x) = x + U h(W^{T}x +b)\) where U and W are DxM matrices
Radial Flows:
\[g(x) = x + \frac{\beta}{\alpha + ||x-x_{0}||}(x-x_{0})\]
NICE: NON-LINEAR INDEPENDENT COMPONENTS ESTIMATION,ICLRW15
Core idea of the coupling layer (they actually also propose a general coupling layer, but use the additive coupling layer for simplicity):
\[y_{1} = x_{1}\\ y_{2} = x_{2} + m(x_{1})\]
m can be as complex as you need. I like this idea; why did ICLR reject it? This paper is also honest compared with typical papers: it shows the simple intuition at the very beginning.
Examining the Jacobian, we observe that at least three coupling layers are necessary to allow all dimensions to influence one another. We generally use four.
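A minimal sketch of an additive coupling layer with its exact inverse, assuming a half/half split and an arbitrary MLP for m (hypothetical code, not the NICE implementation):

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """y1 = x1, y2 = x2 + m(x1); the log-det-Jacobian is exactly 0."""
    def __init__(self, dim_half, hidden=128):
        super().__init__()
        self.m = nn.Sequential(
            nn.Linear(dim_half, hidden), nn.ReLU(), nn.Linear(hidden, dim_half)
        )

    def forward(self, x1, x2):
        return x1, x2 + self.m(x1)

    def inverse(self, y1, y2):
        return y1, y2 - self.m(y1)

layer = AdditiveCoupling(dim_half=2)
x1, x2 = torch.randn(5, 2), torch.randn(5, 2)
y1, y2 = layer(x1, x2)
x1_rec, x2_rec = layer.inverse(y1, y2)
assert torch.allclose(x2, x2_rec, atol=1e-6)   # exact inversion up to float error
```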
The prior distribution can be a Gaussian or a logistic distribution. The prior can be expressed explicitly; see Section 3.4 (EXERCISE).
Difference from VAEs: Like the variational auto-encoders, the NICE model uses an encoder to avoid the difficulties of inference, but its encoding is deterministic. The log-likelihood is tractable and the training procedure does not require any sampling (apart from dequantizing the data).
SCALING intuition: as each additive coupling layer has unit Jacobian determinant (i.e. is volume preserving), their composition will necessarily have unit Jacobian determinant too. (TODO) The final diagonal scaling layer allows the learner to give more weight (i.e. model more variation) to some dimensions and less to others, similar in spirit to the recently popular attention mechanism.
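A minimal sketch of such a diagonal scaling layer and its log-det contribution (my own sketch, not the original code):

```python
import torch
import torch.nn as nn

class DiagonalScaling(nn.Module):
    """z = exp(log_s) * y; contributes sum(log_s) to the log-det-Jacobian,
    letting the model expand or contract individual dimensions."""
    def __init__(self, dim):
        super().__init__()
        self.log_s = nn.Parameter(torch.zeros(dim))

    def forward(self, y):
        return y * torch.exp(self.log_s), self.log_s.sum()

    def inverse(self, z):
        return z * torch.exp(-self.log_s)
```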
The INPAINTING application is interesting: a super simple projected gradient ascent is applied, based on the pre-trained joint probability of the hidden part H and the observed part O.
The change of variable formula for probability density functions is prominently used, check related works in this paper.
The NICE criterion is very similar to the criterion of the variational auto-encoder. More specifically, as the transformation and its inverse can be seen as a perfect auto-encoder pair,… check related work.TODO
Density estimation using Real NVP,ICLR17
Contributions: affine coupling layer, masked convolution, multi-scale architecture (squeeze and factor-out), and introducing moving-average batch normalization to this topic.
Training a normalizing flow does not in theory require a discriminator network as in GANs, or approximate inference as in variational autoencoders. If the function is bijective, it can be trained through maximum likelihood using the change of variable formula. This formula has been discussed in several papers including the maximum likelihood formulation of independent components analysis (ICA) [4, 28], gaussianization [14, 11] and deep density models [5, 50, 17, 3].
dive deeper into related works. TODO.
About the nature of maximum likelihood:
As mentioned in [62, 22], maximum likelihood is a principle that values diversity over sample quality in a limited capacity setting.
Glow: Generative Flow with Invertible 1x1 Convolutions
Summarized four merits of flow-based generative models.
ActNorm is similar to BN, but without computing the batch mean and standard deviation at every step; it only learns a scale and bias of size \(C\times 1\times 1\). The interesting thing is that you only know how to initialize them once the first batch of data arrives (data-dependent initialization).
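A minimal sketch of ActNorm with the lazy, data-dependent initialization described above (a sketch of the idea, not Glow's official code):

```python
import torch
import torch.nn as nn

class ActNorm(nn.Module):
    """Per-channel scale and bias of size C x 1 x 1, initialized from the first
    batch so that post-ActNorm activations have zero mean and unit variance."""
    def __init__(self, num_channels):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.initialized = False

    def forward(self, x):
        if not self.initialized:                       # lazy, data-dependent init
            with torch.no_grad():
                mean = x.mean(dim=(0, 2, 3), keepdim=True)
                std = x.std(dim=(0, 2, 3), keepdim=True)
                self.bias.copy_(-mean / (std + 1e-6))
                self.scale.copy_(1.0 / (std + 1e-6))
            self.initialized = True
        y = x * self.scale + self.bias
        logdet = x.shape[2] * x.shape[3] * torch.log(torch.abs(self.scale)).sum()
        return y, logdet
```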
The additive coupling layer proposed before is a special case of the affine coupling layer with s = 1 and a log-determinant of 0. Actually, NICE also proposed a general coupling layer, so what is the difference between Glow's affine coupling layer and the general coupling layer in NICE?
Invertible 1x1 convolution, parameterized via LU decomposition, TODO.
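A minimal sketch of the LU-parameterized invertible 1x1 convolution idea; for simplicity I constrain the diagonal of U to be positive via exp, which is a simplification of Glow's parameterization (Glow keeps the sign of the diagonal fixed separately):

```python
import torch
import torch.nn.functional as F

def invertible_1x1_conv(x, P, L, U, log_s):
    """W = P @ L @ (U + diag(exp(log_s))), with L unit-lower-triangular and U
    strictly upper-triangular, so log|det W| = sum(log_s): O(C) instead of O(C^3)."""
    C = x.shape[1]
    L_ = torch.tril(L, -1) + torch.eye(C)             # unit lower-triangular part
    U_ = torch.triu(U, 1) + torch.diag(torch.exp(log_s))
    W = P @ L_ @ U_
    y = F.conv2d(x, W.view(C, C, 1, 1))
    logdet = x.shape[2] * x.shape[3] * log_s.sum()    # per-image log-det
    return y, logdet

# Usage: P is a fixed permutation matrix; L, U, log_s are learned parameters.
C = 8
P = torch.eye(C)[torch.randperm(C)]
L, U, log_s = torch.randn(C, C), torch.randn(C, C), torch.zeros(C)
y, logdet = invertible_1x1_conv(torch.randn(2, C, 16, 16), P, L, U, log_s)
```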
Temperature T is vital in likelihood-based generative models.
RevNets:The Reversible Residual Network: Backpropagation Without Storing Activations
Intuition: they present the Reversible Residual Network (RevNet), a variant of ResNets where each layer's activations can be reconstructed exactly from the next layer's. Therefore, the activations for most layers need not be stored in memory during backpropagation.
i-revnet’s comment:
RevNets illustrate how to build invertible ResNet-type blocks that avoid storing intermediate activations necessary for the backward pass. However, RevNets still employ multiple non-invertible operators like max-pooling and downsampling operators as part of the network. As such, RevNets are not invertible by construction. In this paper, we show how to build an invertible type of RevNet architecture that performs competitively with RevNets on Imagenet, which we call i-RevNet for invertible RevNet.
TODO: how to map the input to a categorical output (softmax)? Need to check the code.
How to do downsampling? Spatial pooling, as in traditional methods.
Note that unlike residual blocks, reversible blocks must have a stride of 1 because otherwise the layer discards information, and therefore cannot be reversible. Standard ResNet architectures typically have a handful of layers with a larger stride. If we define a RevNet architecture analogously, the activations must be stored explicitly for all non-reversible layers.
Splitting is done along the channel dimension.
Check footnote 2 on page 4; you can feel that the grid search is labor-intensive.
i-REVNET: DEEP INVERTIBLE NETWORKS,ICLR18
smart idea: It is widely believed that the success of deep convolutional networks is based on progressively discarding uninformative variability about the input with respect to the problem at hand. This is supported empirically by the difficulty of recovering images from their hidden representations, in most commonly used network architectures. In this paper we show via a one-to-one mapping that this loss of information is not a necessary condition to learn representations that generalize well on complicated problems, such as ImageNet.
The design is similar to the Feistel cipher diagrams (Menezes et al., 1996) or a lifting scheme (Sweldens, 1998), which are invertible and efficient implementations of complex transforms like second generation wavelets.
In this way, we avoid the non-invertible modules of a RevNet (e.g. max-pooling or strides) which are necessary to train them in a reasonable time and are designed to build invariance w.r.t. translation variability.
Downsampling is motivated by the sub-pixel convolution layer; check Fig. 2, which is quite easy to understand. It is a kind of inverse “pixel shuffle”.
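A minimal sketch of this invertible downsampling (“squeeze” / inverse pixel shuffle), which trades a 2x2 spatial block for 4x the channels without losing information:

```python
import torch
import torch.nn.functional as F

def squeeze(x, factor=2):
    """(N, C, H, W) -> (N, C*factor**2, H//factor, W//factor), exactly invertible."""
    return F.pixel_unshuffle(x, factor)

def inv_squeeze(x, factor=2):
    return F.pixel_shuffle(x, factor)

x = torch.randn(1, 3, 8, 8)
assert torch.equal(inv_squeeze(squeeze(x)), x)   # no information is lost
```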
coefficients=#params….
Why is the parameter count of the injective version (a) much larger than the bijective version (b)? Because for (a) the channel numbers with increasing depth are 48, 192, 768 and 3072, while for (b) they are 24, 96, 384, 1536.
“ANALYSIS OF THE INVERSE”, experimental parts:
- This indicates Φ linearizes the space locally in a considerably smaller subspace compared to the original input dimensionality, since the original dimension may be as large as 10^{8}. Still need to dig in here.
- “progressive linear separation and contraction” is inspired by Building a Regular Decision Boundary with Deep Networks,CVPR17.
mutual information is preserved because of invertibility:
In this paper, the authors propose deep architecture that preserves mutual information between the input and the hidden representation and show that the loss of information can only occur at the final layer.
An important doubt: why is such a model desirable?
The core question we answer is if the success of deep convolutional networks is based on progressively discarding uninformative variability, which is a wide standing believe in the CV and ML community. We show this does not have to be the case, which has been acknowledged as “important”, “interesting” and “thought-provoking” by all reviewers. Thus, the invertibility property is desirable for understanding the success of deep learning better and shed light on some of the necessities for it to work well. From a practical point of view, invertible models are useful for feature visualization [1,2,3] and possibly useful to overcome difficulties in upsampling/decoding pixel-wise tasks that are still quite challenging [4]. Further, lossless models might be a good candidate for transfer learning.
Section 5 shows that even when using either an SVM or a Nearest Neighbor classifier on n extracted features from a layer in the network, both classifiers progressively improve with deeper layers. When the d first principal components are used to summarize the n extracted features, the SVM and NN classifier performs better when d is bigger. This shows that the deeper the network gets, the more linearly separable and contracted the learned representations are.
In the conclusion, the authors state the following: “The absence of loss of information is surprising, given the wide believe, that discarding information is essential for learning representations that generalize well to unseen data”. Indeed, the authors have succeeded in showing that this is not necessarily the case. However, the loss of information might be necessary to generalize well on unseen data and at the same time minimize the parameter budget for a given classification task.
Benchmarking Invertible Architectures on Inverse Problems
Identifying through Flows for Recovering Latent Representations,ICLR20
Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings,ICLR20
Set Flow: A Permutation Invariant Normalizing Flow,Arxiv1909
Multi-variate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows,Arxiv2002
Graph Normalizing Flows,Arxiv1905
Your classifier is secretly an energy based model and you should treat it like one,ICLR20,oral
This paper advocates the use of energy based models (EBMs) to help realize the potential of generative models on downstream discriminative problems.
Variational autoencoders and nonlinear ICA: A unifying framework,AISTAT20
Preliminary work: Nonlinear ICA Using Auxiliary Variables,AISTAT19
check conclusion part.
The framework of variational autoencoders allows us to efficiently learn deep latent-variable models, such that the model’s marginal distribution over observed variables fits the data. Often, we’re interested in going a step further, and want to approximate the true joint distribution over observed and latent variables, including the true prior p(z) and posterior p(z|x) distributions over latent variables. This is known to be generally impossible due to unidentifiability of the model.
The VAE model actually learns a full generative model \(p_{\theta}(x,z) = p_{\theta}(x|z)p_{\theta}(z)\) and an inference model \(q_{\theta}(z|x)\) that approximates its posterior \(p_{\theta}(z|x)\) The problem is that we generally have no guarantees about what these learned distributions actually are: all we know is that the marginal distribution over x is meaningful (Eq. 3). The rest of the learned distributions are, generally, quite meaningless.
Almost no literature exists on achieving this goal. A pocket of the VAE literature works towards the related goal of disentanglement, but offers no proofs or theoretic guarantees of identifiability of the model or its latent variables.
The prior on the latent variables pθ(z|u) is assumed to be conditionally factorial, where each element of zi ∈ z has a univariate exponential family distribution given conditioning variable u. To this end, in practice we choose the prior pθ(z|u) to be a Gaussian location-scale family, which is widely used with VAE.
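A minimal sketch of such a conditionally factorial Gaussian location-scale prior p(z|u), with the per-dimension mean and log-scale produced from u by a small network (names and architecture are hypothetical, not the paper's code):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class ConditionalFactorialPrior(nn.Module):
    """p(z|u) = prod_i N(z_i | mu_i(u), sigma_i(u)); each z_i is univariate given u."""
    def __init__(self, u_dim, z_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(u_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))

    def log_prob(self, z, u):
        mu, log_sigma = self.net(u).chunk(2, dim=-1)
        return Normal(mu, log_sigma.exp()).log_prob(z).sum(-1)
```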
combine with self-attention
When and How Can Deep Generative Models be Inverted?,Arxiv2006
Riemannian Continuous Normalizing Flows,Arxiv2006
Disentanglement by Nonlinear ICA with General Incompressible-flow Networks (GIN),ICLR20
repeat
Categorical Reparameterization with Gumbel-Softmax,ICLR17
Gaussianization Flows,Arxiv2003
interesting.
Flow Contrastive Estimation of Energy-Based Models,Arxiv1912
ICE-BeeM: Identifiable Conditional Energy-Based Deep Models,Arxiv2002
Analyzing Inverse Problems with Invertible Neural Networks,ICLR19
intuition:
INNs circumvent a fundamental difficulty of learning inverse problems: Defining a sensible supervised loss for direct posterior learning is problematic since it requires prior knowledge about that posterior’s behavior, constituting a kind of hen-and-egg problem. If the loss does not match the possibly complicated (e.g. multimodal) shape of the posterior, learning will converge to incorrect or misleading solutions.
Related-work part is quite intriguing.
A few minor changes:
if the dimension D is small, but a complex transformation has to be learned, we find it advantageous to pad both the in- and output of the network with an equal number of zeros. This does not change the intrinsic dimensions of in- and output, but enables the network’s interior layers to embed the data into a larger representation space in a more flexible manner. Secondly, we insert permutation layers between reversible blocks, which shuffle the elements of the subsequent layer’s input in a randomized, but fixed, way. This causes the splits u = [u1, u2] to vary between layers and enhances interaction among the individual variables. Kingma and Dhariwal (2018) use a similar architecture with learned permutations
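A minimal sketch of such a randomized-but-fixed permutation layer (my own sketch, not the paper's code); its Jacobian is a permutation matrix, so the log-det is 0:

```python
import torch
import torch.nn as nn

class FixedPermutation(nn.Module):
    """Shuffle feature dimensions in a random but fixed way; log-det-Jacobian is 0."""
    def __init__(self, dim):
        super().__init__()
        perm = torch.randperm(dim)
        self.register_buffer("perm", perm)
        self.register_buffer("inv_perm", torch.argsort(perm))

    def forward(self, x):
        return x[:, self.perm]

    def inverse(self, y):
        return y[:, self.inv_perm]
```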
As a side effect, our results confirm the findings of others that the restriction to coupling layers does not noticeably reduce the expressive power of the network.
In the appendix, Sec. 1, we prove that Lx is guaranteed to be zero when the forward losses Ly and Lz have converged to zero. Thus, incorporating Lx does not alter the optimum, but improves convergence in practice.
The two toy datasets are interesting.
How is the 8-Gaussian toy dataset used?
Lx: MMD; Ly: MCE; Lz: MMD.
How is MMD implemented?
The supplement needs more reading.
The baselines need more in-depth reading.
Sylvester Normalizing Flows for Variational Inference,UAI18
BayesFlow: Learning complex stochastic models with invertible neural networks,Arxiv2003
Do Deep Generative Models Know What They Don’t Know?,ICLR19
FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models,ICLR19
TODO: how to get vector field in toy_demo
Structured Output Learning with Conditional Generative Flows,AAAI20
Direct, intuitive, done. Based on a binary segmentation task.
PointFlow: 3D Point Cloud Generation with Continuous Normalizing Flows,ICCV19
C-Flow: Conditional Generative Flow Models for Images and 3D Point Clouds,ICCV19
Hybrid Models with Deep and Invertible Features,ICML19
We are unaware of any work that uses normalizing flows as the generative component of a hybrid model. The most related work is the class conditional variant of Glow (Kingma& Dhariwal, 2018, Appendix D).
Where do Eq. (3) and Eq. (10) come from?
Invert to Learn to Invert,NIPS19
Solves inverse problems iteratively, based on the previous work RIM (Recurrent Inference Machines).
Motivated by normalizing flows: apply reversible layers to RIM.
Based on fastMRI task.
ACFlow: Flow Models for Arbitrary Conditional Likelihoods,ICML20
Invertible Zero-Shot Recognition Flows,ECCV20
A Disentangling Invertible Interpretation Network for Explaining Latent Representations,Arxiv2004
IIR Invertible Image Rescaling,ECCV20,oral
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation,Arxiv1308
Integer Discrete Flows and Lossless Compression,NeurIPS19
Intuition: the latent variable is discrete?
invertible interpretability
Inverting Visual Representations with Convolutional Networks,CVPR16
Summary
discrete or continuous
In Glow, they mention that if x is discrete data, the log-likelihood objective is simply:
\[L(D) = \frac{1}{N} \sum_{i=1}^{N} -\log p_{\theta}(x^{(i)})\]
If x is continuous data (natural images are in this case, therefore we need dequantization; check PixelRNN), with \(\tilde{x}\) the dequantized data:
\[L(D) = \frac{1}{N} \sum_{i=1}^{N} -\log p_{\theta}(\tilde{x}^{(i)}) + c\]
Prior choice summary
The essential problem is how to obtain p(z) given the value of z; a necessary prerequisite is knowing the distribution type of z!
In realNVP, they set p(x) to be an isotropic unit norm Gaussian.
In NICE, the prior distribution p(x) can be a Gaussian distribution:
\[\log(p(x)) = -\frac{1}{2} (x^{2} + \log(2\pi))\]
or a logistic distribution:
\[\log(p(x)) = -\log(1+\exp(x)) - \log(1+\exp(-x))\]
They tend to use the logistic distribution as it tends to provide a better-behaved gradient.
How about Glow?
In PixelRNN, they use a discrete categorical distribution implemented with a softmax.
In Image-Transformer:
experiment with two settings of the distribution: a categorical distribution across each channel (van den Oord et al., 2016a) and a mixture of discretized logistics over three channels (Salimans et al.). [the categorical distribution is a special case of the multinomial distribution, in that it gives the probabilities of potential outcomes of a single drawing rather than multiple drawings.]
As suggested by NICE, the prior distribution is factorial: we can simply multiply the prior densities of every dimension (or pixel) and then take the logarithm, which is equivalent to summing up the log prior density of each dimension.
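A minimal sketch of this factorial prior computation, using the per-dimension logistic log-density from NICE above (a sketch, not the original code):

```python
import numpy as np

def logistic_log_density(h):
    """Per-dimension standard logistic log-density, as in NICE."""
    return -np.log1p(np.exp(h)) - np.log1p(np.exp(-h))

def factorial_log_prior(h):
    """Factorial prior: sum the per-dimension log densities (= log of their product)."""
    return logistic_log_density(h).sum(axis=-1)

h = np.random.randn(8, 784)           # e.g. latent codes for a batch of MNIST-sized inputs
print(factorial_log_prior(h).shape)   # (8,)
```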
In summary: the typical choices are the Gaussian and the logistic distribution.
Pixel processing summary
In realNVP, they mentioned this:
In order to reduce the impact of boundary effects(the boundary effects here can be seen in Figure 6 of PixelRNN.), we instead model the density of logit(α+(1−α) x/256 ), where α is picked here as .05.
In PixelRNN, they also discuss at length whether pixels should be treated as continuous or discrete variables. They treat them as discrete variables and model them with a softmax layer (see Figure 6). But since previous models are trained with continuous variables, they show how to compare with previous work:
All our models are trained and evaluated on the loglikelihood loss function coming from a discrete distribution. Although natural image data is usually modeled with continuous distributions using density functions, we can compare our results with previous art in the following way. In the literature it is currently best practice to add real-valued noise to the pixel values to dequantize the data when using density functions (Uria et al., 2013). When uniform noise is added (with values in the interval [0, 1]), then the log-likelihoods of continuous and discrete models are directly comparable (Theis et al., 2015). In our case, we can use the values from the discrete distribution as a piecewise-uniform continuous function that has a constant value for every interval [i, i + 1], i = 1, 2, . . . 256. This corresponding distribution will have the same log-likelihood (on data with added noise) as the original discrete distribution (on discrete data).
In MAF:
For both MNIST and CIFAR-10, we use the same preprocessing as by Dinh et al. [6]. We dequantize pixel values by adding uniform noise, and then rescale them to [0, 1]. We transform the rescaled pixel values into logit space by x 7→ logit(λ + (1 − 2λ)x), where λ= 10−6 for MNIST and λ= 0.05 for CIFAR-10, and perform density estimation in that space. In the case of CIFAR-10, we also augment the train set with horizontal flips of all train examples (as also done by Dinh et al. [6]).
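A minimal sketch of this dequantize-then-logit preprocessing (λ = 0.05 as in the CIFAR-10 setting quoted above; my own sketch):

```python
import numpy as np

def dequantize_and_logit(x_uint8, lam=0.05, seed=0):
    """x_uint8 in {0,...,255}: add uniform noise, rescale to [0, 1], then map
    through logit(lam + (1 - 2*lam) * x) and do density estimation in that space."""
    rng = np.random.default_rng(seed)
    x = (x_uint8.astype(np.float64) + rng.uniform(size=x_uint8.shape)) / 256.0
    p = lam + (1.0 - 2.0 * lam) * x
    return np.log(p) - np.log1p(-p)       # logit(p)

x = np.random.randint(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
z = dequantize_and_logit(x)
```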
In Image-Transformer, they didn’t consider the difference between continuous and discrete data.
In RNADE:
Pixels in this dataset can take a finite number of brightness values ranging from 0 to 255. Modeling discretized data using a real-valued distribution can lead to arbitrarily high density values, by locating narrow high density spike on each of the possible discrete values. In order to avoid this ‘cheating’ solution, we added noise uniformly distributed between 0 and 1 to the value of each pixel. We then divided by 256, making each pixel take a value in the range [0, 1].
Evaluation metrics summary
For bits-per-dimension, check page 12 of MAF.
If your goal is density estimation (Glow, MAF, MADE, Real NVP, Sylvester):
- minimize NLL(in nats)
If your goal is variational inference, you can evaluate the ELBO and NLL. To obtain the NLL, you need importance sampling.
Sylvester flow: In order to obtain estimates for the negative log likelihood we used importance sampling (as proposed in (Rezende et al., 2014)). Unless otherwise stated, 5000 importance samples were used.
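A minimal sketch of the importance-sampling NLL estimate: \( \log p(x) \approx \log \frac{1}{K}\sum_{k} \frac{p(x, z_{k})}{q(z_{k}|x)} \) with \(z_{k} \sim q(z|x)\). The `log_joint` and `q_dist` arguments are hypothetical stand-ins, not any paper's API:

```python
import math
import torch

def importance_sampled_nll(log_joint, q_dist, x, num_samples=5000):
    """Estimate -log p(x) with K importance samples from the approximate posterior.
    `log_joint(x, z)` returns log p(x, z); `q_dist` is a torch.distributions
    object for q(z|x) whose event shape is the latent dimension."""
    z = q_dist.sample((num_samples,))                 # (K, z_dim)
    log_w = log_joint(x, z) - q_dist.log_prob(z)      # (K,) log importance weights
    log_px = torch.logsumexp(log_w, dim=0) - math.log(num_samples)
    return -log_px
```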
Coding
toy example,TODO