CSE 493 - Deep Learning

Course Website

  • This course is heavily based on CS231n
  • This course will primarily focus on NLP and Computer Vision
  • The course has multiple parts:
    • Deep learning fundamentals
    • Practical training skills
    • Application
  • Vision has been one of the drivers of early DL, important to understand the history
  • Books can be helpful, but not necessary
  • Gradescope has automatic and private testing
  • Three psets, one midterm, and a final group project

Lecture 1 - March 28

  • 543 million years ago animals started to develop sight
  • Camera obscura illustrated in 1545 (Gemma Frisius) to study eclipses
    • Leonardo da Vinci had already described the pinhole camera principle earlier in the 16th century
  • 1959 Hubel & Wiesel found that we visually react to “edges” and “blobs”
    • Think of this as a “lower layer”
  • Larry Roberts is known as the “Father of Computer Vision” - wrote the first CV thesis
  • 1960’s MIT attempted to solve vision in a summer - this didn’t happen
  • David Marr introduced the idea of stages of visual recognition in the 1970’s
  • Edge detection became the next big push in CV
  • In the 1980’s expert systems became popular
    • These had heuristic rules made by domain “experts”
    • Unsuccessful and caused the second AI Winter
  • Irving Biederman came up with rules on how we view the world
    • 1: We must understand components (objects and relationships)
    • 2: This is only possible because we see so many objects while learning
      • A 6 year old child has seen 30,000 objects
  • We can detect animals in 150 ms
    • We detect predators and the color red even quicker!
  • Later-stage neurons allow us to detect complex objects or themes
  • In the 1990’s research started on real-world images
    • Algorithms were developed for grouping (1990’s) and matching (2000’s)
  • 2001 saw the first commercial success in CV
    • Facial detection, used ML and facial features
  • In the 2000’s feature development was all the rage
    • Histogram of oriented gradients (HOG) - how do the edges in pixels move?
  • We need an incredible amount of data - led to ImageNet
    • 2009, had 22K categories and 14M images
  • In 2012 AlexNet had breakthrough performance on ImageNet
    • By 2015 all attempts were DL and better than humans
  • In 1957 the Mark I Perceptron was created for character recognition
    • Manually twisted knobs to tune (adjusted weights)
    • Cannot be trained practically
  • Backpropagation was developed in 1986
  • LeNet is the architecture used in the Postal Service - 1998
    • AlexNet is essentially the same architecture, only bigger
  • DL was used in the early 2000’s to compress images
  • Everything is homogenized now
    • Transformers and backprop are the norm
    • Data and compute are the differentiators
    • Domains change, but core is often the same
  • Hinton, Bengio, and LeCun won the Turing award in 2018
  • Deep learning is its own course because of incredible growth

Lecture 2 - March 30

  • Image classification (IC) is a core task in CV
  • There are many challenges related to computer vision:
    • Viewpoint variation
    • Illumination
    • Background clutter
    • Occlusion
    • Deformation
    • Intraclass variation
    • It is very difficult to implement an image classifier as “normal” software
      • “old” AI
    • Data-driven paradigm is better
      1. Use datasets to train a model
        • MNIST
        • CIFAR-10(0)
        • ImageNet
        • MIT Places
        • Omniglot
      2. Use ML to train a classifier
      3. Evaluate the classifier on new images
  • Nearest Neighbor classifier
    • Training: memorize all data and labels
    • Inference: predict label of “nearest” image
    • $O(1)$ train time, $O(n)$ inference time
    • It is a universal approximator with $n \to \infty$ data points
  • What is “nearest” (distance)?
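    • L1 norm (Manhattan): depends on the choice of coordinate axes, often the weaker default
    • L2 norm (Euclidean): unchanged by rotating the coordinate axes, usually the better default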
  • Hyperparameters are choices (configs) of the model
    • e.g. k, distance type
  • Finding hyperparams
    • We should use a train, val, and test ds
    • Cross validation: split data into folds - no dedicated val necessary
  • Curse of dimensionality: the number of data points needed to cover the space densely grows exponentially with the dimension of the data
  • Linear classifiers take on the form: $y = Wx + b$
    • $b$ is often omitted by instead appending a one to the data vector (and a column to $W$)
    • Parametric approach
    • Learns a template and decision boundary
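
The nearest neighbor classifier from this lecture can be sketched in a few lines of NumPy. This is a minimal sketch, not the course's reference code: the class name `NearestNeighbor`, the `metric` argument, and the toy data are all assumptions made for illustration.

```python
import numpy as np

class NearestNeighbor:
    """k = 1 nearest neighbor classifier (illustrative sketch)."""

    def train(self, X, y):
        # "Training" is O(1): just remember the data and labels.
        self.X_train = np.asarray(X, dtype=np.float64)
        self.y_train = np.asarray(y)

    def predict(self, X, metric="l2"):
        X = np.asarray(X, dtype=np.float64)
        preds = np.empty(X.shape[0], dtype=self.y_train.dtype)
        for i, x in enumerate(X):
            diff = self.X_train - x  # broadcast against every stored example
            if metric == "l1":
                dists = np.abs(diff).sum(axis=1)           # Manhattan distance
            else:
                dists = np.sqrt((diff ** 2).sum(axis=1))   # Euclidean distance
            preds[i] = self.y_train[np.argmin(dists)]      # label of the closest example
        return preds

# Toy data: two 2-D clusters (made up for illustration)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
clf = NearestNeighbor()
clf.train(X_train, y_train)
preds = clf.predict(np.array([[0.2, 0.1], [4.8, 5.1]]))
```

Inference loops over every stored example, which is the $O(n)$ cost noted above; the distance metric is exactly the kind of hyperparameter you would pick on a validation set.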

Lecture 3 - April 4

  • It’s cool that we have a classifier, but how do we make it good?
  • Loss function: defines how good or bad our classifier (weights) is
  • Loss over dataset is the average of loss over examples
    • $x_i$, $y_i$, and $L_i$ are example data, output label, and example loss
  • Multiclass SVM loss:
    • Make sure that prediction of correct label is greater than all predictions for other labels by a given margin
    • The exact margin (delta) doesn’t matter because the weights will simply scale to compensate
    • Linear-ish loss
    • Has issues because it is a piecewise function: it hits exactly zero rather than approaching zero asymptotically
  • Squaring the loss penalizes a prediction more the more incorrect it is
  • Regularization is important, we always want the simplest model (no overfitting!)
    • Often done by adding a fraction of the L1 or L2 norm of the weights to the loss
    • Spread out the weights
  • Softmax classifier:
    • Pushes each value between 0 and 1
    • Ensures the sum of the softmaxed outputs is 1
    • $S_i = \frac{e^{y_i}}{\sum_j e^{y_j}}$
    • Scaling will determine how “peaky” the softmaxed scores are
  • NLL is the negative log of the predicted probability of the correct label
  • Cross entropy loss is the NLL of the softmax of the prediction of the correct label
  • Optimization is gradient descent
    • Find the partial derivatives of the loss with respect to the weights
    • We update each weight by the scaled negative of its partial derivative
  • Use a numeric gradient to gradient check
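
Cross entropy loss, its analytic gradient, a numeric gradient check, and one gradient descent step can be sketched together as below. This is a sketch under assumptions: the function names, the array sizes, and the learning rate of 0.1 are made up for illustration, and the scores are taken to be `W @ x` with no bias.

```python
import numpy as np

def softmax_cross_entropy(W, x, y):
    """Loss and analytic gradient for one example with scores = W @ x."""
    scores = W @ x
    scores = scores - scores.max()            # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    loss = -np.log(probs[y])                  # NLL of the correct class
    dscores = probs.copy()
    dscores[y] -= 1.0                         # d(loss)/d(scores) = probs - one_hot(y)
    dW = np.outer(dscores, x)
    return loss, dW

def numeric_gradient(f, W, h=1e-5):
    """Centered finite differences, one weight at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)
        W[idx] = old - h
        f_minus = f(W)
        W[idx] = old                          # restore the original weight
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # 3 classes, 4 features (arbitrary sizes)
x = rng.normal(size=4)
y = 1
loss, dW = softmax_cross_entropy(W, x, y)
dW_num = numeric_gradient(lambda W_: softmax_cross_entropy(W_, x, y)[0], W)
max_err = np.abs(dW - dW_num).max()

# One gradient descent step: move against the gradient, scaled by a learning rate
learning_rate = 0.1
W_new = W - learning_rate * dW
```

If `max_err` is tiny the analytic gradient is probably right; the numeric version is far too slow to train with, which is why it is only used as a check.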

Lecture 4 - April 6

  • Stochastic gradient descent uses minibatches to estimate the gradient
    • Typically pick your GPU first, then use the biggest minibatch size that fits on it
  • Linear classifiers aren’t very powerful (for many problems)
    • They only learn one template and linear decision boundaries
  • We can extract features, then fit a linear classifier
    • Non-linear transformations (features) are necessary
    • We have to manually create features…if only we could learn them…
  • Simple two-layer NN: $f = W_2 \max(0, W_1 x)$
    • $x \in \mathbb{R}^D, W_1 \in \mathbb{R}^{H \times D}, W_2 \in \mathbb{R}^{C \times H}$
    • We can think of this as learning templates, then learning combinations of templates
  • There are many different activation functions:
    • ReLU, Sigmoid, Leaky ReLU are (historically) popular ones
  • Activation functions are necessary because multiple consecutive linear transformations can be represented as just one transformation
  • Biological neurons are quite different!
  • We use computational graphs and backpropagation to differentiate (backwards pass)
  • Backprop is simply the chain rule - a lot
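
The two-layer network, the reason it needs a non-linearity, and a hand-written backward pass can be sketched as follows. This is an illustrative sketch: the sizes `D`, `H`, `C` are arbitrary, and the toy loss (the sum of the scores) is an assumption chosen only to keep the chain rule short.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 5, 8, 3              # input, hidden, and class dimensions (arbitrary)
x = rng.normal(size=D)
W1 = rng.normal(size=(H, D))
W2 = rng.normal(size=(C, H))

def two_layer(x, W1, W2):
    hidden = np.maximum(0, W1 @ x)   # ReLU non-linearity
    return W2 @ hidden

scores = two_layer(x, W1, W2)

# Without the non-linearity, the two layers collapse into the single matrix W2 @ W1
linear_stack = W2 @ (W1 @ x)
single_layer = (W2 @ W1) @ x

# Backprop for the toy loss L = sum(scores), one chain-rule step per graph node
dscores = np.ones(C)                             # dL/dscores
dW2 = np.outer(dscores, np.maximum(0, W1 @ x))   # dL/dW2
dhidden = W2.T @ dscores                         # dL/dhidden
dpre = dhidden * (W1 @ x > 0)                    # ReLU passes gradient only where input > 0
dW1 = np.outer(dpre, x)                          # dL/dW1
```

Each backward line multiplies the upstream gradient by a local gradient, which is all backprop on a computational graph does.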

Lecture 5 - April 11

  • Vector-vector backprop includes the Jacobian matrix for local operations
  • ImageNet helped people realize that data was super important!
  • Convolutional NNs are a way of including spatial information
  • CNNs are ubiquitous within vision
  • While FC nets flatten images, CNNs preserve spatial structure
  • Filters must be the same depth as the input image (i.e. 3 for RGB)
  • Slide over the entire image, flatten each part of the image it is above, dot product with filter
  • Stride is the offset between each filter-comparison
  • The number of filters is the number of activation maps (and the number of output channels)
  • ConvNet is simply many convolutional layers with activation functions in between
  • Earlier layers learn low-level features, later layers learn higher-level features
  • Ends with a simple FC classifier
  • There are interspersed pooling layers to downsample
  • Output size of an $N \times N$ image with filter size $F \times F$ is: $(N - F) / \text{stride} + 1$
  • Often images will be padded with zero pixels
  • 1x1 convolutional layers increase or reduce depth in channel space
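
The sliding-window description and the output size formula from this lecture can be checked with a deliberately slow convolution. This is a sketch, not an efficient implementation: the function names are made up, and the padding term $2P$ in the size formula is the standard extension of the formula above rather than something stated in these notes.

```python
import numpy as np

def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size for an n x n input and an f x f filter."""
    assert (n + 2 * pad - f) % stride == 0, "filter does not tile the input evenly"
    return (n + 2 * pad - f) // stride + 1

def conv2d_naive(image, filters, stride=1, pad=0):
    """image: (C, N, N); filters: (K, C, F, F) -> output: (K, out, out)."""
    C, N, _ = image.shape
    K, Cf, F, _ = filters.shape
    assert C == Cf, "filter depth must match input depth"
    padded = np.pad(image, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    out = conv_output_size(N, F, stride, pad)
    result = np.zeros((K, out, out))
    for k in range(K):                      # one activation map per filter
        for i in range(out):
            for j in range(out):
                r, c = i * stride, j * stride
                patch = padded[:, r:r + F, c:c + F]
                result[k, i, j] = np.sum(patch * filters[k])  # dot product with the filter
    return result

image = np.ones((3, 7, 7))          # toy 3-channel input
filters = np.ones((4, 3, 3, 3))     # 4 filters of size 3x3 with depth 3
activation_maps = conv2d_naive(image, filters, stride=2)
```

With a 7x7 input, 3x3 filters, and stride 2 the output is 3x3 per filter, and the four filters give four output channels.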

Lecture 6 - April 13

  • Training loop:
    • Sample a batch of data
    • Forward prop through the graph to get loss
    • Backprop through the graph to get the gradients
    • Update the params using the gradient
  • Before you train:
    • Activation functions
    • Preprocessing
    • Weight initialization
    • Regularization
    • Gradient checking
  • Training dynamics:
    • Babysitting the learning process
    • Param updates
    • Hyperparam updates
  • Evaluation:
    • Model ensembles
    • Test-time augmentation
    • Transfer learning
  • Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
    • Used to be popular
    • Saturated gradient at high positive and negative values
    • All gradients squashed by at least a factor of 4
    • Outputs aren’t zero-centered which means local gradient is always positive
      • Weights will now all change in the same direction
    • Computationally expensive!
  • Tanh function:
    • Zero centered
    • Still has the problem of saturating (vanishing) gradients
  • ReLU (rectified linear unit): $f(x) = max(0, x)$
    • Most common activation function
    • No saturation in positive region
    • Computationally efficient
    • Converges (6x) faster than sigmoid or tanh in practice
    • Not zero centered
    • Can lead to “dying” ReLUs
      • To prevent, often initialize biases with small positive value (e.g. 0.01)
  • Leaky ReLU: $f(x) = max(0.01x, x)$
    • Same as ReLU, but with a minor gradient in the negative region
    • Parametric Rectifier (PReLU) where scaling is a learnable parameter
  • Exponential Linear Units (ELU):
    • “Better”, but more expensive
  • Scaled exponential linear units (SELU):
    • Works better for larger networks, has a normalizing property
    • Can use without BatchNorm
    • Holy Heck Math
    • “Cool”
  • Maxout:
    • Max of multiple linear layers
    • Multiplies the number of parameters :(
  • Swish:
    • RL-created activation function
  • GeLU:
    • Add some randomness to ReLU, then take average to find this
      • “Data dependent dropout”
    • Main activation function used (esp. in transformers)
  • Use ReLU by default, GeLU for transformers, and try ReLU variants
  • We often zero-mean and normalize our data as preprocessing
  • Sometimes preprocessing involves PCA and whitening
  • ResNet subtracted the per-channel mean and divided by the per-channel standard deviation
  • A constant weight initialization leads to all the values being the same
  • Initializations that are too large or too small lead to extreme saturation or clustered outputs
  • “Xavier” initialization: $std = \frac{1}{\sqrt{D_{in}}}$
    • Good with Tanh
    • Attempts to keep output variance similar to input variance
  • “Kaiming” initialization: $std = \sqrt{\frac{2}{D_{in}}}$
    • Works well with ReLU!
  • Batch Normalization:
    • “Things break when inputs don’t have zero mean and $std = 1$” - “Why not just force that?”
    • Subtract the batch mean, then divide by the square root of the batch variance
    • It is differentiable!
    • We place these before each non-linearity
    • We also have two learned parameters, gamma and beta corresponding to scaling and shifting
    • We keep a running average of the batch mean and variance during training, to use at test time
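
Kaiming initialization and a batch norm forward pass can be sketched together. This is a minimal sketch under assumptions: the `running` dictionary, the `momentum` value of 0.9, `eps`, and the layer sizes are all illustrative choices, not values from the lecture.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running, train=True, momentum=0.9, eps=1e-5):
    """x: (N, D). `running` is a dict holding the running mean and variance."""
    if train:
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        # Track running statistics so they can be used at test time
        running["mean"] = momentum * running["mean"] + (1 - momentum) * mean
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        mean, var = running["mean"], running["var"]
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learned scale and shift

rng = np.random.default_rng(0)
D_in, D_out = 256, 128
# Kaiming initialization for a layer followed by ReLU: std = sqrt(2 / D_in)
W = rng.normal(size=(D_in, D_out)) * np.sqrt(2.0 / D_in)
x = rng.normal(size=(64, D_in))
pre_activation = x @ W + 3.0                  # deliberately shifted away from zero mean
running = {"mean": np.zeros(D_out), "var": np.ones(D_out)}
out = batchnorm_forward(pre_activation, np.ones(D_out), np.zeros(D_out), running)
out_test = batchnorm_forward(pre_activation, np.ones(D_out), np.zeros(D_out),
                             running, train=False)
```

The `train` flag is the part that is easy to get wrong: training mode uses batch statistics, test mode uses the running averages.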

Lecture 7 - April 18

  • BatchNorm
    • Becomes a linear layer during inference
    • “Resets” the standard deviation and mean after change from linear layers
    • Allows higher learning rate and better gradient flow
    • Acts as regularization during training
    • Different behavior during training and testing! Can be a common bug!
    • For CNNs, batch norm computes one mean and variance per channel, over the batch and spatial dimensions (statistics have shape $1 \times C \times 1 \times 1$)
  • LayerNorm:
    • Normalize across each example!
  • Instance normalization is for CNN across height and width
    • Good for segmentation and detection
  • There are a ton of ways to normalize in similar ways
  • Vanilla gradient descent: calculate gradient, move towards the negative gradient
    • Issues related to getting stuck in saddle points
    • Stochastic, so descent can be extremely noisy
  • Momentum keeps optimization moving in the same direction:
    • $v_{t+1} = \rho v_t + \nabla f(x_t)$ then update using $v_{t+1}$
    • Common values for $\rho$ are $0.9$ and $0.99$
    • Momentum will often overshoot minima
  • Nesterov momentum: take the velocity update before taking derivative
    • Good, but we have to update weights twice, so in practice we use a reformulation in terms of $\tilde{x}_t = x_t + \rho v_t$
  • SGD + momentum or Nesterov are often what is used in practice
  • AdaGrad sums squared gradients, dividing each gradient by the square root of the sum of its squared derivatives:
    • This leads to weights with large and small gradients getting updated slower and quicker, respectively
    • It also quickly decays all learning rates to zero - a problem for neural networks
  • RMSProp (Leaky AdaGrad) is AdaGrad but with a decay to previous squared grads - similar to a running average
    • Keeps step sizes relatively constant
    • Common decay rate is $0.99$
    • Doesn’t overshoot as much
    • More computationally expensive
  • Adam: combine the best of momentum and RMSProp
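      import numpy as np

      # Assumes x, compute_gradient, learning_rate, beta1, beta2, and num_iterations are already defined
      first_moment = 0   # running average of the gradient (momentum)
      second_moment = 0  # running average of the squared gradient (RMSProp)
      for t in range(1, num_iterations + 1):  # t starts at 1 so the bias correction is defined
          dx = compute_gradient(x)
          first_moment = beta1 * first_moment + (1 - beta1) * dx
          second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
          # Both moments start at zero, so rescale them to remove the early bias
          first_unbias = first_moment / (1 - beta1 ** t)
          second_unbias = second_moment / (1 - beta2 ** t)
          x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)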
    • There is a bias correction to prevent large early step sizes from instantly destroying initialization
    • Typical hyperparams are beta1 = 0.9 and beta2 = 0.999
  • L2 regularization and weight decay are the same when using SGD (with momentum), but different for Adam, AdaGrad, RMSProp
  • AdamW: the go-to optimizer
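    • Adam with weight decay decoupled from $L_2$ regularization: the decay is applied directly to the weights instead of being added to the gradient
    • Lets the user choose whether the weight penalty passes through the moment estimates ($L_2$) or not (decoupled decay)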
  • There are second-order optimization techniques that use the Hessian (curvature) as well as the gradient to choose the step
    • This is typically intractable due to the number of parameters
    • AdaGrad is actually a special case of a second-order optimization technique where we assume the Hessian is diagonal
    • Second-order optimization is not typically used in practice
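
The momentum and RMSProp updates from this lecture can be compared on a toy problem. This is a sketch under assumptions: the quadratic loss, the step counts, and the learning rates are made up, and the momentum update follows the $v_{t+1} = \rho v_t + \nabla f(x_t)$ form used in these notes.

```python
import numpy as np

def grad(x):
    # Gradient of the toy loss f(x) = 0.5 * (10 * x[0]**2 + x[1]**2)
    return np.array([10.0 * x[0], x[1]])

def sgd_momentum(x, lr=0.01, rho=0.9, steps=200):
    v = np.zeros_like(x)
    for _ in range(steps):
        v = rho * v + grad(x)      # v_{t+1} = rho * v_t + grad f(x_t)
        x = x - lr * v             # step along the accumulated velocity
    return x

def rmsprop(x, lr=0.01, decay=0.99, steps=200, eps=1e-7):
    sq = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)
        sq = decay * sq + (1 - decay) * g * g    # leaky running sum of squared grads
        x = x - lr * g / (np.sqrt(sq) + eps)     # per-weight scaled step
    return x

x0 = np.array([1.0, 1.0])
x_momentum = sgd_momentum(x0)
x_rmsprop = rmsprop(x0)
```

Note that RMSProp divides out the scale of each gradient, so the steep and shallow directions of this loss move at the same pace; that is the "relatively constant step size" behavior described above.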

Lecture 8 - April 20

  • Learning rate decay: scaling down the learning rate over time
    • Necessary to decrease loss beyond a certain point
    • Hyperparameter choice is extremely important
  • There are many learning rate schedulers:
    • “Step” down the learning rate after a fixed number of epochs
      • Leads to massive drops in loss followed by plateaus repeatedly
    • Cosine learning rate decay: $\alpha_t = \frac{1}{2}\alpha_0(1+\cos(\frac{t\pi}{T}))$
      • $\alpha_0$: initial learning rate, $\alpha_t$ learning rate at step $t$, $T$ total number of steps
      • Constant decrease in loss over time
    • Linear learning rate decay
    • Inverse square root decay
    • Constant learning rate (no decay)
  • Learning rate warmup: spending first iterations increasing learning rate
    • A large initial learning rate will cause our weight initialization to blow up
    • Linearly increase learning rate over about 5 epochs (~5000 iterations)
    • If you increase batch size by $N$, also increase learning rate by $N$
  • Validation data is a good way of paying attention to the model
  • Early stopping is ending model training when validation loss plateaus
    • Often for ~5 epochs
  • Training multiple models and then averaging results is a model ensemble:
    • Exhibits about 2% better performance in the real world
    • Different models overfit to different parts of the dataset, it is averaged out
  • Regularization techniques previously covered: L2, L1, weight decay
  • Dropout: every forward pass, set activations (neurons) to zero with probability $p$
    • Increases redundancy across the entire network, prevents some overfitting
    • Common value of $p = 0.5$
    • An interpretation of dropout is an ensemble of models with shared parameters
    • At test time, multiply activations by the keep probability $1 - p$, or instead divide by $1 - p$ during training so test time needs no change (inverted dropout)
  • A common pattern with regularization is adding randomness during training and averaging out during testing
    • Seen in batch norm and dropout for example
  • Data augmentation is a common form of regularization:
    • Transforming the input in such a way that the label is the same
  • Some example image augmentation techniques:
    • Flipping an image horizontally
    • Random crops and scales of an image
      • Testing: average a fixed set of crops of a test image
    • Change contrast or color of images
    • Stretching or contorting images
  • We are now training models to learn good data augmentation techniques
  • DropConnect: set individual weights (connections) to zero (instead of activations as in dropout)
  • Fractional pooling pools random regions of each image
    • Testing: average predictions over multiple regions
  • Stochastic depth: skip entire layers in a network using residual connections
  • Cutout: randomly set parts of image to average image color
    • Good for small datasets, not often used on large ones
  • Mixup: blend both training images and training labels by an amount
    • Why does it work? Who cares!
  • CutMix: replace random crops of one image with another while combining labels
  • Label smoothing: set target label to: $1 - \frac{K-1}{K}\epsilon$ and other labels to: $\frac{\epsilon}{K}$ for $K$ classes
  • In practice:
    • Use dropout for FC layers
    • Using batchnorm is always good
    • Try Cutout, MixUp, CutMix, Stochastic Depth, Label Smoothing for (a little) extra performance
  • Grid search is okay for hyperparameter search
    • Use log-linear values
  • Random search is better:
    • Use log-uniform randomness in given range
    • Likely because some hyperparameters matter more than others, so there are more opportunities to get them exact than with grid search
  • How to choose hyperparameters without Google-level compute:
    1. Check initial loss: sanity check, turn off weight decay
    2. Overfit on a small sample (5-10 minibatches)
      • Make some architectural decisions
      • Loss not decreasing? LR too low, bad weight init
      • Loss NaN or Inf? LR too high, also bad weight init
    3. Search for learning rate for ~100 iterations
      • Good LR to think about: 1e-1 to 1e-4
    4. Coarse search: add other hyperparams and train a few models for ~1-5 epochs
      • Good weight decay to try: 1e-4, 1e-5, 0
    5. Pick best models from 4., train for ~10-20 epochs without learning rate decay
    6. Look at learning curves and adjust
      • Might need early stopping, adjust regularization, larger model, or keep going
      • Flat start to loss graph means bad initialization
      • When loss plateaus, add a learning rate scheduler (cosine!)
      • Use a “command center” like Weights and Biases (or add your own!)
    7. Return to step 5!
  • Linear classifiers are easy to visualize because they have only one filter (template) per class
  • Early layers in deep learning networks are also easy to visualize, these are filters
    • Often edge detection
  • The last layers are an embedded representation of the inputs
    • The embeddings are much better for KNN
  • Google search:
    • Embeds search into a 100 or 200 dimensional vector, then runs KNN on a massive database
    • Needs massive compute however
  • We can use PCA or t-SNE to visualize the outputs of the last layer of a network
  • Good to use KNN rather than simply labels so we can see what is happening under the hood
  • We can visualize activation maps for CNNs
  • To find what activates neurons the best, run patches of images from the dataset through the model and see which ones output the full image’s label
  • Occlusion using patches is another way of visualizing which pixels matter, can be graphed for a cool image
  • Shapley values are a way of using multiple patches to determine important areas of the image
  • Backprop the class score down to the image pixels; the gradient magnitudes give a saliency map
    • Super good image segmentation (accidentally)
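
The warmup and cosine decay schedules described earlier in this lecture can be combined into one function. This is a sketch, not a library API: the function name, the choice to start the warmup at `base_lr / warmup_steps`, and the example values are assumptions.

```python
import math

def learning_rate_at(step, base_lr, total_steps, warmup_steps=0):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps: 0.5 * lr0 * (1 + cos(t * pi / T))
    t = step - warmup_steps
    T = total_steps - warmup_steps
    return 0.5 * base_lr * (1 + math.cos(math.pi * t / T))

# Learning rate for every step of a hypothetical 100-step run with 10 warmup steps
schedule = [learning_rate_at(s, base_lr=0.1, total_steps=100, warmup_steps=10)
            for s in range(101)]
```

Plotting `schedule` gives the familiar shape: a short ramp up, then a smooth half-cosine down to zero at the final step.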

Lecture 9 - April 25

  • Saliency maps allow us to view biases within misclassifications
    • Clamp gradients to only negative values
  • We can also “backprop” a gray image to make an example image
    • Gradient ascent and a heck of a lot of regularization
  • Adversarial examples:
    1. Start from an arbitrary image
    2. Pick an arbitrary class
    3. Modify the image to improve class scores
    4. Repeat!
  • Black box attacks:
    • Adding random noise to images and models get confused…
  • Supervised learning is insanely expensive:
    • Labeling ImageNet’s 1.4M images would cost more than $175,000
  • Unsupervised learning: model isn’t told what to predict
  • Self-Supervised learning: model predicts some naturally occurring signal in the data
    • The goal is to learn a cheap “pretext” task which learns important features
    • Target is something that is easy to compute
  • For example, start with autoencoder, then fine-tune
  • Three main types of pretext tasks:
    • Generative
      • Autoencoders, GANs
    • Discriminative
      • Contrastive
    • Multimodal
      • Input video, output audio
  • Pretext task performance is irrelevant
  • Often just toss a linear classifier on the back of the encoder
  • Generative self-supervised learning:
    • Generate some data from an example
  • Computers (NNs) are not rotation-invariant:
    • Allows us to predict rotation as a pretext task
  • The learned attentions are similar to what supervised learning finds
  • Predict relative patch locations:
    • Break image into 3x3 grid, pass in center square and an outer square and predict location
    • CNNs are size invariant
  • Jigsaw puzzle:
    • Reorder patches according to correct permutation
  • Inpainting:
    • Pass in image with missing patch, predict the missing patch
    • Adding adversarial loss increases image recreation quality
  • Image coloring:
    • Use “LAB” coloring, pass in L (grayscale) and predict AB (color)
  • Split-brain autoencoder:
    • Predict color from light, predict light from color
  • Video coloring:
    • Given colored start frame and grayscale images from then on, predict the colors
    • Uses attention mechanism
  • Contrastive learning:
    • Create many different examples from an original example
    • These examples are going to be in the same semantic space
    • Learn which examples are closer or further away from each other
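
The rotation pretext task from this lecture needs no human labels, because the label is just the rotation that was applied. A minimal sketch of building such a batch follows; the function name and the random stand-in images are assumptions for illustration.

```python
import numpy as np

def make_rotation_batch(images):
    """images: (N, H, W) square images -> (4N, H, W) rotated copies and labels 0..3."""
    rotated, labels = [], []
    for img in images:
        for k in range(4):                 # k quarter-turns: 0, 90, 180, 270 degrees
            rotated.append(np.rot90(img, k))
            labels.append(k)               # the "free" label is the rotation index
    return np.stack(rotated), np.array(labels)

rng = np.random.default_rng(0)
images = rng.random((2, 8, 8))             # stand-in for unlabeled images
x_rot, y_rot = make_rotation_batch(images)
```

A network trained to predict `y_rot` from `x_rot` has to pick up on object structure; afterwards the classifier head is thrown away and the encoder is what gets reused.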

Lecture 10 - April 27

  • Contrastive learning:
    • Get a reference example $x$
    • Create transformed or augmented examples from $x$, called $x^+$
    • Label all other examples $x^-$
    • Maximize $score(f(x), f(x^+))$ and minimize $score(f(x), f(x^-))$
    • Loss function is derived from softmax
      • “Every single instance is its own class”
    • Uses cosine similarity
      • $\frac{u^T v}{\|u\| \|v\|}$
    • Generate positive samples through simple data augmentation
  • SimCLR Framework:
    • Make two simple transformations from the example
    • Run each transformed example through a NN to get representation
    • Run each representation through a simple linear classifier or MLP
    • Maximize the cosine similarity of those outputs
  • Example transformations for images:
    • Random cropping
    • Random color distortion
    • Gaussian blur
  • Create a minibatch of $2N$ images by alternating the two transformed views of each example:
  • Run it through the model
  • Take the pairwise cosine similarity of the outputs, giving a $2N \times 2N$ matrix
  • The $(2k, 2k+1)$ and $(2k+1, 2k)$ scores should be positive, everything else should be negative
    • Diagonal will always be 1
    • See slides for illustration
  • SimCLR works extremely well with very large batch sizes (64,000+)
  • We don’t want to directly expose the representation to the loss function, so we use a MLP head
  • MoCo: Momentum Contrastive Learning
    • Keep a running queue of keys (negative examples) for all images in the batch
      • If we have 1000 examples and 2000 negatives, it is $1000 \cdot 2000$
    • Update the encoder only through the queries (reference images)
    • Makes the momentum encoder be much less computationally expensive
      • We update the momentum encoder via: $\theta_k \leftarrow m \theta_k + (1 - m) \theta_q$
      • Slowly aligns the two networks
    • Uses cross-entropy loss
      • Treats each negative as a class
      • One correct label (this comes from the two parallel transformations) and the incorrect labels (negatives) are from the queue
      • These don’t need to be run in parallel or there could be a concatenation
  • MoCo V2: Add a non-linear projection head and better data augmentation
  • DINO: do we need negatives? (very recent)
    • Reformulates contrastive learning as a distillation problem
    • Teacher model is like the momentum encoder
      • Running average of the student model
      • Sees a global view augmentation of the image
    • Student model only sees cropped augmentation of the image
  • DINO training tricks for the teacher:
    • Center the data by adding a mean
    • Sharpen the distribution towards a certain class - like a temperature
      • Has the effect of making the teacher be a bit of a classifier
  • DINO V2 released this week!
  • Contrastive Predictive Coding (CPC)
    • Contrastive: contrast between correct and incorrect sequences using contrastive learning
    • Predictive: model must predict future patterns based on current
    • Coding: model must learn useful feature vectors
  • We give the model context, then a correct continuation sequence and many incorrect possible sequences
  • First encode all images into vectors: $z_t = g_{enc}(x_t)$
  • Summarize all context into a context code: $c_t$ using an autoregressive model: $g_{ar}$
  • Compute InfoNCE loss between context $c_t$ and future code $z_{t+k}$ using scoring function:
    • $s_k(z_{t+k}, c_t) = z^T_{t+k} W_k c_t$
    • $W_k$ is a trainable matrix
  • CLIP is a contrastive learning model
  • We can sequentially process non-sequential data - think about how we observe an image
  • Variable length sequences are tough to work with for basic NNs
  • Recurrent Neural Networks contain an internal “state”, a summary or context of what’s been seen before
  • RNN formulation: $h_t = f_W(h_{t-1}, x_t)$ (repeat as necessary!)
  • Typical autoregressive loss function - run everything through, make predictions, then sum/mean the losses from those predictions
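
The SimCLR-style loss described earlier in this lecture (alternating views, a pairwise cosine similarity matrix, softmax over each row) can be sketched in NumPy. This is an illustrative sketch, not the SimCLR authors' code: the function name, the temperature of 0.5, and the random stand-in projections are assumptions.

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """z: (2N, D) projections where rows 2k and 2k+1 are two views of example k."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm, so dot product = cosine
    sim = (z @ z.T) / temperature                      # (2N, 2N) similarity matrix
    np.fill_diagonal(sim, -np.inf)                     # never contrast a row with itself
    n = z.shape[0]
    positives = np.arange(n) ^ 1                       # pair index: 0<->1, 2<->3, ...
    # Cross-entropy where the "correct class" of each row is its positive pair
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(n), positives].mean()

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 16))                        # 4 examples (N = 4)
# Nearly identical pairs stand in for a well-trained encoder
aligned = np.repeat(base, 2, axis=0) + 0.01 * rng.normal(size=(8, 16))
random_z = rng.normal(size=(8, 16))                    # unrelated pairs
loss_aligned = nt_xent_loss(aligned)
loss_random = nt_xent_loss(random_z)
```

The loss is low when the two views of each example land close together and far from everything else in the batch, which is why more negatives (bigger batches) give a stronger signal.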

Lecture 11 - May 2

  • Vanilla RNN learns three weight matrices:
    • Transforms input
    • Transforms context state (sum these last two then use tanh)
    • Transforms context state to output
  • Hidden representation is commonly initialized to zero
    • One could learn the initial hidden state, but it is not necessary
  • Sequence length is an assumption we often have to make during training
  • Encoder-decoder architecture for sequences encodes sequence into a representation, then decodes into a sequence from that representation
    • Think language translation, original attention paper
  • One hot vector is a vector of all zeros and a one corresponding to a single class
  • We use an embedding layer - computationally inexpensive (indexing), but keeps gradient flowing!
  • Test time - autoregressive model (I wrote about this!), similar to a phone’s text autocomplete
  • We truncate sequences, these act as minibatches
  • We get surprising emergent behaviors from sequence modeling
  • RNN advantages:
    • Can process sequential information
    • Can use information from many steps back
    • Symmetrical processes for each step
  • RNN disadvantages:
    • Recurrent computation is slow
    • Information is lost over a sequence in practice
    • Exploding gradients :(
  • Image captioning: combine RNNs and CNNs
    • Take a CNN and remove the classification head
    • Use the image representation as the first hidden representation in an RNN
    • Repeatedly sample tokens until the <END> token is produced
  • Question answering: use RNN and CNN to generate representations, learn a compression, then softmax across the language
  • Agents can learn instructions and actions to take in a language or image based environment (same principles as before)
  • Multilayer RNNs: stacking layer weights and adding depth
  • Vanilla RNN gradient flow: look up the derivation!
    • Gradients will vanish as the tanh squishes the gradient for each step
    • Even without tanh, this problem is repeated - gradients will explode or vanish
    • Note: normalization doesn’t work well with RNNs, active field of research
  • To combat exploding gradients we can use gradient “clipping”, scaling the gradient if its norm is too large
    • However, there is no good solution for vanishing gradients
  • LSTMs (Long Short Term Memory) help solve the problems within RNNs
    • LSTMs keep track of two values: a cell memory and a hidden representation
      • Long and short term memory!
    • Usually both initialized to zero
  • LSTMs compute four gate vectors (a stacked output of size $4h$) from the previous hidden state $h$ and the vector from below $x$
    • This can be done in parallel through matrix multiplication
    • Three of these outputs are passed through a sigmoid nonlinearity, the fourth is passed through a tanh
  • Cell memory: $c_t = f \odot c_{t - 1} + i \odot g$, $h_t = o \odot \tanh(c_t)$
    • The “info gate” (output of tanh) determines how much to write to cell: $g$
    • The “input gate” $i$ determines whether to write to cell
    • The “forget gate” $f$ determines how much to forget or erase from cell
    • The “output gate” $o$ determines how much to reveal cell at a timestep
  • LSTMs create a “highway” for gradients to flow back over time through the cell memory
  • LSTMs preserve information better over time, however there’s no guarantee
  • Residual (type) connections are very popular and widespread throughout deep learning!
  • Neural architecture search for LSTM-like architecture was popular in the past, but no more
  • GRUs (Gated Recurrent Unit) are inspired by LSTMs
    • Quite simple and common because of ease of training
  • LSTMs were quite popular until this year, however transformers have risen in popularity
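
A single LSTM timestep, following the gate equations from this lecture, can be sketched as below. This is a sketch under assumptions: the gate ordering inside the stacked matrix `W`, the sizes, and the weight scale are illustrative choices, not something fixed by the notes.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM timestep. W: (4H, D + H), b: (4H,)."""
    H = h_prev.shape[0]
    a = W @ np.concatenate([x, h_prev]) + b   # all four gates in one matmul
    i = sigmoid(a[0:H])          # input gate: whether to write to the cell
    f = sigmoid(a[H:2 * H])      # forget gate: how much of the cell to erase
    o = sigmoid(a[2 * H:3 * H])  # output gate: how much of the cell to reveal
    g = np.tanh(a[3 * H:4 * H])  # candidate values to write to the cell
    c = f * c_prev + i * g       # cell memory update
    h = o * np.tanh(c)           # hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 5
W = rng.normal(size=(4 * H, D + H)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)           # both states start at zero
for x in rng.normal(size=(7, D)):         # a toy sequence of 7 inputs
    h, c = lstm_step(x, h, c, W, b)
```

When the forget gate is near one and the input gate near zero, `c` is copied forward almost unchanged, which is the gradient highway through the cell memory.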

Lecture 12 - May 9

  • Recurrent image captioning is constrained by the size of the image representation vector
  • Attention: use different context vector at different timesteps
    • Use some function to get relationship scores within different elements in a vector
    • Softmax to normalize the scores
    • Use these scores to create output context vector (multiply by some value vector)
    • This is just scaled dot-product attention
  • Also works for interpretability: allows for bias correction as well
  • Attention Layer:
    • We matmul a query and key to create scores
    • Then softmax the scores to normalize
    • Multiply the scores by the values, then sum
  • Query, key, and value come from linear transformations
  • Notes:
    • We need to use masked attention for sequence problems
      • Important for autoregressive problems
      • We love torch.triu
    • We need to inject positional encodings, as attention layers have no built-in notion of position (permutation-equivariant)
      • Simply add positional vectors (often sinusoidal)
    • We also scale the dot-product by dividing by $\sqrt{d}$
  • Self attention doesn’t require separate inputs for queries, keys, and values; they all come from the same input
  • Multi-head self attention:
    • Split the dimensionality up, use attention for each part, then concatenate
  • One of the biggest issues: attention is an $n^2$ memory requirement
  • Transformers: sequence to sequence model (encoder-decoder)
    • Encoder:
      • Made up of multiple “encoder blocks”
      • Each encoder block has a multi-head self-attention block (with residual connection), layernorm, MLP (with residual connection), then a second layer norm
    • Decoder:
      • Made up of multiple “decoder blocks”
      • Masked multi-head self attention (residual), layernorm, cross attention (with encoder output and residual), layernorm, MLP (residual), second layernorm
  • CNNs are often replaced by transformers which split data up into patches
    • Good if you have a lot of data, but with small data you want to use CNNs
    • Vision transformer (ViT)
  • Image captioning can work now with purely transformer-based architectures
  • Transformer (size) history: (see slides)
    • Started with 12 layers, 8 heads, 65M params
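
The attention layer from this lecture (query-key scores, scaling by $\sqrt{d}$, optional causal mask, softmax, weighted sum of values) can be sketched in NumPy. This is a single-head sketch under assumptions: the function names and sizes are made up, and `np.triu` plays the role that `torch.triu` plays in PyTorch code.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)      # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv, causal=False):
    """x: (n, d_model). Returns (n, d_v) outputs and the (n, n) attention weights."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv             # three linear projections of one input
    scores = q @ k.T / np.sqrt(q.shape[-1])      # scaled dot product
    if causal:
        n = x.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
        scores = np.where(future, -np.inf, scores)          # hide future positions
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d_model, d_head = 4, 8, 6
x = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = self_attention(x, Wq, Wk, Wv, causal=True)
```

The `weights` matrix is $n \times n$, which is where the quadratic memory cost comes from; multi-head attention runs several of these on slices of the dimensionality and concatenates the outputs.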

Lecture 13 - May 11

  • LeNet, small network of convolutional layers
    • We pool to downsample images
  • AlexNet: bigger model
    • Model split across multiple GPUs for training
    • First use of ReLU
    • Data augmentation
    • Dropout 0.5
    • 7-model ensemble
  • VGGNet: smaller filters, deeper network
  • GoogLeNet: added multiple paths between layers (InceptionNet)
    • Later, 1x1 convolutions were added to reduce channel dimensions (reduce computational costs)
    • Also add auxiliary classification outputs earlier to keep gradient flow
  • ResNet: add residual connections between layers
    • Keeps gradients flowing
    • Also allows the model to learn the residual (the difference from its input)
    • All current networks start with a conv layer
    • Dropout, Kaiming init
  • ViT: Vision Transformer
    • Add a convolution in the first layer to create patches
    • Then just run through transformer blocks
    • ViT needs more data
    • Trained on a dataset called JFT-300M
      • ViT performs worse than ResNet on 10M images
    • Final layer is finetuned on ImageNet-1.5M
  • MLP-Mixer: all MLP architecture
    • Full of Mixer Layers
    • Mixer layer is layer norm, transpose to get patches, MLP, transpose to get channels, layernorm, second MLP
  • ResNet improvements:
    • Change normalization ordering
    • Wider (more filters) networks
    • ResNeXt: multiple paths inception-style
    • DenseNet: add more residual pathways
    • MobileNet: use channel downsampling 1x1 conv layers
    • Neural Architecture Search: NAS
      • EfficientNet: fast, accurate, small
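
The ViT patch step can be sketched without a convolution: a conv layer whose kernel size equals its stride is equivalent to cutting the image into non-overlapping patches, flattening each one, and applying one shared linear layer. This is a minimal sketch; the function name, image size, patch size, and embedding width are assumptions for illustration.

```python
import numpy as np

def patchify(image, patch):
    """image: (H, W, C) -> (num_patches, patch * patch * C), row-major patch order."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must divide evenly into patches"
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C) # one flattened row per patch

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patch, d_model = 8, 64
tokens_in = patchify(image, patch)               # (16, 192): 16 patches of 8*8*3 values
W_embed = rng.normal(size=(patch * patch * 3, d_model)) * 0.02
tokens = tokens_in @ W_embed                     # (16, 64) patch embeddings
```

After this step the image is just a sequence of 16 tokens, so the rest of the model can be ordinary transformer blocks (plus positional encodings).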