
CSE 493 - Deep Learning

Course Website

  • This course is heavily based on CS231n
  • This course will primarily focus on NLP and Computer Vision
  • Course is multiple parts:
    • Deep learning fundamentals
    • Practical training skills
    • Application
  • Vision has been one of the drivers of early DL, important to understand the history
  • Books can be helpful, but not necessary
  • Gradescope has automatic and private testing
  • Three psets, one midterm, and a final group project

Lecture 1 - March 28

  • 543 million years ago, animals started to develop sight
  • Camera obscura developed in 1545 to study eclipses
    • Inspired da Vinci to create the pinhole camera
  • 1959 Hubel & Wiesel found that we visually react to “edges” and “blobs”
    • Think of this as a “lower layer”
  • Larry Roberts is known as the “Father of Computer Vision” - wrote the first CV thesis
  • In the 1960s, MIT attempted to solve vision in a summer - this didn’t happen
  • David Marr introduced the idea of stages of visual recognition in 1970’s
  • Edge detection became the next big push in CV
  • In the 1980’s expert systems became popular
    • These had heuristic rules made by domain “experts”
    • Unsuccessful and caused the second AI Winter
  • Irving Biederman came up with rules on how we view the world
    • 1: We must understand components (objects and relationships)
    • 2: This is only possible because we see so many objects while learning
      • A 6 year old child has seen 30,000 objects
  • We can detect animals in 150 ms
    • We detect predators and the color red even quicker!
  • Later-stage neurons allowed us to detect complex objects or themes
  • In the 1990s, research started on real-world images
    • Algorithms were developed for grouping (1990’s) and matching (2000’s)
  • In 2001 the first commercial success in CV
    • Facial detection, used ML and facial features
  • In the 2000’s feature development was all the rage
    • Histogram of oriented gradients (HOG) - how are edges oriented in each patch of pixels?
  • We need an incredible amount of data - led to ImageNet
    • 2009, had 22K categories and 14M images
  • In 2012 AlexNet had breakthrough performance on ImageNet
    • By 2015 all top entries used DL and surpassed human performance
  • In 1957 the Mark I Perceptron was created for character recognition
    • Manually twisted knobs to tune (adjusted weights)
    • Cannot be trained practically
  • Backpropagation was developed in 1986
  • LeNet is the architecture used in the Postal Service - 1998
    • AlexNet has essentially the same architecture, just scaled up
  • DL was used in the early 2000’s to compress images
  • Everything is homogenized now
    • Transformers and backprop are the norm
    • Data and compute are the differentiators
    • Domains change, but core is often the same
  • Hinton, Bengio, and LeCun won the Turing award in 2018
  • Deep learning is its own course because of its incredible growth

Lecture 2 - March 30

  • Image classification (IC) is a core task in CV
  • There are many challenges related to computer vision:
    • Viewpoint variation
    • Illumination
    • Background clutter
    • Occlusion
    • Deformation
    • Intraclass variation
    • It is very difficult to implement an image classifier as “normal” software
      • “old” AI
    • Data-driven paradigm is better
      1. Use datasets to train a model
        • MNIST
        • CIFAR-10(0)
        • ImageNet
        • MIT Places
        • Omniglot
      2. Use ML to train a classifier
      3. Evaluate the classifier on new images
  • Nearest Neighbor classifier
    • Training: memorize all data and labels
    • Inference: predict label of “nearest” image
    • $O(1)$ train time, $O(n)$ inference time
    • It is a universal approximator in the limit of $n \to \infty$ data points
  • What is “nearest” (distance)?
    • L1 norm is bad (Manhattan)
    • L2 norm is good (Euclidean)
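  • As a concrete illustration, a minimal nearest-neighbor sketch in NumPy (an assumed helper, not course-provided code), supporting both L1 and L2 distances:

      import numpy as np

      def nearest_neighbor_predict(X_train, y_train, x, metric="l2"):
          # X_train: (N, D) memorized training images (flattened); y_train: (N,) labels
          if metric == "l1":
              dists = np.abs(X_train - x).sum(axis=1)            # Manhattan (L1) distance
          else:
              dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean (L2) distance
          return y_train[np.argmin(dists)]                       # label of the closest training image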
  • Hyperparameters are choices (configs) of the model
    • e.g. k, distance type
  • Finding hyperparams
    • We should use train, val, and test datasets
    • Cross validation: split data into folds - no dedicated val necessary
  • Curse of dimensionality: the number of data points necessary is related to the dimension of data
  • Linear classifiers take on the form: $y = Wx + b$
    • $b$ is often omitted, instead appending data vector with a one
    • Parametric approach
    • Learns a template and decision boundary
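  • A sketch of the linear classifier score function with the bias trick mentioned above (names are illustrative):

      import numpy as np

      def linear_scores(W, x):
          # W: (C, D + 1) weights with the bias folded in; x: (D,) image vector
          x_aug = np.append(x, 1.0)   # append a constant 1 so b is absorbed into W
          return W @ x_aug            # (C,) class scores, y = Wx + b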

Lecture 3 - April 4

  • It’s cool that we have a classifier, but how do we make it good?
  • Loss function: define how good or bad our classifier (weights) are
  • Loss over dataset is the average of loss over examples
    • $x_i$, $y_i$, and $L_i$ are example data, output label, and example loss
  • Multiclass SVM loss:
    • Make sure that prediction of correct label is greater than all predictions for other labels by a given margin
    • Delta doesn’t matter because it will simply scale
    • Linear-ish loss
    • Issues: it is a piecewise function and doesn’t approach zero asymptotically
  • Squaring the loss penalizes predictions more heavily the more incorrect they are
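  • A minimal multiclass SVM (hinge) loss for one example, following the margin formulation above (illustrative code, not the assignment implementation):

      import numpy as np

      def svm_loss_single(scores, y, margin=1.0):
          # scores: (C,) class scores; y: index of the correct class
          margins = np.maximum(0, scores - scores[y] + margin)
          margins[y] = 0                       # do not count the correct class against itself
          return margins.sum()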
  • Regularization is important, we always want the simplest model (no overfitting!)
    • Often done by adding a fraction of the L1 or L2 norm of the weights to the loss
    • Spread out the weights
  • Softmax classifier:
    • Pushes each value between 0 and 1
    • Ensures the sum of the softmaxed outputs is 1
    • $S_i = \frac{e^{y_i}}{\sum_j e^{y_j}}$
    • Scaling will determine how “peaky” the softmaxed scores are
  • NLL is the negative log of the prediction of the correct label
  • Cross entropy loss is the NLL of the softmax of the prediction of the correct label
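  • Putting softmax and NLL together, a sketch of cross-entropy loss for one example (illustrative code, not the assignment implementation):

      import numpy as np

      def cross_entropy_single(scores, y):
          # scores: (C,) unnormalized class scores; y: index of the correct class
          shifted = scores - scores.max()                    # shift for numerical stability
          probs = np.exp(shifted) / np.exp(shifted).sum()    # softmax
          return -np.log(probs[y])                           # NLL of the correct class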
  • Optimization is gradient descent
    • Find the partial derivatives of the loss with respect to the weights
    • We update each weight by the negative of its partial derivative, scaled by the learning rate
  • Use a numeric gradient to gradient check
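  • A sketch of a numeric (central-difference) gradient for gradient checking (assumed helper; in practice you compare it against the analytic gradient from backprop):

      import numpy as np

      def numeric_gradient(f, w, h=1e-5):
          # f: function of the weights returning a scalar loss; w: weight array
          grad = np.zeros_like(w)
          for i in range(w.size):
              old = w.flat[i]
              w.flat[i] = old + h
              f_plus = f(w)
              w.flat[i] = old - h
              f_minus = f(w)
              w.flat[i] = old                              # restore the original value
              grad.flat[i] = (f_plus - f_minus) / (2 * h)
          return grad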

Lecture 4 - April 6

  • Stochastic gradient descent uses minibatches to estimate the gradient
    • Typically: pick your GPU, then use the biggest minibatch size that fits in memory
  • Linear classifiers aren’t very powerful (for many problems)
    • They only learn one template and linear decision boundaries
  • We can extract features, then fit a linear classifier
    • Non-linear transformations (features) are necessary
    • We have to manually create features…if only we could learn them…
  • Simple two-layer NN: $f = W_2 max(0, W_1x)$
    • $x \in \mathbb{R}^D, W_1 \in \mathbb{R}^{H \times D}, W_2 \in \mathbb{R}^{C \times H}$
    • We can think of this as learning templates, then learning combinations of templates
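  • The two-layer network above in a couple of lines of NumPy (a sketch; biases omitted, as in the formula):

      import numpy as np

      def two_layer_forward(W1, W2, x):
          # x: (D,) input; W1: (H, D); W2: (C, H); ReLU between the layers
          h = np.maximum(0, W1 @ x)   # hidden layer: learned "templates"
          return W2 @ h               # class scores: combinations of templates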
  • There are many different activation functions:
    • ReLU, Sigmoid, Leaky ReLU are (historically) popular ones
  • Activation functions are necessary because multiple consecutive linear transformations can be represented as just one transformation
  • Biological neurons are quite different!
  • We use computational graphs and backpropagation to differentiate (backwards pass)
  • Backprop is simply the chain rule - a lot

Lecture 5 - April 11

  • Vector-vector backprop includes the Jacobian matrix for local operations
  • ImageNet helped people realize that data was super important!
  • Convolutional NNs are a way of including spatial information
  • CNNs are ubiquitous within vision
  • While FC nets flatten images, CNNs preserve spatial structure
  • Filters must be the same depth as the input image (i.e. 3 for RGB)
  • Slide over the entire image, flatten each part of the image it is above, dot product with filter
  • Stride is the offset between each filter-comparison
  • The number of filters is the number of activation maps (and the number of output channels)
  • ConvNet is simply many convolutional layers with activation functions in between
  • Earlier layers learn low-level features, later layers learn higher-level features
  • Ends with a simple FC classifier
  • There are interspersed pooling layers to downsample
  • Output size of an $N \times N$ image with filter size $F \times F$ is: $(N - F) / stride + 1$
  • Often images will be padded with zero pixels
  • 1x1 convolutional layers increase or reduce depth in channel space
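  • The output-size formula (extended with the zero padding mentioned above) as a tiny helper, for reference:

      def conv_output_size(N, F, stride=1, pad=0):
          # Spatial output size for an N x N input with an F x F filter
          return (N + 2 * pad - F) // stride + 1

      # e.g. conv_output_size(32, 5, stride=1, pad=2) == 32 (padding preserves the size)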

Lecture 6 - April 13

  • Training loop:
    • Sample a batch of data
    • Forward prop through the graph to get loss
    • Backprop through the gradients
    • Update the params using the gradient
  • Before you train:
    • Activation functions
    • Preprocessing
    • Weight initialization
    • Regularization
    • Gradient checking
  • Training dynamics:
    • Babysitting the learning process
    • Param updates
    • Hyperparam updates
  • Evaluation:
    • Model ensembles
    • Test-time augmentation
    • Transfer learning
  • Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
    • Used to be popular
    • Saturated gradient at high positive and negative values
    • All gradients squashed by at least a factor of 4
    • Outputs aren’t zero-centered which means local gradient is always positive
      • Weights will now all change in the same direction
    • Computationally expensive!
  • Tanh function:
    • Zero centered
    • Still has the problem of saturating (vanishing) gradients
  • ReLU (rectified linear unit): $f(x) = max(0, x)$
    • Most common activation function
    • No saturation in positive region
    • Computationally efficient
    • Converges (6x) faster than sigmoid or tanh in practice
    • Not zero centered
    • Can lead to “dying” ReLUs
      • To prevent, often initialize biases with small positive value (e.g. 0.01)
  • Leaky ReLU: $f(x) = max(0.01x, x)$
    • Same as ReLU, but with a small gradient in the negative region
    • Parametric Rectifier (PReLU) where scaling is a learnable parameter
  • Exponential Linear Units (ELU):
    • “Better”, but more expensive
  • Scaled exponential linear units (SELU):
    • Works better for larger networks, has a normalizing property
    • Can use without BatchNorm
    • Holy Heck Math
    • “Cool”
  • Maxout:
    • Max of multiple linear layers
    • Multiplies the number of parameters :(
  • Swish:
    • RL-created activation function
  • GeLU:
    • Randomly zero the input (ReLU-style) with a probability that depends on the input, then take the expectation to get a smooth curve
      • “Data dependent dropout”
    • Main activation function used (esp. in transformers)
  • Use ReLU; use GeLU for transformers; and try ReLU variants if you need a bit more
  • We often zero-mean and normalize our data as preprocessing
  • Sometimes preprocessing involves PCA and whitening
  • ResNet subtracted mean across channels and normalized with standard deviation
  • A constant weight initialization leads to all neurons computing the same thing (no symmetry breaking)
  • Initializations that are too large or too small lead to extreme saturation or clustered outputs
  • “Xavier” initialization: $std = \frac{1}{\sqrt{D_{in}}}$
    • Good with Tanh
    • Attempts to keep output variance similar to input variance
  • “Kaiming” initialization: $std = \sqrt{\frac{2}{D_{in}}}$
    • Works well with ReLU!
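  • The two initializations as one-liners (illustrative; shapes assume a $D_{out} \times D_{in}$ weight matrix):

      import numpy as np

      def xavier_init(D_in, D_out):
          # std = 1 / sqrt(D_in): keeps output variance close to input variance (good with tanh)
          return np.random.randn(D_out, D_in) / np.sqrt(D_in)

      def kaiming_init(D_in, D_out):
          # std = sqrt(2 / D_in): the factor of 2 accounts for ReLU zeroing half the activations
          return np.random.randn(D_out, D_in) * np.sqrt(2 / D_in)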
  • Batch Normalization:
    • “Things break when inputs don’t have zero mean and $std = 1$” - “Why not just force that?”
    • Subtract the batch mean, divide by the square root of the batch variance (plus a small epsilon)
    • It is differentiable!
    • We keep these before each non-linearity
    • We also have two learned parameters, gamma and beta corresponding to scaling and shifting
    • We keep a running average of the mean and variance across training (used at test time)
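  • A training-time BatchNorm forward pass as a sketch (running statistics for test time omitted; names are illustrative):

      import numpy as np

      def batchnorm_forward_train(x, gamma, beta, eps=1e-5):
          # x: (N, D) batch of activations; gamma, beta: (D,) learned scale and shift
          mu = x.mean(axis=0)
          var = x.var(axis=0)
          x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit std per feature
          return gamma * x_hat + beta             # learned rescale and reshift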

Lecture 7 - April 18

  • BatchNorm
    • Becomes a linear layer during inference
    • “Resets” the mean and standard deviation after they drift through the preceding linear layers
    • Allows higher learning rate and better gradient flow
    • Acts as regularization during training
    • Different behavior during training and testing! Can be a common bug!
    • For CNNs, batch norm across channels (output is: $1 \times C \times 1 \times 1$)
  • LayerNorm:
    • Normalize across each example!
  • Instance normalization is for CNN across height and width
    • Good for segmentation and detection
  • There are a ton of ways to normalize in similar ways
  • Vanilla gradient descent: calculate gradient, move towards the negative gradient
    • Issues related to getting stuck in saddle points
    • Stochastic, so descent can be extremely noisy
  • Momentum keeps optimization moving in the same direction:
    • $v_{t+1} = \rho v_t + \nabla f(x_t)$ then update using $v_{t+1}$
    • Common values for $\rho$ are $0.9$ and $0.99$
    • Momentum will often overshoot minima
  • Nesterov momentum: take the velocity update before taking derivative
    • Good, but awkward because the gradient is evaluated at a look-ahead point, so we use a different formulation where $\tilde{x}_t = x_t + \rho v_t$
  • SGD + momentum or Nesterov are often what is used in practice
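  • The momentum update from above as a two-line sketch (variable names are illustrative, in the style of the Adam snippet below):

      # rho is the momentum coefficient (e.g. 0.9)
      v = rho * v + dx           # v_{t+1} = rho * v_t + gradient
      x -= learning_rate * v     # step using the accumulated velocity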
  • AdaGrad accumulates a sum of squared gradients, dividing each gradient by the square root of that accumulated sum:
    • This means weights with historically large gradients get smaller updates, and weights with small gradients get larger updates
    • It also quickly decays all effective learning rates to zero - a problem for neural networks
  • RMSProp (Leaky AdaGrad) is AdaGrad but with a decay to previous squared grads - similar to a running average
    • Keeps step sizes relatively constant
    • Common decay rate is $0.99$
    • Doesn’t overshoot as much
    • More computationally expensive
  • Adam: combine the best of RMSProp and momentum
      # Adam (simplified): momentum (first moment) + RMSProp-style scaling (second moment),
      # plus bias correction on both moments
      first_moment = 0
      second_moment = 0
      for t in range(1, num_iterations + 1):   # t starts at 1 for the bias correction
          dx = compute_gradient(x)
          first_moment = beta1 * first_moment + (1 - beta1) * dx           # momentum
          second_moment = beta2 * second_moment + (1 - beta2) * dx * dx    # RMSProp-like
          first_unbias = first_moment / (1 - beta1 ** t)                   # bias correction
          second_unbias = second_moment / (1 - beta2 ** t)
          x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
    
    • There is a bias correction to prevent large early step sizes from instantly destroying initialization
    • Typical hyperparams are beta1 = 0.9 and beta2 = 0.999
  • L2 regularization and weight decay are the same when using SGD (with momentum), but different for Adam, AdaGrad, RMSProp
  • AdamW: the go-to optimizer
    • Adam with decoupled weight decay and $L_2$ regularization
    • Allows user to choose if they want weights to be a part of the second moment or not
  • There are second-order optimization techniques where you step using the inverse Hessian (curvature) times the gradient
    • This is typically intractable due to the number of parameters
    • AdaGrad is actually a special case of a second-order optimization technique where we assume the Hessian is diagonal
    • Second-order optimization is not typically used in practice

Lecture 8 - April 20

  • Learning rate decay: scaling down the learning rate over time
    • Necessary to decrease loss beyond a certain point
    • Hyperparameter choice is extremely important
  • There are many learning rate schedulers:
    • “Step” down the learning rate after a fixed number of epochs
      • Leads to massive drops in loss followed by plateaus repeatedly
    • Cosine learning rate decay: $\alpha_t = \frac{1}{2}\alpha_0(1+cos(\frac{t\pi}{T}))$
      • $\alpha_0$: initial learning rate, $\alpha_t$ learning rate at step $t$, $T$ total number of steps
      • Constant decrease in loss over time
    • Linear learning rate decay
    • Inverse square root decay
    • Constant learning rate decay
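  • The cosine schedule above as a small function (a sketch; warmup not included):

      import math

      def cosine_lr(t, T, lr0):
          # Learning rate at step t out of T total steps, starting from lr0
          return 0.5 * lr0 * (1 + math.cos(math.pi * t / T))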
  • Learning rate warmup: spending first iterations increasing learning rate
    • A large initial learning rate will cause our weight initialization to blow up
    • Linearly increase learning rate over about 5 epochs (~5000 iterations)
    • If you increase batch size by $N$, also increase learning rate by $N$
  • Validation data is a good way of keeping an eye on how the model is doing during training
  • Early stopping is ending model training when validation loss plateaus
    • Often for ~5 epochs
  • Training multiple models and then averaging results is a model ensemble:
    • Typically gives ~2% better performance in practice
    • Different models overfit to different parts of the dataset, it is averaged out
  • Regularization techniques previously covered: L2, L1, weight decay
  • Dropout: every forward pass, set activations (not weights) to zero with probability $p$
    • Increases redundancy across the entire network, prevents some overfitting
    • Common value of $p = 0.5$
    • An interpretation of dropout is an ensemble of models with shared parameters
    • At test time, scale activations by the keep probability; or instead rescale during training so test time needs no change (inverted dropout) - see the sketch below
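  • A minimal inverted-dropout sketch (here $p$ is the drop probability, matching the notes above; names are illustrative):

      import numpy as np

      def dropout_train(x, p=0.5):
          # Drop each activation with probability p, then rescale by the keep
          # probability so that no extra scaling is needed at test time
          keep = 1.0 - p
          mask = (np.random.rand(*x.shape) < keep) / keep
          return x * mask   # at test time, just return x unchanged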
  • A common pattern with regularization is adding randomness during training and averaging out during testing
    • Seen in batch norm and dropout for example
  • Data augmentation is a common form of regularization:
    • Transforming the input in such a way that the label is the same
  • Some example image augmentation techniques:
    • Flipping an image horizontally
    • Random crops and scales of an image
      • Testing: average a fixed set of crops of a test image
    • Change contrast or color of images
    • Stretching or contorting images
  • We are now training models to learn good data augmentation techniques
  • DropConnect: set weights (connections) to zero, instead of activations as in dropout
  • Fractional pooling pools random regions of each image
    • Testing: average predictions over multiple regions
  • Stochastic depth: skip entire layers in a network using residual connections
  • Cutout: randomly set parts of image to average image color
    • Good for small datasets, not often used on large ones
  • Mixup: blend both training images and training labels by an amount
    • Why does it work? Who cares!
  • CutMix: replace random crops of one image with another while combining labels
  • Label smoothing: set target label to: $1 - \frac{K-1}{K}\epsilon$ and other labels to: $\frac{\epsilon}{K}$ for $K$ classes
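  • The label-smoothing targets from the formula above, as a sketch (the $\epsilon$ value is illustrative):

      import numpy as np

      def smoothed_targets(y, K, eps=0.1):
          # y: correct class index; K: number of classes
          t = np.full(K, eps / K)              # epsilon / K on the wrong classes
          t[y] = 1 - (K - 1) / K * eps         # remaining mass on the correct class
          return t                             # sums to 1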
  • In practice:
    • Use dropout for FC layers
    • Using batchnorm is always good
    • Try Cutout, MixUp, CutMix, Stochastic Depth, Label Smoothing for (a little) extra performance
  • Grid search is okay for hyperparameter search
    • Use log-linear values
  • Random search is better:
    • Use log-uniform randomness in given range
    • Likely because some hyperparameters matter more than others, so random search tries more distinct values of the important ones than grid search does
  • How to choose hyperparameters without Google-level compute:
    1. Check initial loss: sanity check, turn off weight decay
    2. Overfit on a small sample (5-10 minibatches)
      • Make some architectural decisions
      • Loss not decreasing? LR too low, bad weight init
      • Loss NaN or Inf? LR too high, also bad weight init
    3. Search for learning rate for ~100 iterations
      • Good LR to think about: 1e-1 to 1e-4
    4. Coarse search: add other hyperparams and train a few models for ~1.5 epochs
      • Good weight decay to try: 1e-4, 1e-5, 0
    5. Pick best models from 4., train for ~10-20 epochs without learning rate decay
    6. Look at learning curves and adjust
      • Might need early stopping, adjust regularization, larger model, or keep going
      • Flat start to loss graph means bad initialization
      • When learning rate plateaus, add a scheduler (cosine!)
      • Use a “command center” like Weights and Biases (or add your own!)
    7. Return to step 5!
  • Linear classifiers are easy to visualize because they learn only one template per class
  • Early layers in deep networks are also easy to visualize; their weights are small filters
    • Often edge detection
  • The last layers are an embedded representation of the inputs
    • The embeddings are much better for KNN
  • Google search:
    • Embeds search into a 100 or 200 dimensional vector, then runs KNN on a massive database
    • Needs massive compute however
  • We can use PCA or t-SNE to visualize the outputs of the last layer of a network
  • Good to use KNN rather than simply labels so we can see what is happening under the hood
  • We can visualize activation maps for CNNs
  • To find what activates a neuron the most, run patches of images from the dataset through the model and see which patches maximally activate it
  • Occlusion using patches is another way of visualizing which pixels matter, can be graphed for a cool image
  • Shapley values are a way of using multiple patches to determine important areas of the image
  • Compute backprop to the images to find activations for saliency maps
    • Super good image segmentation (accidentally)

Lecture 9 - April 25

  • Saliency maps allow us to view biases within misclassifications
    • Clamp gradients to only negative values
  • We can also “backprop” a gray image to make an example image
    • Gradient ascent and a heck of a lot of regularization
  • Adversarial examples:
    1. Start from an arbitrary image
    2. Pick an arbitrary class
    3. Modify the image to improve class scores
    4. Repeat!
  • Black box attacks:
    • Adding random noise to images confuses models…
  • Supervised learning is insanely expensive:
    • Labeling ImageNet’s 1.4M images would cost more than $175,000
  • Unsupervised learning: model isn’t told what to predict
  • Self-Supervised learning: model predicts some naturally occurring signal in the data
    • The goal is to solve a cheap “pretext” task that forces the model to learn important features
    • Target is something that is easy to compute
  • For example, start with autoencoder, then fine-tune
  • Three main types of pretext tasks:
    • Generative
      • Autoencoders, GANs
    • Discriminative
      • Contrastive
    • Multimodal
      • Input video, output audio
  • Pretext task performance is irrelevant
  • Often just toss a linear classifier on the back of the encoder
  • Generative self-supervised learning:
    • Generate some data from an example
  • Computers (NNs) are not rotation-invariant:
    • Allows us to predict rotation as a pretext task
  • The learned attentions are similar to what supervised learning finds
  • Predict relative patch locations:
    • Break image into 3x3 grid, pass in center square and an outer square and predict location
    • CNNs are size invariant
  • Jigsaw puzzle:
    • Reorder patches according to correct permutation
  • Inpainting:
    • Pass in image with missing patch, predict the missing patch
    • Adding adversarial loss increases image recreation quality
  • Image coloring:
    • Use “LAB” coloring, pass in L (grayscale) and predict AB (color)
  • Split-brain autoencoder:
    • Predict color from light, predict light from color
  • Video coloring:
    • Given colored start frame and grayscale images from then on, predict the colors
    • Uses attention mechanism
  • Contrastive learning:
    • Create many different examples from an original example
    • These examples are going to be in the same semantic space
    • Learn which examples are closer or further away from each other

Lecture 10 - April 27

  • Contrastive learning:
    • Get a reference example $x$
    • Create transformed or augmented examples from $x$, called $x^+$
    • Label all other examples $x^-$
    • Maximize $score(f(x), f(x^+))$ and minimize $score(f(x), f(x^-))$
    • Loss function is derived from softmax
      • “Every single instance is its own class”
    • Uses cosine similarity
      • $\frac{u^T v}{\lVert u \rVert \, \lVert v \rVert}$
    • Generate positive samples through simple data augmentation
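  • The cosine-similarity score used above as a tiny helper (illustrative):

      import numpy as np

      def cosine_similarity(u, v, eps=1e-8):
          # u, v: feature vectors from the encoder f(.)
          return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps)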
  • SimCLR Framework:
    • Make two simple transformations from the example
    • Run each transformed example through a NN to get representation
    • Run each representation through a simple linear classifier or MLP
    • Maximize the cosine similarity of those outputs
  • Example transformations for images:
    • Random cropping
    • Random color distortion
    • Gaussian blur
  • Create a minibatch of $2N$ images (two augmentations of each example, interleaved):
  • Run it through the model
  • Take the cosine similarity of every pair of outputs, giving a $2N \times 2N$ matrix
  • The $(2k, 2k+1)$ and $(2k+1, 2k)$ entries should score high (positives), everything else should score low (negatives)
    • Diagonal will always be 1
    • See slides for illustration
  • SimCLR works extremely well with very large batch sizes (64,000+)
  • We don’t want to directly expose the representation to the loss function, so we use a MLP head
  • MoCo: Momentum Contrastive Learning
    • Keep a running queue of keys (negative examples) for all images in the batch
      • e.g. with 1000 queries and 2000 negatives in the queue, that is $1000 \cdot 2000$ comparisons
    • Update the encoder only through the queries (reference images)
    • Makes the momentum (key) encoder much less computationally expensive, since no gradients flow through it
      • We update the momentum encoder via: $\theta_k \leftarrow m \theta_k + (1 - m) \theta_q$
      • Slowly aligns the two networks
    • Uses cross-entropy loss
      • Treats each negative as a class
      • One correct label (this comes from the two parallel transformations) and the incorrect labels (negatives) are from the queue
      • These don’t need to be run in parallel or there could be a concatenation
  • MoCo V2: Add a non-linear projection head and better data augmentation
  • DINO: do we need negatives? (very recent)
    • Reformulates contrastive learning as a distillation problem
    • Teacher model is like the momentum encoder
      • Running average of the student model
      • Sees a global view augmentation of the image
    • Student model only sees cropped augmentation of the image
  • DINO training tricks for the teacher:
    • Center the data by adding a mean
    • Sharpen the distribution towards a certain class - like a temperature
      • Has the effect of making the teacher be a bit of a classifier
  • DINO V2 released this week!
  • Contrastive Predictive Coding (CPC)
    • Contrastive: contrast between correct and incorrect sequences using contrastive learning
    • Predictive: model must predict future patterns based on current
    • Coding: model must learn useful feature vectors
  • We give the model context, then a correct continuation sequence and many incorrect possible sequences
  • First encode all images into vectors: $z_t = g_{enc}(x_t)$
  • Summarize all context into a context code $c_t$ using an autoregressive model $g_{ar}$
  • Compute InfoNCE loss between context $c_t$ and future code $z_{t+k}$ using scoring function:
    • $s_k(z_{t+k}, c_t) = z^T_{t+k} W_k c_t$
    • $W_k$ is a trainable matrix
  • CLIP is a contrastive learning model
  • We can sequentially process non-sequential data - think about how we observe an image
  • Variable length sequences are tough to work with for basic NNs
  • Recurrent Neural Networks contain an internal “state”, a summary or context of what’s been seen before
  • RNN formulation: $h_t = f_W(h_{t-1}, x_t)$ (repeat as necessary!)
  • Typical autoregressive loss function - run everything through, make predictions, then sum/mean the losses from those predictions

Lecture 11 - May 2

  • Vanilla RNN learns three weight matrices:
    • Transforms input
    • Transforms context state (sum these two transforms, then apply tanh)
    • Transforms context state to output
  • Hidden representation is commonly initialized to zero
    • One could learn the initial hidden state, but it’s not necessary
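  • One step of a vanilla RNN in NumPy, matching the three matrices above (a sketch; names are illustrative):

      import numpy as np

      def rnn_step(x, h_prev, W_xh, W_hh, W_hy):
          # x: (D,) input; h_prev: (H,) previous hidden (context) state
          h = np.tanh(W_xh @ x + W_hh @ h_prev)   # transform input + context, sum, tanh
          y = W_hy @ h                            # transform context state to output
          return h, y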
  • Sequence length is an assumption we often have to make during training
  • Encoder-decoder architecture for sequences encodes sequence into a representation, then decodes into a sequence from that representation
    • Think language translation, original attention paper
  • One hot vector is a vector of all zeros and a one corresponding to a single class
  • We use an embedding layer - computationally inexpensive (indexing), but keeps gradient flowing!
  • Test time - autoregressive model (I wrote about this!), similar to a phone’s text autocomplete
  • We truncate sequences, these act as minibatches
  • We get surprising emergent behaviors from sequence modeling
  • RNN advantages:
    • Can process sequential information
    • Can use information from many steps back
    • Symmetrical processes for each step
  • RNN disadvantages:
    • Recurrent computation is slow
    • Information is lost over a sequence in practice
    • Exploding gradients :(
  • Image classification: combine RNNs and CNNs
    • Take a CNN and remove the classification head
    • Use the image representation as the first hidden representation in an RNN
    • Repeatedly sample tokens until the <END> token is produced
  • Question answering: use RNN and CNN to generate representations, learn a compression, then softmax across the language
  • Agents can learn instructions and actions to take in a language or image based environment (same principles as before)
  • Multilayer RNNs: stacking layer weights and adding depth
  • Vanilla RNN gradient flow: look up the derivation!
    • Gradients will vanish as the tanh squishes the gradient for each step
    • Even without tanh, this problem is repeated - gradients will explode or vanish
    • Note: normalization doesn’t work well with RNNs, active field of research
  • To combat exploding gradients we can use gradient “clipping”, scaling the gradient if its norm is too large
    • However, there is no good solution for vanishing gradients
  • LSTMs (Long Short Term Memory) help solve the problems within RNNs
    • LSTMs keep track of two values: a cell memory and the next hidden representation
      • Long and short term memory!
    • Usually both initialized to zero
  • LSTMs produce four gate outputs (total size $4h$) from the hidden state $h$ and the input from below $x$
    • This can be done in parallel through matrix multiplication
    • Three of these outputs are passed through a sigmoid nonlinearity, the fourth is passed through a tanh
  • Cell memory: $c_t = f \cdot c_{t - 1} + i \cdot g$, $h_t = o \cdot tanh(c_t)$
    • The “info gate” (output of tanh) determines how much to write to cell: $g$
    • The “input gate” $i$ determines whether to write to cell
    • The “forget gate” $f$ determines how much to forget or erase from cell
    • The “output gate” $o$ determines how much to reveal cell at a timestep
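  • One LSTM step as a sketch, with the four gates computed from a single stacked matmul as described above (names and gate ordering are illustrative):

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def lstm_step(x, h_prev, c_prev, W):
          # W: (4H, D + H) stacked weights producing all four gates at once
          H = h_prev.shape[0]
          z = W @ np.concatenate([x, h_prev])
          i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])  # input, forget, output gates
          g = np.tanh(z[3*H:])                                              # candidate ("info") values
          c = f * c_prev + i * g          # cell memory update
          h = o * np.tanh(c)              # next hidden representation
          return h, c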
  • LSTMs create a “highway” for gradients to flow back over time through the cell memory
  • LSTMs preserve information better over time, however there’s no guarantee
  • Residual (type) connections are very popular and widespread throughout deep learning!
  • Neural architecture search for LSTM-like architecture was popular in the past, but no more
  • GRUs (Gated Recurrent Unit) are inspired by LSTMs
    • Quite simple and common because of ease of training
  • LSTMs were quite popular until this year, however transformers have risen in popularity

Lecture 12 - May 9

  • Recurrent image captioning is constrained by the size of the image representation vector
  • Attention: use different context vector at different timesteps
    • Use some function to get relationship scores between the different elements of the sequence
    • Softmax to normalize the scores
    • Use these scores to create output context vector (multiply by some value vector)
    • This is just scaled dot-product attention
  • Also works for interpretability: allows for bias correction as well
  • Attention Layer:
    • We matmul a query and key to create scores
    • Softmax the scores to normalize them
    • Multiply the scores by the values, then sum
  • Query, key, and value come from linear transformations
  • Notes:
    • We need to use masked attention for autoregressive sequence problems
      • We love torch.triu
    • We need to inject positional encodings as attention layers are permutation-invariant (they don’t see token order)
      • Simply add positional vectors (often sinusoidal)
    • We also scale the dot-product by dividing by $\sqrt{d}$ (see the sketch below)
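  • Scaled dot-product attention with an optional causal mask, as a NumPy sketch (shapes and names are illustrative):

      import numpy as np

      def attention(Q, K, V, causal=False):
          # Q, K, V: (T, d) query, key, and value matrices
          d = Q.shape[-1]
          scores = Q @ K.T / np.sqrt(d)                   # (T, T) similarity scores
          if causal:
              mask = np.triu(np.ones_like(scores), k=1).astype(bool)
              scores = np.where(mask, -1e9, scores)       # block attention to future positions
          weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
          weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
          return weights @ V                              # weighted sum of value vectors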
  • Self-attention doesn’t require separate inputs for queries, keys, and values; they all come from the same input
  • Multi-head self attention:
    • Split the dimensionality up, use attention for each part, then concatenate
  • One of the biggest issues: attention is an $n^2$ memory requirement
  • Transformers: sequence to sequence model (encoder-decoder)
    • Encoder:
      • Made up of multiple “encoder blocks”
      • Each encoder block has a multi-head self-attention block (with residual connection), layernorm, MLP (with residual connection), then a second layer norm
    • Decoder:
      • Made up of multiple “decoder blocks”
      • Masked multi-head self attention (residual), layernorm, cross attention (with encoder output and residual), layernorm, MLP (residual), second layernorm
  • CNNs are often replaced by transformers which split data up into patches
    • Good if you have a lot of data, but small data you want to use CNNs
    • Vision transformer (ViT)
  • Image captioning can work now with purely transformer-based architectures
  • Transformer (size) history: (see slides)
    • Started with 12 layers, 8 heads, 65M params

Lecture 13 - May 11

  • LeNet, small network of convolutional layers
    • We pool to downsample images
  • AlexNet: bigger model
    • Model split across two GPUs (due to memory limits at the time)
    • First use of ReLU
    • Data augmentation
    • Dropout 0.5
    • 7-model ensemble
  • VGGNet: smaller filters, deeper network
  • GoogLeNet: added multiple paths between layers (InceptionNet)
    • 1x1 convolutions were later added to downscale the channel dimension (reduce computational cost)
    • Also add auxiliary classification outputs earlier to keep gradient flow
  • ResNet: add residual connections between layers
    • Keeps gradients flowing
    • Also allows the model to learn just the residual (the difference)
    • All current networks start with a conv layer
    • No dropout; Kaiming init
  • ViT: Vision Transformer
    • Add a convolution in the first layer to create patches
    • Then just run through transformer blocks
    • ViT needs more data
    • Trained on a dataset called JFT-300M
      • ViT performs worse than ResNet on 10M images
    • Final layer is finetuned on ImageNet-1.5M
  • MLP-Mixer: all MLP architecture
    • Full of Mixer Layers
    • Mixer layer is layer norm, transpose to get patches, MLP, transpose to get channels, layernorm, second MLP
  • ResNet improvements:
    • Change normalization ordering
    • Wider (more filters) networks
    • ResNeXT: multiple paths inception-style
    • DenseNet: add more residual pathways
    • MobileNet: use channel downsampling 1x1 conv layers
    • Neural Architecture Search: NAS
      • EfficientNet: fast, accurate, small