CSE 493 - Deep Learning
- This course is heavily based off of CS231n
- This course will primarily focus on NLP and Computer Vision
- Course is multiple parts:
- Deep learning fundamentals
- Practical training skills
- Application
- Vision has been one of the drivers of early DL, important to understand the history
- Books can be helpful, but not necessary
- Gradescope has automatic and private testing
- Three psets, one midterm, and a final group project
Lecture 1 - March 28
- ~543 million years ago, animals started to develop sight
- Camera obscura developed in 1545 to study eclipses
- Inspired da Vinci to create the pinhole camera
- 1959 Hubel & Wiesel found that we visually react to “edges” and “blobs”
- Think of this as a “lower layer”
- Larry Roberts is known as the “Father of Computer Vision” - wrote the first CV thesis
- 1960’s MIT attempted to solve vision in a summer - this didn’t happen
- David Marr introduced the idea of stages of visual recognition in 1970’s
- Edge detection became the next big push in CV
- In the 1980’s expert systems became popular
- These had heuristic rules made by domain “experts”
- Unsuccessful and caused the second AI Winter
- Irving Biederman came up with rules on how we view the world
- 1: We must understand components (objects and relationships)
- 2: This is only possible because we see so many objects while learning
- A 6 year old child has seen 30,000 objects
- We can detect animals in 150 ms
- We detect predators and the color red even quicker!
- Later-stage neurons allowed us to detect complex objects or themes
- In the 1990’s research started to focus on real-world images
- Algorithms were developed for grouping (1990’s) and matching (2000’s)
- In 2001 the first commercial success in CV
- Facial detection, used ML and facial features
- In the 2000’s feature development was all the rage
- Histogram of Oriented Gradients (HOG) - which directions do the edges in the image point?
- We need an incredible amount of data - led to ImageNet
- By 2009, it had 22K categories and 14M images
- In 2012 AlexNet had breakthrough performance on ImageNet
- By 2015 all top entries were DL-based and better than humans
- In 1957 the Mark I Perceptron was created for character recognition
- Manually twisted knobs to tune (adjusted weights)
- Cannot be trained practically
- Backpropagation was developed in 1986
- LeNet is the architecture used in the Postal Service - 1998
- AlexNet uses essentially the same architecture, just scaled up
- DL was used in the early 2000’s to compress images
- Everything is homogenized now
- Transformers and backprop are the norm
- Data and compute are the differentiators
- Domains change, but core is often the same
- Hinton, Bengio, and LeCun won the Turing award in 2018
- Deep learning is its own course because of incredible growth
Lecture 2 - March 30
- Image classification (IC) is a core task in CV
- There are many challenges related to computer vision:
- Viewpoint variation
- Illumination
- Background clutter
- Occlusion
- Deformation
- Intraclass variation
- It is very difficult to implement an image classifier as “normal” software
- “old” AI
- Data-driven paradigm is better
- Use datasets to train a model
- MNIST
- CIFAR-10(0)
- ImageNet
- MIT Places
- Omniglot
- Use ML to train a classifier
- Evaluate the classifier on new images
- Nearest Neighbor classifier
- Training: memorize all data and labels
- Inference: predict label of “nearest” image
- $O(1)$ train time, $O(n)$ inference time
- It is a universal approximator with n -> infinity data points
- What is “nearest” (distance)?
- L1 norm is bad (Manhattan)
- L2 norm is good (Euclidean)
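- A minimal numpy sketch of the nearest-neighbor classifier described above (flattened-image arrays and small integer labels are assumptions):
```python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # "Training" is just memorizing all data and labels: O(1)
        self.Xtr, self.ytr = X, y

    def predict(self, X, k=1, metric="l2"):
        # Inference compares every test example against all training data: O(n)
        preds = np.zeros(X.shape[0], dtype=self.ytr.dtype)
        for i in range(X.shape[0]):
            if metric == "l1":   # Manhattan distance
                dists = np.sum(np.abs(self.Xtr - X[i]), axis=1)
            else:                # Euclidean distance
                dists = np.sqrt(np.sum((self.Xtr - X[i]) ** 2, axis=1))
            nearest = self.ytr[np.argsort(dists)[:k]]
            preds[i] = np.bincount(nearest).argmax()  # majority vote over k neighbors
        return preds
```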
- Hyperparameters are choices (configs) of the model
- e.g. k, distance type
- Finding hyperparams
- We should use a train, val, and test ds
- Cross validation: split data into folds - no dedicated val necessary
- Curse of dimensionality: the number of data points necessary is related to the dimension of data
- Linear classifiers take on the form: $y = Wx + b$
- $b$ is often omitted by instead appending a one to the data vector (bias trick)
- Parametric approach
- Learns a template and decision boundary
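- A tiny sketch of the parametric form and the bias trick (CIFAR-10-like shapes assumed):
```python
import numpy as np

D, C = 3072, 10                          # 32x32x3 flattened image, 10 classes
x = np.random.randn(D)
W, b = 1e-3 * np.random.randn(C, D), np.zeros(C)

scores = W @ x + b                       # y = Wx + b: one score per class

# Bias trick: append a 1 to x and fold b into an extra column of W
x_aug = np.append(x, 1.0)
W_aug = np.hstack([W, b[:, None]])
assert np.allclose(W_aug @ x_aug, scores)
```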
Lecture 3 - April 4
- It’s cool that we have a classifier, but how do we make it good?
- Loss function: define how good or bad our classifier (weights) are
- Loss over dataset is the average of loss over examples
- $x_i$, $y_i$, and $L_i$ are example data, output label, and example loss
- Multiclass SVM loss:
- Make sure that prediction of correct label is greater than all predictions for other labels by a given margin
- The margin delta doesn’t matter much because the weights can simply rescale
- Linear-ish loss
- Issues: it is a piecewise function and doesn’t approach zero asymptotically
- Squaring losses leads to predictions being penalized quadratically in how incorrect they are
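- A sketch of the multiclass SVM (hinge) loss for a single example, with margin delta = 1 (the scores and label are made up):
```python
import numpy as np

def svm_loss_single(scores, y, delta=1.0):
    # Penalize every incorrect class whose score comes within `delta` of the correct one
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0                       # the correct class never penalizes itself
    return margins.sum()

print(svm_loss_single(np.array([3.2, 5.1, -1.7]), y=0))  # ~2.9
```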
- Regularization is important, we always want the simplest model (no overfitting!)
- Often done by adding a fraction of the L1 or L2 norm of the weights to the loss.
- Spread out the weights
- Softmax classifier:
- Pushes each value between 0 and 1
- Ensures the sum of the softmaxed outputs is 1
- $S_i = \frac{e^{s_i}}{\sum_j e^{s_j}}$ where $s$ are the class scores
- Scaling will determine how “peaky” the softmaxed scores are
- NLL is the negative log of the prediction of the correct label
- Cross entropy loss is the NLL of the softmax of the prediction of the correct label
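- A sketch of softmax plus cross-entropy (the NLL of the softmaxed score of the correct class), with the usual max-subtraction for numerical stability:
```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()       # stability trick: doesn't change the result
    e = np.exp(scores)
    return e / e.sum()

def cross_entropy_single(scores, y):
    # Negative log of the predicted probability of the correct class
    return -np.log(softmax(scores)[y])

print(cross_entropy_single(np.array([3.2, 5.1, -1.7]), y=0))  # ~2.04
```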
- Optimization is gradient descent
- Find the partial derivatives of the loss with respect to the weights
- We update each weight by the scaled negative of its partial derivative gradient
- Use a numeric gradient to gradient check
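- A sketch of a plain gradient-descent update plus a centered numeric gradient for gradient checking (the toy loss f is an assumption):
```python
import numpy as np

def numeric_gradient(f, w, h=1e-5):
    # Centered finite differences: nudge each weight and watch the loss
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + h; f_plus = f(w)
        w.flat[i] = old - h; f_minus = f(w)
        w.flat[i] = old
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

f = lambda w: np.sum(w ** 2)             # toy loss whose true gradient is 2w
w = np.random.randn(5)
grad = numeric_gradient(f, w)            # compare against the analytic gradient
w -= 0.1 * grad                          # step in the scaled negative gradient direction
```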
Lecture 4 - April 6
- Stochastic gradient descent uses minibatches to estimate the gradient
- Typically pick your GPU, then the biggest minibatch size that fits
- Linear classifiers aren’t very powerful (for many problems)
- They only learn one template and linear decision boundaries
- We can extract features, then fit a linear classifier
- Non-linear transformations (features) are necessary
- We have to manually create features…if only we could learn them…
- Simple two-layer NN: $f = W_2 max(0, W_1x)$
- $x \in \mathbb{R}^D, W_1 \in \mathbb{R}^{H \times D}, W_2 \in \mathbb{R}^{C \times H}$
- We can think of this as learning templates, then learning combinations of templates
- There are many different activation functions:
- ReLU, Sigmoid, Leaky ReLU are (historically) popular ones
- Activation functions are necessary because multiple consecutive linear transformations can be represented as just one transformation
- Biological neurons are quite different!
- We use computational graphs and backpropagation to differentiate (backwards pass)
- Backprop is simply the chain rule - a lot
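- A minimal two-layer network with a hand-written backward pass, matching $f = W_2 \max(0, W_1 x)$; a squared-error loss is assumed just to have something to backprop:
```python
import numpy as np

D, H, C = 4, 8, 3
x, target = np.random.randn(D), np.random.randn(C)
W1, W2 = 0.1 * np.random.randn(H, D), 0.1 * np.random.randn(C, H)

# Forward pass through the computational graph
a = W1 @ x                       # first layer
h = np.maximum(0, a)             # ReLU activation
f = W2 @ h                       # class scores
loss = 0.5 * np.sum((f - target) ** 2)

# Backward pass: apply the chain rule one node at a time
df = f - target                  # dLoss/df
dW2 = np.outer(df, h)            # dLoss/dW2
dh = W2.T @ df                   # dLoss/dh
da = dh * (a > 0)                # ReLU passes gradient only where its input was positive
dW1 = np.outer(da, x)            # dLoss/dW1
```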
Lecture 5 - April 11
- Vector-vector backprop includes the Jacobian matrix for local operations
- ImageNet helped people realize that data was super important!
- Convolutional NNs are a way of including spatial information
- CNNs are ubiquitous within vision
- While FC nets flatten images, CNNs preserve spatial structure
- Filters must be the same depth as the input image (i.e. 3 for RGB)
- Slide the filter over the entire image; at each position, take the dot product between the filter and the image patch beneath it
- Stride is the offset between each filter-comparison
- The number of filters is the number of activation maps (and the number of output channels)
- ConvNet is simply many convolutional layers with activation functions in between
- Earlier layers learn low-level features, later layers learn higher-level features
- Ends with a simple FC classifier
- There are interspersed pooling layers to downsample
- Output size of an $N \times N$ image with filter size $F \times F$ is: $(N - F) / stride + 1$
- Often images will be padded with zero pixels
- 1x1 convolutional layers increase or reduce depth in channel space
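- A quick sanity check of the output-size formula, with zero padding included (formula applied per spatial dimension):
```python
def conv_output_size(n, f, stride=1, padding=0):
    # (N - F + 2P) / stride + 1
    return (n - f + 2 * padding) // stride + 1

print(conv_output_size(32, 5))             # 28: a 5x5 filter shrinks a 32x32 input
print(conv_output_size(32, 5, padding=2))  # 32: "same" padding preserves the size
print(conv_output_size(7, 3, stride=2))    # 3
```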
Lecture 6 - April 13
- Training loop:
- Sample a batch of data
- Forward prop through the graph to get loss
- Backprop through the gradients
- Update the params using the gradient
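- The same loop as a PyTorch sketch (model, loader, and loss_fn are assumed to exist):
```python
import torch

def train(model, loader, loss_fn, lr=1e-3, epochs=1):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:                # 1. sample a batch of data
            loss = loss_fn(model(x), y)    # 2. forward prop through the graph to get loss
            optimizer.zero_grad()
            loss.backward()                # 3. backprop to get the gradients
            optimizer.step()               # 4. update the params using the gradient
    return model
```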
- Before you train:
- Activation functions
- Preprocessing
- Weight initialization
- Regularization
- Gradient checking
- Training dynamics:
- Babysitting the learning process
- Param updates
- Hyperparam updates
- Evaluation:
- Model ensembles
- Test-time augmentation
- Transfer learning
- Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
- Used to be popular
- Saturated gradient at high positive and negative values
- All gradients squashed by at least a factor of 4
- Outputs aren’t zero-centered which means local gradient is always positive
- Weights will now all change in the same direction
- Computationally expensive!
- Tanh function:
- Zero centered
- Still has the problem of saturating (vanishing) gradients
- ReLU (rectified linear unit): $f(x) = max(0, x)$
- Most common activation function
- No saturation in positive region
- Computationally efficient
- Converges (6x) faster than sigmoid or tanh in practice
- Not zero centered
- Can lead to “dying” ReLUs
- To prevent, often initialize biases with small positive value (e.g. 0.01)
- Leaky ReLU: $f(x) = max(0.01x, x)$
- Same as ReLU, but with a small gradient in the negative region
- Parametric Rectifier (PReLU) where scaling is a learnable parameter
- Exponential Linear Units (ELU):
- “Better”, but more expensive
- Scaled exponential linear units (SELU):
- Works better for larger networks, has a normalizing property
- Can use without BatchNorm
- Holy Heck Math
- “Cool”
- Maxout:
- Max of multiple linear layers
- Multiplies the number of parameters :(
- Swish:
- RL-created activation function
- GeLU:
- Add some randomness to ReLU, then take average to find this
- “Data dependent dropout”
- Main activation function used (esp. in transformers)
- Use ReLU by default, GeLU for transformers, and try ReLU variants for a little extra
- We often zero-mean and normalize our data as preprocessing
- Sometimes preprocessing involves PCA and whitening
- ResNet subtracted mean across channels and normalized with standard deviation
- A constant weight initialization leads to all the values being the same
- Initializations that are too large or too small lead to extreme saturation or clustered outputs
- “Xavier” initialization: $std = \frac{1}{\sqrt{D_{in}}}$
- Good with Tanh
- Attempts to keep output variance similar to input variance
- “Kaiming” initialization: $std = \sqrt{\frac{2}{D_{in}}}$
- Works well with ReLU!
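- Both initializations as a sketch for a fully-connected layer with fan-in D_in:
```python
import numpy as np

def xavier_init(d_in, d_out):
    # std = 1 / sqrt(D_in): keeps output variance close to input variance (good with tanh)
    return np.random.randn(d_out, d_in) / np.sqrt(d_in)

def kaiming_init(d_in, d_out):
    # std = sqrt(2 / D_in): the factor 2 compensates for ReLU zeroing half the activations
    return np.random.randn(d_out, d_in) * np.sqrt(2.0 / d_in)
```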
- Batch Normalization:
- “Things break when inputs don’t have zero mean and $std = 1$” - “Why not just force that?”
- Subtract by batch mean, divide by square root of the variance of the data
- It is differentiable!
- We typically place these before each non-linearity
- We also have two learned parameters, gamma and beta corresponding to scaling and shifting
- We keep a running mean of mean and variance across our training process
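- A training-time batch-norm forward pass as a sketch (fully-connected case; the inference path, which uses the running statistics, is omitted):
```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var, momentum=0.9, eps=1e-5):
    # x: (N, D). Normalize each feature with the batch mean and variance...
    mu, var = x.mean(axis=0), x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    out = gamma * x_hat + beta                 # ...then learn to scale (gamma) and shift (beta)
    # Running statistics are kept for use at test time
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return out, running_mean, running_var
```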
Lecture 7 - April 18
- BatchNorm
- Becomes a linear layer during inference
- “Resets” the mean and standard deviation after linear layers change them
- Allows higher learning rate and better gradient flow
- Acts as regularization during training
- Different behavior during training and testing! Can be a common bug!
- For CNNs, batch norm normalizes per channel (the mean/std have shape $1 \times C \times 1 \times 1$)
- LayerNorm:
- Normalize across each example!
- Instance normalization is for CNN across height and width
- Good for segmentation and detection
- There are a ton of ways to normalize in similar ways
- Vanilla gradient descent: calculate gradient, move towards the negative gradient
- Issues related to getting stuck in saddle points
- Stochastic, so descent can be extremely noisy
- Momentum keeps optimization moving in the same direction:
- $v_{t+1} = \rho v_t + \nabla f(x_t)$ then update using $v_{t+1}$
- Common values for $\rho$ are $0.9$ and $0.99$
- Momentum will often overshoot minima
- Nesterov momentum: take the velocity update before taking derivative
- Good, but we have to update weights twice, so we use a different formulation where $\tilde{x}_t = x_t + \rho v_t$:
- SGD + momentum or Nesterov are often what is used in practice
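- A sketch of the SGD + momentum update in the same style as the Adam pseudocode below (compute_gradient, x, and num_iterations are assumed):
```python
v = 0
rho, learning_rate = 0.9, 1e-2
for t in range(num_iterations):
    dx = compute_gradient(x)
    v = rho * v + dx               # v_{t+1} = rho * v_t + grad
    x -= learning_rate * v         # step along the accumulated (negative) velocity
```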
- AdaGrad sums squared gradients, dividing each gradient by the square root of the sum of its squared derivatives:
- This leads to weights with large gradients getting updated more slowly and weights with small gradients more quickly
- It also quickly decays all learning rates to zero - a problem for neural networks
- RMSProp (Leaky AdaGrad) is AdaGrad but with a decay to previous squared grads - similar to a running average
- Keeps step sizes relatively constant
- Common decay rate is $0.99$
- Doesn’t overshoot as much
- More computationally expensive
- Adam: combine the best of RMSProp and momentum
```python
first_moment = 0
second_moment = 0
for t in range(1, num_iterations):
    dx = compute_gradient(x)
    # Momentum-like first moment and RMSProp-like second moment
    first_moment = beta1 * first_moment + (1 - beta1) * dx
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    # Bias correction: moments start at zero, which would otherwise make early steps too large
    first_unbias = first_moment / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
```
- There is a bias correction to prevent large early step sizes from instantly destroying initialization
- Typical hyperparameters are beta1 = 0.9 and beta2 = 0.999
- L2 regularization and weight decay are the same when using SGD (with momentum), but different for Adam, AdaGrad, RMSProp
- AdamW: the go-to optimizer
- Adam with decoupled weight decay and $L_2$ regularization
- Allows user to choose if they want weights to be a part of the second moment or not
- There are second-order optimization techniques where you move in the negative direction of the Hessian
- This is typically intractable due to the number of parameters
- AdaGrad is actually a special case of a second-order optimization technique where we assume the Hessian is diagonal
- Second-order optimization is not typically used in practice
Lecture 8 - April 20
- Learning rate decay: scaling down the learning rate over time
- Necessary to decrease loss beyond a certain point
- Hyperparameter choice is extremely important
- There are many learning rate schedulers:
- “Step” down the learning rate after a fixed number of epochs
- Leads to massive drops in loss followed by plateaus repeatedly
- Cosine learning rate decay: $\alpha_t = \frac{1}{2}\alpha_0(1+cos(\frac{t\pi}{T}))$
- $\alpha_0$: initial learning rate, $\alpha_t$ learning rate at step $t$, $T$ total number of steps
- Constant decrease in loss over time
- Linear learning rate decay
- Inverse square root decay
- Constant learning rate decay
- Learning rate warmup: spending first iterations increasing learning rate
- A large initial learning rate will cause our weight initialization to blow up
- Linearly increase learning rate over about 5 epochs (~5000 iterations)
- If you increase batch size by $N$, also increase learning rate by $N$
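- A sketch combining linear warmup with the cosine decay formula above (warmup_iters and total_iters are assumed knobs):
```python
import math

def lr_at(t, base_lr, warmup_iters=5000, total_iters=100_000):
    if t < warmup_iters:
        # Linear warmup: ramp from 0 up to the base learning rate
        return base_lr * t / warmup_iters
    # Cosine decay: alpha_t = 0.5 * alpha_0 * (1 + cos(pi * t / T))
    progress = (t - warmup_iters) / (total_iters - warmup_iters)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```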
- Validation data is a good way of monitoring how the model is actually doing
- Early stopping is ending model training when validation loss plateaus (or starts rising)
- Often for ~5 epochs
- Training multiple models and then averaging results is a model ensemble:
- Exhibits roughly 2% better performance in practice
- Different models overfit to different parts of the dataset, it is averaged out
- Regularization techniques previously covered: L2, L1, weight decay
- Dropout: every forward pass, set each neuron’s activation to zero with probability $p$
- Increases redundancy across the entire network, prevents some overfitting
- Common value of $p = 0.5$
- An interpretation of dropout is an ensemble of models with shared parameters
- At test time, scale activations by the keep probability; or instead scale by its inverse during training (inverted dropout) so test time needs no change
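- A sketch of inverted dropout, where p is the drop probability and surviving activations are rescaled by 1/(1-p) during training:
```python
import numpy as np

def dropout_forward(x, p=0.5, train=True):
    if not train:
        return x                                        # test time is a plain forward pass
    mask = (np.random.rand(*x.shape) >= p) / (1 - p)    # drop with prob p, rescale survivors
    return x * mask
```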
- A common pattern with regularization is adding randomness during training and averaging out during testing
- Seen in batch norm and dropout for example
- Data augmentation is a common form of regularization:
- Transforming the input in such a way that the label is the same
- Some example image augmentation techniques:
- Flipping an image horizontally
- Random crops and scales of an image
- Testing: average a fixed set of crops of a test image
- Change contrast or color of images
- Stretching or contorting images
- We are now training models to learn good data augmentation techniques
- DropConnect: randomly set weights (connections) to zero, instead of activations as in dropout
- Fractional pooling pools random regions of each image
- Testing: average predictions over multiple regions
- Stochastic depth: skip entire layers in a network using residual connections
- Cutout: randomly set parts of image to average image color
- Good for small datasets, not often used on large ones
- Mixup: blend both training images and training labels by an amount
- Why does it work? Who cares!
- CutMix: replace random crops of one image with another while combining labels
- Label smoothing: set target label to: $1 - \frac{K-1}{K}\epsilon$ and other labels to: $\frac{\epsilon}{K}$ for $K$ classes
- In practice:
- Use dropout for FC layers
- Using batchnorm is always good
- Try Cutout, MixUp, CutMix, Stochastic Depth, Label Smoothing for (a little) extra performance
- Grid search is okay for hyperparameter search
- Use log-linear values
- Random search is better:
- Use log-uniform randomness in given range
- Likely because some hyperparameters matter more than others, so more opportunities to get it exact than grid search
- How to choose hyperparameters without Google-level compute:
- Check initial loss: sanity check, turn off weight decay
- Overfit on a small sample (5-10 minibatches)
- Make some architectural decisions
- Loss not decreasing? LR too low, bad weight init
- Loss NaN or Inf? LR too high, also bad weight init
- Search for learning rate for ~100 iterations
- Good LR to think about: 1e-1 to 1e-4
- Coarse search: add other hyperparams and train a few models for ~1.5 epochs
- Good weight decay to try: 1e-4, 1e-5, 0
- Pick the best models from the coarse search, train for ~10-20 epochs without learning rate decay
- Look at learning curves and adjust
- Might need early stopping, adjust regularization, larger model, or keep going
- Flat start to loss graph means bad initialization
- When learning rate plateaus, add a scheduler (cosine!)
- Use a “command center” like Weights and Biases (or add your own!)
- Return to the longer-training step and repeat!
- Linear classifiers are easy to visualize because they have only one template (filter) per class
- Early layers in deep learning networks are also easy to visualize, these are filters
- Often edge detection
- The last layers are an embedded representation of the inputs
- The embeddings are much better for KNN
- Google search:
- Embeds search into a 100 or 200 dimensional vector, then runs KNN on a massive database
- Needs massive compute however
- We can use PCA or t-SNE to visualize the outputs of the last layer of a network
- Good to use KNN rather than simply labels so we can see what is happening under the hood
- We can visualize activation maps for CNNs
- To find what activates a neuron the best, run many image patches from the dataset through the model and see which patches maximally activate it
- Occlusion using patches is another way of visualizing which pixels matter, can be graphed for a cool image
- Shapley values are a way of using multiple patches to determine important areas of the image
- Compute backprop to the images to find activations for saliency maps
- Super good image segmentation (accidentally)
Lecture 9 - April 25
- Saliency maps allow us to view biases within misclassifications
- Clamp gradients to only negative values
- We can also “backprop” a gray image to make an example image
- Gradient ascent and a heck of a lot of regularization
- Adversarial examples:
- Start from an arbitrary image
- Pick an arbitrary class
- Modify the image to improve class scores
- Repeat!
- Black box attacks:
- Adding random noise to images and models get confused…
- Supervised learning is insanely expensive:
- Labeling ImageNet’s 1.4M images would cost more than $175,000
- Unsupervised learning: model isn’t told what to predict
- Self-Supervised learning: model predicts some naturally occurring signal in the data
- The goal is to learn a cheap “pretext” task which learns important features
- Target is something that is easy to compute
- For example, start with autoencoder, then fine-tune
- Three main types of pretext tasks:
- Generative
- Autoencoders, GANs
- Discriminative
- Contrastive
- Multimodal
- Input video, output audio
- Generative
- Pretext task performance is irrelevant
- Often just toss a linear classifier on the back of the encoder
- Generative supervised learning:
- Generate some data from an example
- Computers (NNs) are not rotation-invariant:
- Allows us to predict rotation as a pretext task
- The learned attentions are similar to what supervised learning finds
- Predict relative patch locations:
- Break image into 3x3 grid, pass in center square and an outer square and predict location
- CNNs are size invariant
- Jigsaw puzzle:
- Reorder patches according to correct permutation
- Inpainting:
- Pass in image with missing patch, predict the missing patch
- Adding adversarial loss increases image recreation quality
- Image coloring:
- Use “LAB” coloring, pass in L (grayscale) and predict AB (color)
- Split-brain autoencoder:
- Predict color from light, predict light from color
- Video coloring:
- Given colored start frame and grayscale images from then on, predict the colors
- Uses attention mechanism
- Contrastive learning:
- Create many different examples from an original example
- These examples are going to be in the same semantic space
- Learn which examples are closer or further away from each other
Lecture 10 - April 27
- Contrastive learning:
- Get a reference example $x$
- Create transformed or augmented examples from $x$, called $x^+$
- Label all other examples $x^-$
- Maximize $score(f(x), f(x^+))$ and minimize $score(f(x), f(x^-))$
- Loss function is derived from softmax
- “Every single instance is its own class”
- Uses cosine similarity: $\frac{u^T v}{\|u\|\,\|v\|}$
- Generate positive samples through simple data augmentation
- SimCLR Framework:
- Make two simple transformations from the example
- Run each transformed example through a NN to get representation
- Run each representation through a simple linear classifier or MLP
- Maximize the cosine similarity of those outputs
- Example transformations for images:
- Random cropping
- Random color distortion
- Distortion blur
- Create a minibatch matrix ($2N$ by $2N$) of alternating example-transformed images:
- Run it through the model
- Take the cosine similarity of the matrix with itself
- The $(2k, 2k+1)$ and $(2k+1, 2k)$ scores should be positive, everything else should be negative
- Diagonal will always be 1
- See slides for illustration
- SimCLR works extremely well with very large batch sizes (e.g., 4096)
- We don’t want to directly expose the representation to the loss function, so we use a MLP head
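- A compact sketch of a SimCLR-style loss over a batch whose rows alternate the two views of each image (the temperature value and layout are assumptions):
```python
import torch
import torch.nn.functional as F

def simclr_loss(z, tau=0.5):
    # z: (2N, D) projection-head outputs; rows 2k and 2k+1 are two views of the same image
    z = F.normalize(z, dim=1)                 # so the dot product is cosine similarity
    sim = z @ z.T / tau
    sim.fill_diagonal_(-1e9)                  # an example is never its own pair
    targets = torch.arange(z.shape[0]) ^ 1    # positive index: 0<->1, 2<->3, ...
    return F.cross_entropy(sim, targets)      # softmax over everything else in the batch

loss = simclr_loss(torch.randn(8, 128))
```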
- MoCo: Momentum Contrastive Learning
- Keep a running queue of keys (negative examples) for all images in the batch
- If we have 1000 examples and 2000 negatives, it is $1000 \cdot 2000$
- Update the encoder only through the queries (reference images)
- Makes the momentum encoder much less computationally expensive
- We update the momentum encoder via: $\theta_k \leftarrow m \theta_k + (1 - m) \theta_q$
- Slowly aligns the two networks
- Uses cross-entropy loss
- Treats each negative as a class
- One correct label (this comes from the two parallel transformations) and the incorrect labels (negatives) are from the queue
- These don’t need to be run in parallel or there could be a concatenation
- MoCo V2: Add a non-linear projection head and better data augmentation
- DINO: do we need negatives? (very recent)
- Reformulates contrastive learning as a distillation problem
- Teacher model is like the momentum encoder
- Running average of the student model
- Sees a global view augmentation of the image
- Student model only sees cropped augmentation of the image
- DINO training tricks for the teacher:
- Center the teacher outputs using a running mean
- Sharpen the distribution towards a certain class - like a temperature
- Has the effect of making the teacher be a bit of a classifier
- DINO V2 released this week!
- Contrastive Predictive Coding (CPC)
- Contrastive: contrast between correct and incorrect sequences using contrastive learning
- Predictive: model must predict future patterns based on current
- Coding: model must learn useful feature vectors
- We give the model context, then a correct continuation sequence and many incorrect possible sequences
- First encode all images into vectors: $z_t = g_{enc}(x_t)$
- Summarize all context into a context code $c_t$ using an autoregressive model $g_{ar}$
- Compute InfoNCE loss between context $c_t$ and future code $z_{t+k}$ using scoring function:
- $s_k(z_{t+k}, c_t) = z^T_{t+k} W_k c_t$
- $W_k$ is a trainable matrix
- CLIP is a contrastive learning model
- We can sequentially process non-sequential data - think about how we observe an image
- Variable length sequences are tough to work with for basic NNs
- Recurrent Neural Networks contain an internal “state”, a summary or context of what’s been seen before
- RNN formulation: $h_t = f_W(h_{t-1}, x_t)$ (repeat as necessary!)
- Typical autoregressive loss function - run everything through, make predictions, then sum/mean the losses from those predictions
Lecture 11 - May 2
- Vanilla RNN learns three weight matrices:
- Transforms input
- Transforms the previous context state (these first two are summed, then tanh is applied)
- Transforms context state to output
- Hidden representation is commonly initialized to zero
- One could learn the initial hidden state, but it’s not necessary
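- A sketch of one vanilla RNN step with the three weight matrices above (toy shapes, biases omitted):
```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, Why):
    # h_t = tanh(W_xh x_t + W_hh h_{t-1});  y_t = W_hy h_t
    h = np.tanh(Wxh @ x + Whh @ h_prev)
    return h, Why @ h

D, H, V = 10, 32, 10
Wxh, Whh, Why = (0.01 * np.random.randn(*s) for s in [(H, D), (H, H), (V, H)])
h = np.zeros(H)                               # hidden state initialized to zero
for x in np.random.randn(20, D):              # repeat over the sequence
    h, y = rnn_step(x, h, Wxh, Whh, Why)
```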
- Sequence length is an assumption we often have to make during training
- Encoder-decoder architecture for sequences encodes sequence into a representation, then decodes into a sequence from that representation
- Think language translation, the original attention paper
- One hot vector is a vector of all zeros and a one corresponding to a single class
- We use an embedding layer - computationally inexpensive (indexing), but keeps gradient flowing!
- Test time - autoregressive model (I wrote about this!), similar to a phone’s text autocomplete
- We truncate sequences, these act as minibatches
- We get surprising emergent behaviors from sequence modeling
- RNN advantages:
- Can process sequential information
- Can use information from many steps back
- Symmetrical processes for each step
- RNN disadvantages:
- Recurrent computation is slow
- Information is lost over a sequence in practice
- Exploding gradients :(
- Image classification: combine RNNs and CNNs
- Take a CNN and remove the classification head
- Use the image representation as the first hidden representation in an RNN
- Repeatedly sample tokens until the `<END>` token is produced
- Question answering: use RNN and CNN to generate representations, learn a compression, then softmax across the language
- Agents can learn instructions and actions to take in a language- or image-based environment (same principles as before)
- Multilayer RNNs: stacking layer weights and adding depth
- Vanilla RNN gradient flow: look up the derivation!
- Gradients will vanish as the tanh squishes the gradient for each step
- Even without tanh, this problem is repeated - gradients will explode or vanish
- Note: normalization doesn’t work well with RNNs, active field of research
- To combat exploding gradients we can use gradient “clipping”, scaling the gradient if its norm is too large
- However, there is no good solution for vanishing gradients
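- Gradient clipping is one call in PyTorch; a sketch of a training step using it (model, loss, and optimizer are assumed):
```python
import torch

def clipped_step(model, loss, optimizer, max_norm=1.0):
    optimizer.zero_grad()
    loss.backward()
    # Rescale the whole gradient vector if its norm exceeds max_norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```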
- LSTMs (Long Short Term Memory) help solve the problems within RNNs
- LSTMs keep track of two values: a cell memory and the next hidden representation
- Long and short term memory!
- Usually both initialized to zero
- LSTMs produce four outputs (a $4h$-dimensional vector) from the hidden state $h$ and the vector from below, $x$
- This can be done in parallel through matrix multiplication
- Three of these outputs are passed through a sigmoid nonlinearity, the fourth is passed through a tanh
- Cell memory: $c_t = f \cdot c_{t - 1} + i \cdot g$, $h_t = o \cdot tanh(c_t)$
- The “info gate” (output of tanh) determines how much to write to cell: $g$
- The “input gate” $i$ determines whether to write to cell
- The “forget gate” $f$ determines how much to forget or erase from cell
- The “output gate” $o$ determines how much to reveal cell at a timestep
- LSTMs create a “highway network” for gradients to flow back over time through the cell memory
- LSTMs preserve information better over time, however there’s no guarantee
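- A single LSTM step as a sketch, following the gate equations above (biases omitted, toy shapes):
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    # W maps [h_{t-1}; x_t] to the four gate pre-activations at once (the "4h" output)
    i, f, o, g = np.split(W @ np.concatenate([h_prev, x]), 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                  # how much to write to the cell
    c = f * c_prev + i * g                          # c_t = f * c_{t-1} + i * g
    h = o * np.tanh(c)                              # h_t = o * tanh(c_t)
    return h, c

H, D = 8, 4
W = 0.1 * np.random.randn(4 * H, H + D)
h, c = np.zeros(H), np.zeros(H)                     # both usually initialized to zero
h, c = lstm_step(np.random.randn(D), h, c, W)
```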
- Residual (type) connections are very popular and widespread throughout deep learning!
- Neural architecture search for LSTM-like architecture was popular in the past, but no more
- GRUs (Gated Recurrent Unit) are inspired by LSTMs
- Quite simple and common because of ease of training
- LSTMs were quite popular until this year, however transformers have risen in popularity
- Seemingly a new transformer model every day, check out: visual transformer history
Lecture 12 - May 9
- Recurrent image captioning is constrained by the size of the image representation vector
- Attention: use different context vector at different timesteps
- Use some function to get relationship scores within different elements in a vector
- Softmax to normalize the scores
- Use these scores to create output context vector (multiply by some value vector)
- This is just scaled dot-product attention
- Also works for interpretability: allows for bias correction as well
- Attention Layer:
- We matmul a query and key to create scores
- Softmax the scores to normalize them
- Multiply the scores by the values, then sum
- Query, key, and value come from linear transformations
- Notes:
- We need to use masked-attention for sequence problems
- We need to inject positional encodings as attention layers are position-invariant
- Simply add positional vectors (often sinusoidal)
- We also scale the dot-product by dividing by $\sqrt{d}$
- Important for autoregressive problems
- We love `torch.triu` (for building causal masks)
- Self-attention doesn’t require separate inputs for queries, keys, and values; they all come from the same input
- Multi-head self attention:
- Split the dimensionality up, use attention for each part, then concatenate
- One of the biggest issues: attention is an $n^2$ memory requirement
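- A single-head sketch of masked, scaled dot-product self-attention (the projections and shapes are assumptions; multi-head attention just splits the dimension into several such heads):
```python
import torch
import torch.nn.functional as F

def attention(q, k, v, mask=None):
    # Scores = QK^T / sqrt(d); softmax turns them into weights over the values
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))   # e.g., block future positions
    return F.softmax(scores, dim=-1) @ v

N, d = 5, 16
x = torch.randn(N, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv          # self-attention: Q, K, V from the same input
causal = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
out = attention(q, k, v, mask=causal)     # (N, d) context vectors
```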
- Transformers: sequence to sequence model (encoder-decoder)
- Encoder:
- Made up of multiple “encoder blocks”
- Each encoder block has a multi-head self-attention block (with residual connection), layernorm, MLP (with residual connection), then a second layer norm
- Decoder:
- Made up of multiple “decoder blocks”
- Masked multi-head self attention (residual), layernorm, cross attention (with encoder output and residual), layernorm, MLP (residual), second layernorm
- CNNs are often replaced by transformers which split data up into patches
- Good if you have a lot of data; with small datasets you want to use CNNs
- Vision transformer (ViT)
- Image captioning can work now with purely transformer-based architectures
- Transformer (size) history: (see slides)
- Started with 12 layers, 8 heads, 65M params
Lecture 13 - May 11
- LeNet, small network of convolutional layers
- We pool to downsample images
- AlexNet: bigger model
- Model was split across two GPUs during training
- First use of ReLU
- Data augmentation
- Dropout 0.5
- 7-model ensemble
- VGGNet: smaller filters, deeper network
- GoogLeNet: added multiple paths between layers (InceptionNet)
- Later, 1x1 convolutions were added to reduce channel dimensions (reduce computational costs)
- Also add auxiliary classification outputs earlier to keep gradient flow
- ResNet: add residual connections between layers
- Keeps gradients flowing
- Also allows the model to learn just the residual (the difference)
- All current networks start with a conv layer
- Dropout, Kaiming init
- ViT: Vision Transformer
- Add a convolution in the first layer to create patches
- Then just run through transformer blocks
- ViT needs more data
- Trained on a dataset called JFT-300M
- ViT performs worse than ResNet on 10M images
- Final layer is finetuned on ImageNet-1.5M
- MLP-Mixer: all MLP architecture
- Full of Mixer Layers
- Mixer layer is layer norm, transpose to get patches, MLP, transpose to get channels, layernorm, second MLP
- ResNet improvements:
- Change normalization ordering
- Wider (more filters) networks
- ResNeXT: multiple paths inception-style
- DenseNet: add more residual pathways
- MobileNet: use channel downsampling 1x1 conv layers
- Neural Architecture Search: NAS
- EfficientNet: fast, accurate, small