CSE 493  Deep Learning
 This course is heavily based on CS231n
 This course will primarily focus on NLP and Computer Vision
 Course is multiple parts:
 Deep learning fundamentals
 Practical training skills
 Application
 Vision has been one of the drivers of early DL, important to understand the history
 Books can be helpful, but not necessary
 Gradescope has automatic and private testing
 Three psets, one midterm, and a final group project
Lecture 1  March 28
 543 million years ago, animals started to develop sight
 Camera obscura developed in 1545 to study eclipses
 Inspired da Vinci to create the pinhole camera
 1959 Hubel & Wiesel found that we visually react to “edges” and “blobs”
 Think of this as a “lower layer”
 Larry Roberts is known as the “Father of Computer Vision”; he wrote the first CV thesis
 In the 1960s, MIT attempted to solve vision in a summer; this didn’t happen
 David Marr introduced the idea of stages of visual recognition in the 1970s
 Edge detection became the next big push in CV
 In the 1980’s expert systems became popular
 These had heuristic rules made by domain “experts”
 Unsuccessful and caused the second AI Winter
 Irving Biederman came up with rules on how we view the world
 1: We must understand components (objects and relationships)
 2: This is only possible because we see so many objects while learning
 A 6 year old child has seen 30,000 objects
 We can detect animals in 150 ms
 We detect predators and the color red even quicker!
 Later-stage neurons allow us to detect complex objects or themes
 In the 1990s, research started on real-world images
 Algorithms were developed for grouping (1990’s) and matching (2000’s)
 In 2001, the first commercial success in CV
 Face detection, using ML on facial features
 In the 2000’s feature development was all the rage
 Histogram of Oriented Gradients (HOG): how are edge orientations distributed across the image?
 We need an incredible amount of data  led to ImageNet
 2009, had 22K categories and 14M images
 In 2012 AlexNet had breakthrough performance on ImageNet
 By 2015, all top entries used DL and surpassed human-level performance
 In 1957 the Mark I Perceptron was created for character recognition
 Manually twisted knobs to tune (adjusted weights)
 Cannot be trained practically
 Backpropagation was developed in 1986
 LeNet is the architecture used by the Postal Service (1998)
 AlexNet is essentially the same architecture
 DL was used in the early 2000’s to compress images
 Everything is homogenized now
 Transformers and backprop are the norm
 Data and compute are the differentiators
 Domains change, but core is often the same
 Hinton, Bengio, and LeCun won the Turing award in 2018
 Deep learning is its own course because of incredible growth
Lecture 2  March 30
 Image classification (IC) is a core task in CV
 There are many challenges related to computer vision:
 Viewpoint variation
 Illumination
 Background clutter
 Occlusion
 Deformation
 Intra-class variation
 It is very difficult to implement an image classifier as “normal” software
 “old” AI
 Data-driven paradigm is better
 Use datasets to train a model
 MNIST
 CIFAR-10(0)
 ImageNet
 MIT Places
 Omniglot
 Use ML to train a classifier
 Evaluate the classifier on new images
 Nearest Neighbor classifier
 Training: memorize all data and labels
 Inference: predict label of “nearest” image
 $O(1)$ train time, $O(n)$ inference time
 It is a universal approximator as $n \to \infty$
 What is “nearest” (distance)?
 L1 norm is bad (Manhattan)
 L2 norm is good (Euclidean)
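 A minimal NumPy sketch of the nearest-neighbor classifier with both distance choices (array shapes and the class interface are illustrative assumptions):
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # "Training" is just memorizing the data: X is (N, D), y is (N,)
        self.X_train, self.y_train = X, y

    def predict(self, X, metric="l2"):
        preds = np.empty(len(X), dtype=self.y_train.dtype)
        for i, x in enumerate(X):
            if metric == "l1":                       # Manhattan distance
                dists = np.abs(self.X_train - x).sum(axis=1)
            else:                                    # Euclidean distance
                dists = np.sqrt(((self.X_train - x) ** 2).sum(axis=1))
            preds[i] = self.y_train[np.argmin(dists)]  # label of the nearest memorized example
        return preds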
 Hyperparameters are choices (configs) of the model
 e.g. k, distance type
 Finding hyperparams
 We should use a train, val, and test ds
 Cross validation: split data into folds  no dedicated val necessary
 Curse of dimensionality: the number of data points necessary is related to the dimension of data
 Linear classifiers take on the form: $y = Wx + b$
 $b$ is often omitted, instead appending data vector with a one
 Parametric approach
 Learns a template and decision boundary
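 A small sketch of the parametric score function $y = Wx + b$ and the bias trick of appending a one to the data vector (the sizes are illustrative assumptions):
import numpy as np

D, C = 3072, 10                       # e.g. 32x32x3 images flattened, 10 classes
W = 0.01 * np.random.randn(C, D)      # weights
b = np.zeros(C)                       # biases

x = np.random.randn(D)                # one flattened image
scores = W @ x + b                    # y = Wx + b, one score per class

# Bias trick: append a 1 to x and fold b into W as an extra column
W_tilde = np.hstack([W, b[:, None]])  # (C, D+1)
x_tilde = np.append(x, 1.0)           # (D+1,)
assert np.allclose(scores, W_tilde @ x_tilde)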
Lecture 3  April 4
 It’s cool that we have a classifier, but how do we make it good?
 Loss function: define how good or bad our classifier (weights) are
 Loss over dataset is the average of loss over examples
 $x_i$, $y_i$, and $L_i$ are example data, output label, and example loss
 Multiclass SVM loss:
 Make sure that prediction of correct label is greater than all predictions for other labels by a given margin
 The margin delta doesn’t matter much because the weights can simply scale to compensate
 Linear-ish loss
 Issues because it is a piecewise function and doesn’t approach zero asymptotically
 Squaring the loss penalizes large violations disproportionately more (errors are weighted by how wrong they are)
 Regularization is important; we always want the simplest model (no overfitting!)
 Often done by adding a fraction of the L1 or L2 norm of the weights to the loss
 Spread out the weights
 Softmax classifier:
 Pushes each value between 0 and 1
 Ensures the sum of the softmaxed outputs is 1
 $S_i = \frac{e^{y_i}}{\sum_j e^{y_j}}$
 Scaling will determine how “peaky” the softmaxed scores are
 NLL is the negative log of the prediction of the correct label
 Cross entropy loss is the NLL of the softmax of the prediction of the correct label
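 A sketch of both per-example losses on a single score vector (the margin of 1 and the example scores are illustrative assumptions):
import numpy as np

def svm_loss_i(scores, y, margin=1.0):
    # Multiclass SVM: correct-class score should beat every other score by `margin`
    margins = np.maximum(0, scores - scores[y] + margin)
    margins[y] = 0                       # don't count the correct class against itself
    return margins.sum()

def softmax_cross_entropy_i(scores, y):
    # Softmax (shift scores for numerical stability), then NLL of the correct class
    shifted = scores - scores.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[y])

scores = np.array([3.2, 5.1, -1.7])      # class scores for one example
print(svm_loss_i(scores, y=0), softmax_cross_entropy_i(scores, y=0))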
 Optimization is gradient descent
 Find the partial derivatives of the loss with respect to the weights
 We update each weight by the scaled negative of its partial derivative gradient
 Use a numeric gradient to gradient check
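 A sketch of a centered numeric gradient check and a plain gradient descent step on a toy loss (the loss function and learning rate are assumptions for illustration):
import numpy as np

def numeric_gradient(loss_fn, W, h=1e-5):
    # Centered finite differences: nudge each weight and measure the change in loss
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        orig = W[idx]
        W[idx] = orig + h
        plus = loss_fn(W)
        W[idx] = orig - h
        minus = loss_fn(W)
        W[idx] = orig
        grad[idx] = (plus - minus) / (2 * h)
    return grad

# Toy example: check an analytic gradient, then take one gradient descent step
loss_fn = lambda W: np.sum(W ** 2)           # analytic gradient is 2W
W = np.random.randn(3, 4)
assert np.allclose(numeric_gradient(loss_fn, W), 2 * W, atol=1e-4)
W -= 1e-3 * 2 * W                            # update: scaled negative gradient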
Lecture 4  April 6
 Stochastic gradient descent uses minibatches to estimate the gradient
 Typically: pick your GPU, then use the biggest minibatch size that fits
 Linear classifiers aren’t very powerful (for many problems)
 They only learn one template and linear decision boundaries
 We can extract features, then fit a linear classifier
 Nonlinear transformations (features) are necessary
 We have to manually create features…if only we could learn them…
 Simple twolayer NN: $f = W_2 max(0, W_1x)$
 $x \in \mathbb{R}^D, W_1 \in \mathbb{R}^{H \times D}, W_2 \in \mathbb{R}^{C \times H}$
 We can think of this as learning templates, then learning combinations of templates
 There are many different activation functions:
 ReLU, Sigmoid, Leaky ReLU are (historically) popular ones
 Activation functions are necessary because multiple consecutive linear transformations can be represented as just one transformation
 Biological neurons are quite different!
 We use computational graphs and backpropagation to differentiate (backwards pass)
 Backprop is simply the chain rule  a lot
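 A sketch of the two-layer network from above with a manual backward pass, showing backprop as repeated application of the chain rule (the dimensions and the squared-error loss are assumptions for brevity):
import numpy as np

D, H, C = 3072, 100, 10
W1, W2 = 0.01 * np.random.randn(H, D), 0.01 * np.random.randn(C, H)
x, target = np.random.randn(D), np.random.randn(C)

# Forward: f = W2 max(0, W1 x)
h = np.maximum(0, W1 @ x)
f = W2 @ h
loss = 0.5 * np.sum((f - target) ** 2)

# Backward: chain rule, one local gradient at a time
df  = f - target              # dL/df
dW2 = np.outer(df, h)         # dL/dW2
dh  = W2.T @ df               # dL/dh
dh[h <= 0] = 0                # ReLU gate kills gradient where the unit was off
dW1 = np.outer(dh, x)         # dL/dW1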
Lecture 5  April 11
 Vectorvector backprop includes the Jacobian matrix for local operations
 ImageNet helped people realize that data was super important!
 Convolutional NNs are a way of including spatial information
 CNNs are ubiquitous within vision
 While FC nets flatten images, CNNs preserve spatial structure
 Filters must be the same depth as the input image (i.e. 3 for RGB)
 Slide over the entire image, flatten each part of the image it is above, dot product with filter
 Stride is the offset between each filter application
 The number of filters is the number of activation maps (and the number of output channels)
 ConvNet is simply many convolutional layers with activation functions in between
 Earlier layers learn low-level features, later layers learn higher-level features
 Ends with a simple FC classifier
 There are interspersed pooling layers to downsample
 Output size of an $N \times N$ image with filter size $F \times F$ is: $(N - F) / stride + 1$
 Often images will be padded with zero pixels
 1x1 convolutional layers increase or reduce depth in channel space
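 A quick check of the output-size formula, including zero padding $P$ (with padding the formula becomes $(N - F + 2P)/stride + 1$); the specific sizes below are just examples:
import torch
import torch.nn as nn

def conv_out_size(N, F, stride=1, pad=0):
    return (N - F + 2 * pad) // stride + 1

print(conv_out_size(32, 5, stride=1, pad=0))   # 28
print(conv_out_size(32, 5, stride=1, pad=2))   # 32 ("same" padding)

x = torch.randn(1, 3, 32, 32)                  # RGB image: filters must match depth 3
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)
print(conv(x).shape)                           # torch.Size([1, 6, 28, 28]): 6 filters -> 6 output channels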
Lecture 6  April 13
 Training loop:
 Sample a batch of data
 Forward prop through the graph to get loss
 Backprop through the gradients
 Update the params using the gradient
 Before you train:
 Activation functions
 Preprocessing
 Weight initialization
 Regularization
 Gradient checking
 Training dynamics:
 Babysitting the learning process
 Param updates
 Hyperparam updates
 Evaluation:
 Model ensembles
 Testtime augmentation
 Transfer learning
 Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
 Used to be popular
 Saturated gradient at high positive and negative values
 All gradients squashed by at least a factor of 4
 Outputs aren’t zerocentered which means local gradient is always positive
 Weights will now all change in the same direction
 Computationally expensive!
 Tanh function:
 Zero centered
 Still has the problem of saturating (vanishing) gradients
 ReLU (rectified linear unit): $f(x) = max(0, x)$
 Most common activation function
 No saturation in positive region
 Computationally efficient
 Converges (6x) faster than sigmoid or tanh in practice
 Not zero centered
 Can lead to “dying” ReLUs
 To prevent, often initialize biases with small positive value (e.g. 0.01)
 Leaky ReLU: $f(x) = max(0.01x, x)$
 Same as ReLU, but with a small gradient in the negative region
 Parametric Rectifier (PReLU) where scaling is a learnable parameter
 Exponential Linear Units (ELU):
 “Better”, but more expensive
 Scaled exponential linear units (SELU):
 Works better for larger networks, has a normalizing property
 Can use without BatchNorm
 Holy Heck Math
 “Cool”
 Maxout:
 Max of multiple linear layers
 Multiplies the number of parameters :(
 Swish:
 Activation function found via RL-based search
 GeLU:
 Add some randomness to ReLU, then take average to find this
 “Data dependent dropout”
 Main activation function used (esp. in transformers)
 Default to ReLU; use GeLU for transformers; try ReLU variants for a little extra
 We often zeromean and normalize our data as preprocessing
 Sometimes preprocessing involves PCA and whitening
 ResNet subtracted mean across channels and normalized with standard deviation
 A constant weight initialization leads to all the values being the same
 Initializations that are too large or too small lead to extreme saturation or clustered outputs
 “Xavier” initialization: $std = \frac{1}{\sqrt{D_{in}}}$
 Good with Tanh
 Attempts to keep output variance similar to input variance
 “Kaiming” initialization: $std = \sqrt{\frac{2}{D_{in}}}$
 Works well with ReLU!
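 A sketch of the two initializations as described, where $D_{in}$ is the fan-in (layer sizes are illustrative assumptions):
import numpy as np

D_in, D_out = 512, 256
# "Xavier": std = 1 / sqrt(D_in), aims to keep output variance close to input variance (good with tanh)
W_xavier = np.random.randn(D_out, D_in) / np.sqrt(D_in)
# "Kaiming": std = sqrt(2 / D_in), accounts for ReLU zeroing half the activations
W_kaiming = np.random.randn(D_out, D_in) * np.sqrt(2.0 / D_in)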
 Batch Normalization:
 “Things break when inputs don’t have zero mean and $std = 1$”  “Why not just force that?”
 Subtract by batch mean, divide by square root of the variance of the data
 It is differentiable!
 We keep these before each nonlinearity
 We also have two learned parameters, gamma and beta corresponding to scaling and shifting
 We keep a running mean of mean and variance across our training process
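 A minimal sketch of the training-time BatchNorm forward pass described above (the eps and momentum values are common defaults, not from the lecture):
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var, momentum=0.9, eps=1e-5):
    # x: (N, D) minibatch; normalize each feature using batch statistics
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    out = gamma * x_hat + beta                               # learned scale (gamma) and shift (beta)
    # Running statistics are what get used at test time
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return out, running_mean, running_var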
Lecture 7  April 18
 BatchNorm
 Becomes a linear layer during inference
 “Resets” the mean and standard deviation after each linear layer shifts them
 Allows higher learning rate and better gradient flow
 Acts as regularization during training
 Different behavior during training and testing! Can be a common bug!
 For CNNs, batch norm across channels (output is: $1 \times C \times 1 \times 1$)
 LayerNorm:
 Normalize across each example!
 Instance normalization is for CNN across height and width
 Good for segmentation and detection
 There are a ton of ways to normalize in similar ways
 Vanilla gradient descent: calculate gradient, move towards the negative gradient
 Issues related to getting stuck in saddle points
 Stochastic, so descent can be extremely noisy
 Momentum keeps optimization moving in the same direction:
 $v_{t+1} = \rho v_t + \nabla f(x_t)$ then update using $v_{t+1}$
 Common values for $\rho$ are $0.9$ and $0.99$
 Momentum will often overshoot minima
 Nesterov momentum: take the velocity update before taking derivative
 Good, but we have to update weights twice, so we use a different formulation where $\tilde{x}_t = x_t + \rho v_t$:
 SGD + momentum or Nesterov are often what is used in practice
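 A sketch of the SGD + momentum update above on a toy objective (the objective, $\rho$, and learning rate are illustrative assumptions):
import numpy as np

def compute_gradient(x):
    # toy objective f(x) = 0.5 * ||x||^2, so the gradient is just x
    return x

x = np.array([3.0, -2.0])
v = np.zeros_like(x)
rho, lr = 0.9, 1e-2            # common momentum and learning-rate values

for t in range(100):
    dx = compute_gradient(x)
    v = rho * v + dx           # v_{t+1} = rho * v_t + grad
    x -= lr * v                # step along the negative velocity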
 AdaGrad accumulates squared gradients, dividing each gradient by the square root of that accumulated sum:
 This means weights with large gradients get smaller updates and weights with small gradients get larger updates
 It also quickly decays all learning rates to zero  a problem for neural networks
 RMSProp (Leaky AdaGrad) is AdaGrad but with a decay to previous squared grads  similar to a running average
 Keeps step sizes relatively constant
 Common decay rate is $0.99$
 Doesn’t overshoot as much
 More computationally expensive
 Adam: combine the best of RMSProp and momentum
import numpy as np

first_moment = 0
second_moment = 0
for t in range(1, num_iterations):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx           # momentum-style running mean of gradients
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx    # RMSProp-style running mean of squared gradients
    first_unbias = first_moment / (1 - beta1 ** t)                   # bias correction for early steps
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
 There is a bias correction to prevent large early step sizes from instantly destroying initialization
 Typical hyperparameters are beta1 = 0.9 and beta2 = 0.999
 L2 regularization and weight decay are the same when using SGD (with momentum), but different for Adam, AdaGrad, RMSProp
 AdamW: the goto optimizer
 Adam with decoupled weight decay and $L_2$ regularization
 Allows user to choose if they want weights to be a part of the second moment or not
 There are second-order optimization techniques where you also use the Hessian (curvature), stepping by the negative inverse Hessian times the gradient
 This is typically intractable due to the number of parameters
 AdaGrad is actually a special case of a second-order optimization technique where we assume the Hessian is diagonal
 Second-order optimization is not typically used in practice
Lecture 8  April 20
 Learning rate decay: scaling down the learning rate over time
 Necessary to decrease loss beyond a certain point
 Hyperparameter choice is extremely important
 There are many learning rate schedulers:
 “Step” down the learning rate after a fixed number of epochs
 Leads to massive drops in loss followed by plateaus repeatedly
 Cosine learning rate decay: $\alpha_t = \frac{1}{2}\alpha_0(1+\cos(\frac{t\pi}{T}))$ (see the sketch after this list)
 $\alpha_0$: initial learning rate, $\alpha_t$ learning rate at step $t$, $T$ total number of steps
 Constant decrease in loss over time
 Linear learning rate decay
 Inverse square root decay
 Constant learning rate decay
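 A sketch of the cosine schedule from the formula above (the initial learning rate and horizon $T$ are illustrative assumptions):
import math

def cosine_lr(t, alpha_0=1e-3, T=10000):
    # alpha_t = 0.5 * alpha_0 * (1 + cos(t * pi / T))
    return 0.5 * alpha_0 * (1 + math.cos(t * math.pi / T))

print(cosine_lr(0), cosine_lr(5000), cosine_lr(10000))   # alpha_0, alpha_0 / 2, ~0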
 Learning rate warmup: spending first iterations increasing learning rate
 A large initial learning rate will cause our weight initialization to blow up
 Linearly increase learning rate over about 5 epochs (~5000 iterations)
 If you increase batch size by $N$, also increase learning rate by $N$
 Validation data is a good way of paying attention to the model
 Early stopping is ending model training when validation loss plateaus (or val accuracy stops improving)
 Often for ~5 epochs
 Training multiple models and then averaging results is a model ensemble:
 Exhibits roughly 2% better performance in practice
 Different models overfit to different parts of the dataset, it is averaged out
 Regularization techniques previously covered: L2, L1, weight decay
 Dropout: on every forward pass, randomly set activations (not parameters) to zero, keeping each unit with probability $p$
 Increases redundancy across the entire network, prevents some overfitting
 Common value of $p = 0.5$
 An interpretation of dropout is an ensemble of models with shared parameters
 At test time, multiply activations by $p$; or instead divide by $p$ during training (inverted dropout) so nothing changes at test time (see the sketch below)
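 A sketch of inverted dropout as described, treating $p$ as the keep probability:
import numpy as np

p = 0.5                                          # probability of keeping a unit active

def dropout_forward(h, train=True):
    if train:
        mask = (np.random.rand(*h.shape) < p) / p   # inverted dropout: scale at train time
        return h * mask
    return h                                     # test time: no change needed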
 A common pattern with regularization is adding randomness during training and averaging out during testing
 Seen in batch norm and dropout for example
 Data augmentation is a common form of regularization:
 Transforming the input in such a way that the label is the same
 Some example image augmentation techniques:
 Flipping an image horizontally
 Random crops and scales of an image
 Testing: average a fixed set of crops of a test image
 Change contrast or color of images
 Stretching or contorting images
 We are now training models to learn good data augmentation techniques
 DropConnect: randomly set weights (connections) to zero, instead of activations as in dropout
 Fractional pooling pools random regions of each image
 Testing: average predictions over multiple regions
 Stochastic depth: skip entire layers in a network using residual connections
 Cutout: randomly set parts of image to average image color
 Good for small datasets, not often used on large ones
 Mixup: blend both training images and training labels by an amount
 Why does it work? Who cares!
 CutMix: replace random crops of one image with another while combining labels
 Label smoothing: set target label to: $1 - \frac{K-1}{K}\epsilon$ and other labels to: $\frac{\epsilon}{K}$ for $K$ classes
 In practice:
 Use dropout for FC layers
 Using batchnorm is always good
 Try Cutout, MixUp, CutMix, Stochastic Depth, Label Smoothing for (a little) extra performance
 Grid search is okay for hyperparameter search
 Use log-linear values
 Random search is better:
 Use log-uniform randomness in a given range
 Likely because some hyperparameters matter more than others, so random search tries more distinct values of the important ones than grid search does
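 A sketch of random hyperparameter search with log-uniform sampling (the ranges and trial count are illustrative assumptions):
import numpy as np

def sample_hyperparams():
    # log-uniform sampling: uniform in the exponent, e.g. LR in [1e-4, 1e-1]
    lr = 10 ** np.random.uniform(-4, -1)
    weight_decay = 10 ** np.random.uniform(-5, -2)
    return lr, weight_decay

for _ in range(10):                  # each trial gets a distinct value of every hyperparameter
    lr, wd = sample_hyperparams()
    print(f"trial: lr={lr:.2e}, weight_decay={wd:.2e}")   # train briefly and record val accuracy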
 How to choose hyperparameters without Googlelevel compute:
 Check initial loss: sanity check, turn off weight decay
 Overfit on a small sample (5-10 minibatches)
 Make some architectural decisions
 Loss not decreasing? LR too low, bad weight init
 Loss NaN or Inf? LR too high, also bad weight init
 Search for learning rate for ~100 iterations
 Good LRs to think about: 1e-1 to 1e-4
 Coarse search: add other hyperparams and train a few models for ~1.5 epochs
 Good weight decay to try: 1e-4, 1e-5, 0
 Pick the best models from the coarse search (step 4), train for ~10-20 epochs without learning rate decay
 Look at learning curves and adjust
 Might need early stopping, adjust regularization, larger model, or keep going
 Flat start to loss graph means bad initialization
 When learning rate plateaus, add a scheduler (cosine!)
 Use a “command center” like Weights and Biases (or add your own!)
 Return to step 5!
 Linear classifiers are easy to visualize because they have only one filter
 Early layers in deep learning networks are also easy to visualize, these are filters
 Often edge detection
 The last layers are an embedded representation of the inputs
 The embeddings are much better for KNN
 Google search:
 Embeds search into a 100 or 200 dimensional vector, then runs KNN on a massive database
 Needs massive compute however
 We can use PCA or t-SNE to visualize the outputs of the last layer of a network
 Good to use KNN rather than simply labels so we can see what is happening under the hood
 We can visualize activation maps for CNNs
 To find what activates a neuron best, run many image patches from the dataset through the model and record which patches activate it most strongly
 Occlusion using patches is another way of visualizing which pixels matter, can be graphed for a cool image
 Shapley values are a way of using multiple patches to determine important areas of the image
 Compute backprop to the images to find activations for saliency maps
 Super good image segmentation (accidentally)
Lecture 9  April 25
 Saliency maps allow us to view biases within misclassifications
 Clamp gradients to only negative values
 We can also “backprop” a gray image to make an example image
 Gradient ascent and a heck of a lot of regularization
 Adversarial examples:
 Start from an arbitrary image
 Pick an arbitrary class
 Modify the image to improve class scores
 Repeat!
 Black box attacks:
 Adding random noise to images and models get confused…
 Supervised learning is insanely expensive:
 Labeling ImageNet’s 1.4M images would cost more than $175,000
 Unsupervised learning: model isn’t told what to predict
 SelfSupervised learning: model predicts some naturally occurring signal in the data
 The goal is to learn a cheap “pretext” task which learns important features
 Target is something that is easy to compute
 For example, start with autoencoder, then finetune
 Three main types of pretext tasks:
 Generative
 Autoencoders, GANs
 Discriminative
 Contrastive
 Multimodal
 Input video, output audio
 Generative
 Pretext task performance is irrelevant
 Often just toss a linear classifier on the back of the encoder
 Generative supervised learning:
 Generate some data from an example
 Computers (NNs) are not rotation-invariant:
 Allows us to predict rotation as a pretext task
 The learned attentions are similar to what supervised learning finds
 Predict relative patch locations:
 Break image into 3x3 grid, pass in center square and an outer square and predict location
 CNNs are size invariant
 Jigsaw puzzle:
 Reorder patches according to correct permutation
 Inpainting:
 Pass in image with missing patch, predict the missing patch
 Adding adversarial loss increases image recreation quality
 Image coloring:
 Use “LAB” coloring, pass in L (grayscale) and predict AB (color)
 Splitbrain autoencoder:
 Predict color from light, predict light from color
 Video coloring:
 Given colored start frame and grayscale images from then on, predict the colors
 Uses attention mechanism
 Contrastive learning:
 Create many different examples from an original example
 These examples are going to be in the same semantic space
 Learn which examples are closer or further away from each other
Lecture 10  April 27
 Contrastive learning:
 Get a reference example $x$
 Create transformed or augmented examples from $x$, called $x^+$
 Label all other examples $x^-$
 Maximize $score(f(x), f(x^+))$ and minimize $score(f(x), f(x^-))$
 Loss function is derived from softmax
 “Every single instance is its own class”
 Uses cosine similarity: $\frac{u^\top v}{\|u\| \|v\|}$
 Generate positive samples through simple data augmentation
 SimCLR Framework:
 Make two simple transformations from the example
 Run each transformed example through a NN to get representation
 Run each representation through a simple linear classifier or MLP
 Maximize the cosine similarity of those outputs
 Example transformations for images:
 Random cropping
 Random color distortion
 Gaussian blur
 Create a minibatch of $2N$ augmented images (alternating example and transformed view); the pairwise similarity matrix is $2N \times 2N$:
 Run it through the model
 Take the cosine similarity of the matrix with itself
 The $(2k, 2k+1)$ and $(2k+1, 2k)$ scores should be positive, everything else should be negative
 Diagonal will always be 1
 See slides for illustration
 SimCLR works extremely well with very large batch sizes (64,000+)
 We don’t want to directly expose the representation to the loss function, so we use a MLP head
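 A minimal sketch of the contrastive loss implied by the $2N \times 2N$ similarity matrix above, assuming the batch is ordered so rows $2k$ and $2k+1$ are two views of the same example (the temperature is an illustrative assumption):
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.1):
    # z: (2N, D) embeddings, ordered so rows 2k and 2k+1 are two views of example k
    z = F.normalize(z, dim=1)                       # cosine similarity = dot product of unit vectors
    sim = z @ z.t() / temperature                   # (2N, 2N) similarity matrix
    sim.fill_diagonal_(float("-inf"))               # an example is not its own positive
    N2 = z.shape[0]
    targets = torch.arange(N2) ^ 1                  # positive for row 2k is row 2k+1, and vice versa
    return F.cross_entropy(sim, targets)            # softmax over all other examples in the batch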
 MoCo: Momentum Contrastive Learning
 Keep a running queue of keys (negative examples) for all images in the batch
 If we have 1000 examples and 2000 negatives, it is $1000 \cdot 2000$
 Update the encoder only through the queries (reference images)
 Makes the momentum encoder be much less computationally expensive
 We update the momentum encoder via: $\theta_k \leftarrow m \theta_k + (1 - m) \theta_q$
 Slowly aligns the two networks
 Uses crossentropy loss
 Treats each negative as a class
 One correct label (this comes from the two parallel transformations) and the incorrect labels (negatives) are from the queue
 These don’t need to be run in parallel or there could be a concatenation
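 A sketch of the momentum-encoder update $\theta_k \leftarrow m \theta_k + (1 - m) \theta_q$ (the encoders and $m = 0.999$ are illustrative assumptions):
import torch
import torch.nn as nn

def momentum_update(encoder_q, encoder_k, m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q : the key encoder slowly tracks the query encoder
    with torch.no_grad():
        for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
            p_k.data.mul_(m).add_((1 - m) * p_q.data)

encoder_q = nn.Linear(128, 64)                      # query encoder (updated by backprop)
encoder_k = nn.Linear(128, 64)                      # key/momentum encoder (updated only here)
encoder_k.load_state_dict(encoder_q.state_dict())   # start aligned
momentum_update(encoder_q, encoder_k)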
 MoCo V2: Add a nonlinear projection head and better data augmentation
 DINO: do we need negatives? (very recent)
 Reformulates contrastive learning as a distillation problem
 Teacher model is like the momentum encoder
 Running average of the student model
 Sees a global view augmentation of the image
 Student model only sees cropped augmentation of the image
 DINO training tricks for the teacher:
 Center the data by adding a mean
 Sharpen the distribution towards a certain class  like a temperature
 Has the effect of making the teacher be a bit of a classifier
 DINO V2 released this week!
 Contrastive Predictive Coding (CPC)
 Contrastive: contrast between correct and incorrect sequences using contrastive learning
 Predictive: model must predict future patterns based on current
 Coding: model must learn useful feature vectors
 We give the model context, then a correct continuation sequence and many incorrect possible sequences
 First encode all images into vectors: $z_t = g_{enc}(x_t)$
 Summarize all context into a context code $c_t$ using an autoregressive model $g_{ar}$
 Compute InfoNCE loss between context $c_t$ and future code $z_{t+k}$ using scoring function:
 $s_k(z_{t+k}, c_t) = z^T_{t+k} W_k c_t$
 $W_k$ is a trainable matrix
 CLIP is a contrastive learning model
 We can sequentially process nonsequential data  think about how we observe an image
 Variable length sequences are tough to work with for basic NNs
 Recurrent Neural Networks contain an internal “state”, a summary or context of what’s been seen before
 RNN formulation: $h_t = f_W(h_{t1}, x_t)$ (repeat as necessary!)
 Typical autoregressive loss function  run everything through, make predictions, then sum/mean the losses from those predictions
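 A sketch of a vanilla RNN step and the autoregressive loss described above (dimensions, weight scale, and the toy vocabulary are assumptions):
import torch
import torch.nn.functional as F

# One vanilla RNN step: h_t = tanh(W_hh h_{t-1} + W_xh x_t); y_t = W_hy h_t
def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    h_t = torch.tanh(x_t @ W_xh + h_prev @ W_hh)
    y_t = h_t @ W_hy
    return h_t, y_t

D, H, V, T = 32, 64, 100, 10                     # input dim, hidden dim, vocab size, sequence length
W_xh, W_hh, W_hy = torch.randn(D, H) * 0.01, torch.randn(H, H) * 0.01, torch.randn(H, V) * 0.01
x = torch.randn(T, D)
targets = torch.randint(V, (T,))

h = torch.zeros(H)                               # hidden state commonly initialized to zero
loss = 0.0
for t in range(T):                               # autoregressive loss: sum/mean the per-step losses
    h, y = rnn_step(x[t], h, W_xh, W_hh, W_hy)
    loss = loss + F.cross_entropy(y.unsqueeze(0), targets[t].unsqueeze(0))
loss = loss / T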
Lecture 11  May 2
 Vanilla RNN learns three weight matrices:
 One transforms the input
 One transforms the previous context state (sum these two, then apply tanh)
 One transforms the context state to the output
 Hidden representation is commonly initialized to zero
 One could learn the initial hidden state, but it’s not necessary
 Sequence length is an assumption we often have to make during training
 Encoderdecoder architecture for sequences encodes sequence into a representation, then decodes into a sequence from that representation
 Think language translation, original attention paper
 One hot vector is a vector of all zeros and a one corresponding to a single class
 We use an embedding layer  computationally inexpensive (indexing), but keeps gradient flowing!
 Test time  autoregressive model (I wrote about this!), similar to a phone’s text autocomplete
 We truncate sequences, these act as minibatches
 We get surprising emergent behaviors from sequence modeling
 RNN advantages:
 Can process sequential information
 Can use information from many steps back
 Symmetrical processes for each step
 RNN disadvantages:
 Recurrent computation is slow
 Information is lost over a sequence in practice
 Exploding gradients :(
 Image classification: combine RNNs and CNNs
 Take a CNN and remove the classification head
 Use the image representation as the first hidden representation in an RNN
 Repeatedly sample tokens until the <END> token is produced
 Question answering: use RNN and CNN to generate representations, learn a compression, then softmax across the language
 Agents can learn instructions and actions to take in a language or image based environment (same principles as before)
 Multilayer RNNs: stacking layer weights and adding depth
 Vanilla RNN gradient flow: look up the derivation!
 Gradients will vanish as the tanh squishes the gradient for each step
 Even without tanh, this problem is repeated  gradients will explode or vanish
 Note: normalization doesn’t work well with RNNs, active field of research
 To combat exploding gradients we can use gradient “clipping”, scaling the gradient if its norm is too large
 However, there is no good solution for vanishing gradients
 LSTMs (Long Short Term Memory) help solve the problems within RNNs
 LSTMs keep track of two values: a cell memory and the next hidden representation
 Long and short term memory!
 Usually both initialized to zero
 LSTMs produce four gate vectors (a $4h$-dimensional output) from the hidden state $h$ and the input vector from below $x$
 This can be done in parallel through matrix multiplication
 Three of these outputs are passed through a sigmoid nonlinearity, the fourth is passed through a tanh
 Cell memory: $c_t = f \cdot c_{t-1} + i \cdot g$, $h_t = o \cdot \tanh(c_t)$
 The “info gate” (output of tanh) determines how much to write to cell: $g$
 The “input gate” $i$ determines whether to write to cell
 The “forget gate” $f$ determines how much to forget or erase from cell
 The “output gate” $o$ determines how much to reveal cell at a timestep
 LSTMs create a “highway” for gradients to flow back over time through the cell memory
 LSTMs preserve information better over time, however there’s no guarantee
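 A sketch of one LSTM step using the gates above (biases are omitted and sizes are illustrative assumptions):
import torch

def lstm_step(x, h_prev, c_prev, W):
    # W maps [h_prev; x] to 4h pre-activations: input (i), forget (f), output (o), info (g) gates
    gates = torch.cat([h_prev, x]) @ W               # shape (4H,)
    H = h_prev.shape[0]
    i, f, o = torch.sigmoid(gates[:H]), torch.sigmoid(gates[H:2*H]), torch.sigmoid(gates[2*H:3*H])
    g = torch.tanh(gates[3*H:])
    c = f * c_prev + i * g                           # cell memory: c_t = f . c_{t-1} + i . g
    h = o * torch.tanh(c)                            # hidden state: h_t = o . tanh(c_t)
    return h, c

H, D = 64, 32
W = torch.randn(H + D, 4 * H) * 0.01
h, c = lstm_step(torch.randn(D), torch.zeros(H), torch.zeros(H), W)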
 Residual (type) connections are very popular and widespread throughout deep learning!
 Neural architecture search for LSTMlike architecture was popular in the past, but no more
 GRUs (Gated Recurrent Unit) are inspired by LSTMs
 Quite simple and common because of ease of training
 LSTMs were quite popular until this year; however, transformers have risen in popularity
 Seemingly a new transformer model every day, check out: visual transformer history
Lecture 12  May 9
 Recurrent image captioning is constrained by the size of the image representation vector
 Attention: use different context vector at different timesteps
 Use some function to get relationship scores within different elements in a vector
 Softmax to normalize the scores
 Use these scores to create output context vector (multiply by some value vector)
 This is just scaled dotproduct attention
 Also works for interpretability: allows for bias correction as well
 Attention Layer:
 We matmul a query and key to create scores
 Softmax the scores to normalize them
 Multiply the scores by the values, then sum
 Query, key, and value come from linear transformations
 Notes:
 We need to use maskedattention for sequence problems
 We need to inject positional encodings as attention layers are position-invariant
 Simply add positional vectors (often sinusoidal)
 We also scale the dot-product by dividing by $\sqrt{d}$
 Important for autoregressive problems
 We love torch.triu
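 A sketch of masked, scaled dot-product attention using torch.triu to build the causal mask (the query/key/value projections are omitted; shapes are illustrative assumptions):
import torch
import torch.nn.functional as F

def masked_attention(q, k, v):
    # q, k, v: (T, d). Scores = q k^T / sqrt(d); mask out future positions; softmax; weight the values
    d = q.shape[-1]
    scores = q @ k.t() / d ** 0.5
    future = torch.triu(torch.ones(q.shape[0], k.shape[0]), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))    # causal mask for autoregressive problems
    return F.softmax(scores, dim=-1) @ v

T, d = 5, 16
q = k = v = torch.randn(T, d)        # self-attention: queries, keys, values from the same input
out = masked_attention(q, k, v)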
 Self attention doesn’t require separate queries and values, they come from the same input
 Multihead self attention:
 Split the dimensionality up, use attention for each part, then concatenate
 One of the biggest issues: attention is an $n^2$ memory requirement
 Transformers: sequence to sequence model (encoderdecoder)
 Encoder:
 Made up of multiple “encoder blocks”
 Each encoder block has a multihead selfattention block (with residual connection), layernorm, MLP (with residual connection), then a second layer norm
 Decoder:
 Made up of multiple “decoder blocks”
 Masked multihead self attention (residual), layernorm, cross attention (with encoder output and residual), layernorm, MLP (residual), second layernorm
 CNNs are often replaced by transformers which split data up into patches
 Good if you have a lot of data, but small data you want to use CNNs
 Vision transformer (ViT)
 Image captioning can work now with purely transformerbased architectures
 Transformer (size) history: (see slides)
 Started with 12 layers, 8 heads, 65M params
Lecture 13  May 11
 LeNet, small network of convolutional layers
 We pool to downsample images
 AlexNet: bigger model
 Model split across two GPUs during training
 First use of ReLU
 Data augmentation
 Dropout 0.5
 7-model ensemble
 VGGNet: smaller filters, deeper network
 GoogLeNet: added multiple paths between layers (InceptionNet)
 Later, 1x1 convolutions were added to downscale filter dimensions (reduce computational costs)
 Also add auxiliary classification outputs earlier to keep gradient flow
 ResNet: add residual connections between layers
 Keeps gradients flowing
 Also allows the model to learn just the residual (the difference)
 All current networks start with a conv layer
 Dropout, Kaiming init
 ViT: Vision Transformer
 Add a convolution in the first layer to create patches
 Then just run through transformer blocks
 ViT needs more data
 Trained on a dataset called JFT-300M
 ViT performs worse than ResNet on 10M images
 Final layer is fine-tuned on ImageNet (1.5M)
 MLPMixer: all MLP architecture
 Full of Mixer Layers
 Mixer layer is layer norm, transpose to get patches, MLP, transpose to get channels, layernorm, second MLP
 ResNet improvements:
 Change normalization ordering
 Wider (more filters) networks
 ResNeXt: multiple paths, Inception-style
 DenseNet: add more residual pathways
 MobileNet: use channel downsampling 1x1 conv layers
 Neural Architecture Search: NAS
 EfficientNet: fast, accurate, small