# CSE 312

## Lecture 1 - March 27

• We will cover probability, combinatorics, and some applications
• It helps with algo analysis
• It is a building block for probability
• A set’s cardinality (size) is the number of distinct elements in it
• Sum Rule: if we want to choose an element between two sets with no overlapping elements, we have the sum of the cardinalities number of options
• Product Rule: when we have a sequential process of choosing elements, the number of possibilities is the sum of the number of options at each step
• This is equal to the size of the Cartesian Product of the sets
• Note that is works independent of which are the elements being chosen from, as long as the number of options is always the same at a step
• What is the size of the power set of $S$?
• $2^{\mid S \mid}$
• Like the product rule, as each element can be in or out of a subset

## Lecture 2 - March 29

• When parsing HW problems, look out for “sequence” and “distinct”
• Remember $0! = 1$
• K Permutation:
• The number of $k$ element sequences of distinct symbols from a universe of $n$ symbols is: $P(n, k) = n (n - 1) \dots (n - k + 1) = \frac{n!}{(n - k)!}$
• Said as “P n K”, “n permute k”, or “n pick k”
• Permutation vs subsets: does order matter?
• K Combination:
• The number of $k$ element subsets from a set of $n$ symbols is: $C(n, k) = \frac{P(n, k)}{k!} = \frac{n!}{k!(n - k)!}$
• Said as “n choose k” typically
• Also written as: $\binom{n}{k}$
• We can think of this as finding the number where ordering matters, then dividing by the number of possible orders for each subset
• We can go back and forth from combinations to permutations by dividing and multiplying by the number of possible orders within a subset (typically $\mid S \mid$)
• Path counting is really just combinations
• Overcounting is cool! (We just have to correct for it)
• e.g. anagrams of “SEATTLE”, we find all permutations, then divide by the product of the factorials of the number of each element
• divide by $2! \cdot 2!$
• Complementary Counting:
• We count all the complements of a set (everything that wouldn’t be valid), then subtract that from the cardinality of the universe!

## Lecture 3 - April 1

• Multinomial coefficient: $\binom{c}{a, b} \equiv \frac{c!}{a! \cdot b!}$
• Counting two ways can be a good way to learn more and understand!
• Symmetry of combinations: $\binom{n}{k} = \binom{n}{n - k}$
• Can prove using algebra
• Can prove using the “team-picking” idea: choosing who is on a team also indirectly chooses who is not on a team, and vice-versa
• Pascal’s Rule: $\binom{n}{k} = \binom{n - 1}{k - 1} + \binom{n - 1}{k}$
• Again, we can prove using algebra
• We can also prove using the idea of the summation of all teams that I’m on and teams that I’m not on
• “Focus on one element”
• Binomial Theorem: $(x + y)^n = \sum_{i = 0}^{n} \binom{n}{i} x^i y^{n-i}$
• We have $xy$’s and $\binom{n}{i}$ determines how many there are
• Useful when we need to plug in specific numbers for x and y
• Principle of Inclusion-Exclusion: when we are trying to get the size of a set which is the OR of multiple conditions, we can take the size of the union of the sets which satisfy at least one condition
• Sum rule for non-disjoint sets: $\mid A \cup B \mid = \mid A \mid + \mid B \mid - \mid A \cap B \mid$
• Sum rule for three non-disjoint sets follows similar logic, subtracting the intersection of each pair of sets, then adding the intersection of all sets
• Typically only use for two or three conditions (maybe four)

## Lecture 4 - April 3

• Pigeonhole Principle
• If there are more items then spots, we know that one spot has more than one item
• Strong Pigeonhole: If we have $n$ pigeons and $k$ pigeonholes, there is at least one pigeonhole with $\frac{k}{n}$ pigeons, rounded up
• Placing dividers (stars and bars):
• If we need to find the number of different groups comprised of different types of elements, use dividers
• For n size groups of k types: $\binom{n+k-1}{k-1}$
• Can think about it as a binary string permutation problem

## Lecture 5 - April 5

• Probability is a way of quantifying our uncertainty
• Sample space, $\Omega$, is all the possible outcomes of an experiment
• Single coin flip: $\Omega = \lbrace H, T \rbrace$
• An event, $E \subseteq \Omega$, is a subset of all possible outcomes
• A probability, $\mathbb{P}: \Omega \rightarrow [0, 1]$, is a function that maps an element of $\Omega$ to it’s likelihood of occurring
• Other notation: $Pr[\omega]$ or $P(\omega)$
• All probabilities must be between 0 and 1 and the sum of the probability for each element in the sample space must sum to 1
• Probability of an event should be the sum of the probabilities of the elements in the event
• A probability space is a pair of sample space and probability function
• Uniform Probability Space: all events have the same likelihood
• Events are “mutually exclusive” if they cannot happen simultaneously
• Axioms:
• Non-negativity: $\mathbb{P}(x) \geq 0$ for all $x$
• Normalization: $\sum_{x \in \Omega} \mathbb{P}(x) = 1$
• Countable additivity: If $E$ and $F$ are mutually exclusive, then: $\mathbb{P}(E \cup F) = \mathbb{P}(E) + \mathbb{P}(F)$
• Facts derived from axioms:
• Complementation: $\mathbb{P}(\bar{E}) = 1 - \mathbb{P}(E)$
• Monotonicity: if $E \subseteq F$, then $\mathbb{P}(E) \leq \mathbb{P}(F)$
• Inclusion-exclusion: $\mathbb{P}(E \cup F) = \mathbb{P}(E) + \mathbb{P}(F) - \mathbb{P}(E \cap F)$

## Lecture 6 - April 7

• Often our sample space will contain excess information. This won’t make our answer incorrect, but can lead to unnecessary work
• We use conditional probability when we have partial information
• It is a way to “restrict” the sample space
• $\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}$
• This allows us to update our probabilities
• “The probability of A given B is equal to the probability of A and B happening, divided by the probability of B”

## Lecture 7 - April 10

• Bayes’ Rule allows us to use conditional probabilities
• Bayes Rule:
• $\mathbb{P}(A \mid B) = \frac{\mathbb{P}(B \mid A) \mathbb{P}(A)}{\mathbb{P}(B)}$
• The law of total probability:
• $\mathbb{P}(S) = \mathbb{P}(S \mid G) \cdot \mathbb{P}(G) + \mathbb{P}(S \mid \bar{G}) \cdot \mathbb{P}(\bar{G})$
• More generally: $\sum_{\forall i}\mathbb{P}(E \mid A_i) \mathbb{P}(A_i)$
• The probability of an event happening is equal to the sum of the probability of the event happening given another event happening, multiplied by the probability of the other event
• A partition of the sample space is a family of subsets where each partition is distinct and they combine to be the entire sample space
• Humans are very bad at understanding very large or small numbers - past a certain amount we ignore magnitudes
• It is good to think of tests as an “update” to our priors, not a revelation of truth
• When we condition on an event, we still have probability spaces: $B$ and probability measures: $\mathbb{P}(\omega \mid B)$
• Do not condition on multiple events, only the intersection of them

## Lecture 8 - April 12

• (Statistical) independence is when the probabilities of two things don’t depend on each other:
• $\mathbb{P}(A \cap B) = \mathbb{P}(A) \cdot \mathbb{P}(B)$
• “Conditioning doesn’t make a difference”
• Chain Rule:
• $\mathbb{P}(A_1 \cap A_2 \cap \dots \cap A_n) = \mathbb{P}(A_n \mid A_1 \cap \dots \cap A_{n-1}) \cdot \mathbb{P}(A_{n-1} \mid A_1 \cap \dots \cap A_{n-2}) \dots \mathbb{P}(A_2 \mid A_1) \cdot \mathbb{P}(A_1)$
• We can find this from: $\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} \rightarrow \mathbb{P}(A \cap B) = \mathbb{P}(A \mid B) \cdot \mathbb{P}(B)$
• Conditional independence is independence of two events given another event

## Lecture 9 - April 14

• Medical terms for tests:
• $\mathbb{P}(D)$ is “prevalence”
• $\mathbb{P}(T \mid D)$ is “sensitivity”
• $\mathbb{P}(\bar{T} \mid \bar{D})$ is “specificity”
• ONE WEIRD TRICK:
• Think of these problems with a large population (where at least one for each chance)
• Intuition Trick: Bayes’ Factor:
• When you test positive, multiply prior by the Bayes’ Factor: $\frac{\text{sensitivity}}{\text{false positive rate}} = \frac{1 - FNR}{FPR}$
• Also called the “likelihood ratio”
• It is an estimate of how much I should update the prior by
• Better when the prior is quite low
• When there are overwhelming differences between groups, response error can drown out small groups

## Lecture 10 - April 17

• We often implicitly define the sample space:
• This commonly occurs when the size of the sample space is infinite
• When working with sample spaces with infinite size:
• We can use infinite sums of probabilities (and closed forms)
• We can use complement events
• Random Variables:
• It is any function that has domain $\Omega$ and outputs a real number
• $X(\omega)$ is a summary of $\omega$
• It doesn’t change the problem, but it can simplify it!
• One sample space can have many random variables
• We always use capital letters as random variables
• We commonly use lowercase variables as the values the random variable could take on
• Random variables are a function
• We do not use typical function notation, instead something like: $X = 2$
• The support (the range) is the set of values $X$ can take on
• The event the random variable takes on a value: $\lbrace \omega : X(\omega) = x \rbrace$
• The probability of that event is: $\mathbb{P}(X = x)$
• The function that gives us $\mathbb{P}(X = x)$ is the Probability Mass Function, or PMF
• This is written as: $p_X(x)$ or $f_X(x)$
• Think of the PMF as a piecewise function where values outside the support are zero because it must take in all real numbers as an input
• The Cumulative Distribution Function (CDF) gives the probability $X \leq x$
• Written as: $F_X(x) = \mathbb{P}(X \leq x)$

## Lecture 11 - April 19

• CDF will always be defined over all real numbers (we must support all of them)
• We will often use floor functions when we only want to “support” integers
• Typically has zero below the support and one above the support
• The “expectation” of a random variable $X$ is: $\mathbb{E}[X] = \sum_k k \cdot \mathbb{P}(X = k)$
• The “weighted average” of values $X$ could take on
• Weighted by the probability we actually see the value
• Think about it as the expectation of a drive in football:
• $X$ is the possible scores $0, 2, 3, 6, 7$
• We multiply each score by the likelihood that it happens, sum all of these for our expected score
• Not the most common outcome
• The expectation of the sum of two random variables is equal to the sum of the expectations for each variable
• Pairwise independece: for each pair in the set, they are independent
• Mutual independence: for each subset of events, the probability of the intersection of all of them is equal to the product of all individual probabilities of the events
• Stronger than pairwise independence
• Two random variables $X$ and $Y$ are independent if for all $k$ and $l$ (in the supports), $\mathbb{P}(X = k, Y = l) = \mathbb{P}(X = k) \cdot \mathbb{P}(Y = l)$
• Note that commas are often used instead of $\cap$ for random variables
• Pairwise independence is the intuition
• Mutual independence of random variables: $X_1, X_2, \dots, X_n$ are mutually independent if for all $x_1, x_2, \dots, x_n$ $\mathbb{P}(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = \mathbb{P}(X_1 = x_1) \cdot \mathbb{P}(X_2 = x_2) \cdot \dots \cdot \mathbb{P}(X_n = x_n)$
• We don’t need to check all subsets, but do need to check all possible values in the ranges
• While many equations might want all values (outside) the range of a random variable to be checked or included, because the probability of it is zero they often don’t need to be

## Lecture 12 - April 21

• Linearity of expectation: for any two random variables $X$ and $Y$, $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
• This extends to more than two random variables
• Simple summation proof
• $X$ and $Y$ do not need to be independent
• Constants are also fine (works like algebra intuition!)
• Indicator random variables are 1 if the event occurs and zero if it does not
• $\mathbf{1}[A]$
• The expectation of the indicator variable is the probability the event occurs
• How to compute complicated expectations:
• Decompose the random variable into the sum of simple random variables
• Apply linearity of expecation
• Compute the simple expectations

## Lecture 13 - April 24

• Variance is another one-number summary of a random variable
• We typically square values instead of using absolute values (think norms)
• Variance: $Var(X) = \sum_{\omega} \mathbb{P}(\omega) \cdot (X(\omega) - \mathbb{E}[X])^2$
• How “extreme” or “spread out” values are
• $= \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
• We should first find $\mathbb{E}[X]$
• If $X$ and $Y$ are independent:
• $Var(X + Y) = Var(X) + Var(Y)$
• $\mathbb{E}[X \cdot Y] = \mathbb{E}[X] \cdot \mathbb{E}[Y]$
• Squaring an indicator variable does nothing to it
• Squaring random variables: we only square the value, not probability
• $Var(X + c) = Var(X)$
• Think of this as just shifting the distribution
• $Var(aX) = a^2 Var(X)$
• Stretching or compressing random variable

## Lecture 14 - April 26

• Introduces the random variable zoo: a list of facts about random variables (and distributions)
• Use a reference sheet or wikipedia!
• Memoryless - future outcomes aren’t influenced by past outcomes
• “Independent and identially distributed”: “iid”
• Bernoulli: one trial with a probability $p$ of success
• Binomial: how many success in $n$ independent trials with probability $p$ of success?
• Geometric: how many independent trials until the first success?
• Uniform: integer in some range with each value equally likely
• We need smallest and largest possible value for the range
• Negative Binomial: how many independent trials with probability $p$ until we have $r$ successes?

## Lecture 15 - April 28

• Poisson Distribution: we know the average over an interval of time, we can’t use each individual possible source
• (Kinda) requires each event to be independent
• We use Poisson because we don’t have good ideas of what the random variables are
• This is a model - not perfect (most real-world events are somewhat dependent), but it is useful!
• The PMF involves the Taylor Series for $e^x$
• It is a way of using the limit as the number of experiments approach infinity, but the mean stays constant
• Hypergeometric: drawing without probability from an urn
• Continuous random variables:
• We need continuous probability spaces and continuous random variables to represent uncountably-infinite sample spaces
• We use the probability density function for continuous random variables:
• It is a way of comparing probability of being near different events
• $f_x(k) \geq 0$
• $\int_{-\infty}^{\infty} f_x(k) dk = 1$
• We use this because we need the PDF to work for events
• Integrating is analogous to summation - continuous vs discrete values:
• $\mathbb{P}(a \leq X \leq B) = c$
• $\int_{a}^{b} f_x(k) dk = c$
• Impossible events still have probability 0, but some probability 0 events are still possible for continuous probability spaces

## Lecture 16 - May 1

• Continuous random variables require Probability Density Function:
• Main difference is that we use events vs values
• Every single value is equal to 0 (typically)
• The PDF is the number that when integrated over gives the probability of an event
• Comparing $f_X(k)$ and $f_X(l)$ gives relative chances of $X$ being near $k$ vs $l$
• $\mathbb{P}(a \leq X \leq b) = \int_{a}^{b} f_X(z)dz$
• Sometimes the density will be greater than 1
• Cumulative density function is analogous when using continuous vs discrete random variables:
• $F_X(k) = \mathbb{P}(X \leq k) = \int_{- \infty}^{k} f_X(z)dz$
• CDF to PDF by taking the derivative of CDF
• Undos the integral
• Vice-versa works as well
• General pattern: summation for discrete random variables becomes integration for continuous
• $\mathbb{E}[X] = \int_{- \infty}^{\infty} X(z) \cdot f_X(z) dz$
• Expectation of a function of a random variable:
• $\mathbb{E}[g(X)] = \int_{- \infty}^{\infty} g(X(z)) \cdot f_X(z)dz$
• “Law of the Unconcious Statistician” ~math nerds
• Linearity of Expectations still works!
• Variance: $Var(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \int_{- \infty}^{\infty} f_x(k)(X(k) - \mathbb{E}[X])^2 dk$
• Expectation of a uniform random varaible between $a$ and $b$ is the same, $\frac{a+b}{2}$
• Expectation of the uniform random variable squared is: $\frac{a^2 + ab + b^2}{3}$
• Variance of the variable is: $\frac{(b - a)^2}{12}$

## Lecture 17 - May 5

• Exponential random variable: like geometric random variable, but continuous time
• “Waiting doesn’t make the event happen sooner” - meomoryless
• $f_X(k) = \lambda e^{- \lambda k}$
• $\mathbb{E}[X] = \frac{1}{\lambda}$
• Same as taking a Poisson random variable and asking “How long until the next event?”
• $F_X(t) = \mathbb{P}(X \leq t) = 1 - e^{- \lambda t}$
• Expectation is: $\frac{1}{\lambda}$
• Variance is: $\frac{1}{\lambda^2}$
• Gaussian (normal) random variable:
• Mean: $\mu$
• Variance: $\sigma^2$
• $f_X(x) = \frac{1}{\sigma \sqrt{2 \pi}} \cdot e^{- \frac{(x - \mu)^2}{2 \sigma^2}}$
• $F_X(k)$ has no nice closed form: use the table
• $\mathbb{E}[X] = \mu$
• This is the bell curve
• Scaling or adding to a normal variable results in another normal variable
• To normalize normal variable $X$: $Y = \frac{X - \mu}{\sigma}$
• We normalize because the CDF for normal random variables can be super rough
• We convert to a “standard normal”, round the “z-score” to the hundredths, look up on the table

## Lecture 18 - May 8

• The sum of any independent random variables approaches the normal distribution
• Let $X_1, X_2, \dots , X_n$ i.i.d. random variables with mean $\mu$ and variance $\sigma ^2$
• Let $Y_n = \frac{X_1 + X_2 + \dots + X_n - n \mu}{\sigma \sqrt{n}}$
• Then as $n \rightarrow \infty$, $Y_n$ converges to the CDF of $\mathbb{N}(0, 1)$
• Only equal in the limit!
• Gaussians often occur in the real world because the random variable is a combination of many independent factors
• We can often use the Gaussian CDF in practice instead of complicated independent variables
• Note that $\Phi$ represents the CDF of the Gaussian with mean 0 and variance 1
• The z-table only has positive values: we use $\Phi(-x) = 1 - \Phi(x)$
• When we have a discrete variable we’re approximating with a continuous random variable, we must do a continuity correction:
• We find all values that would round to the correct discrete value
• Other corrections are typically not worth the effort
• CLT Usage Outline:
1. Write down the event you’re interested in, in terms of the sum of random variables
2. Apply continiuty correction if RVs are discrete
3. Normalize RB to have mean 0 and standard deviation 1
4. Replace RV with $\mathbb{N}(0,1)$
5. Write event in terms of $\Phi$
6. Look up in table
• Polling: $\mathbb{P}(\vert X - \mathbb{E}[X] \vert \geq) s) \leq \epsilon$
• How often $\epsilon$ is our polling outside an acceptable margin $s$

## Lecture 19 - May 10

• If two RVs are independent, the probability they both equal some values is equal to their individual products
• We can use LTP to talk about only one variable (when they’re dependent)
• This is the marginal distribution, as we marginalized a RV
• The “marginal” for $X$ is where we marginalize all other RVs
• Joint expectation is still the sum of the probabilities times the value
• This is often written as a function
• Conditional expectations are also intuitively defined
• LOE still work!
• Law of Total Expectation: basically the same as LTP, but with expectations
• $\mathbb{E} = \sum_{i=1}^n \mathbb{E}[X \vert A_i] \mathbb{E}(A_i)$
• Covariance: how much $X$ and $Y$ change together
• $\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E})(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X] \mathbb{E}[Y]$
• We often only care about if $\text{Cov}$ is positive or negative
• $\text{Variance}(X + Y) = \text{Variance}(X) + \text{Variance}(Y) + 2 \text{Cov}(X, Y)$

## Lecture 20 - May 12

• Two methods of polling: sampling with vs without replacement
• We will do the math for sampling with replacement even if this isn’t used in practice
• We don’t do continuity corrections when we’re polling
• Accuracy of polling isn’t determined by the total population amount, only the number of people polled
• Works for idealized polling scenarios with large-ish populations
• This is due to the way we use the two methods of polling
• The “margin of error” is an intuitive way of measuring variance
• “If I performed this poll repeatedly, 95% of the time the correct value will be within +/- the margin of error”
• Find the number of people necessary to guarantee a margin of error
• Handling $\sqrt{p(1-p)}$, assume worst case scenario for $p = 0.5$
• We use the z-table in reverse, find the smallest input that gives the correct output

## Lecture 21 - May 15

• We need a way to calculate the bounds with certainty
• Tail bound: “bounds the size of the tail”
• Markov Inequality:
• Let $X$ be a random variable supported only on non-negative numbers:
• $\mathbb{P}(X \geq t) \leq \frac{\mathbb{E}[X]}{t}$
• $\mathbb{P}(X \geq k \mathbb{E}[X]) \leq \frac{1}{k}$
• Non-negative random variable and $k$ or $t$ is positive