We will cover probability, combinatorics, and some applications
Counting is helpful because:
It helps with algo analysis
It is a building block for probability
A set’s cardinality (size) is the number of distinct elements in it
Sum Rule: if we choose one element from either of two disjoint (non-overlapping) sets, the number of options is the sum of the two cardinalities
Product Rule: when we have a sequential process of choosing elements, the number of possibilities is the product of the number of options at each step
This is equal to the size of the Cartesian Product of the sets
Note that this works regardless of which specific elements are available at each step, as long as the number of options at a given step is always the same
What is the size of the power set of $S$?
$2^{\mid S \mid}$
Like the product rule, as each element can be in or out of a subset
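A quick brute-force check of the sum rule, the product rule, and the power-set count, using small made-up example sets:

```python
# Quick brute-force check of the counting rules above (small made-up sets).
from itertools import product

S = {"a", "b", "c", "d"}

# Product rule view of the power set: each element is independently "in" or
# "out", so there are 2 * 2 * ... * 2 = 2^|S| subsets.
choices = list(product([True, False], repeat=len(S)))
assert len(choices) == 2 ** len(S)  # 16 subsets when |S| = 4

# Sum rule: choosing one element from either of two disjoint sets.
A, B = {1, 2, 3}, {10, 20}
assert len(A | B) == len(A) + len(B)  # 5 total options
```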
Lecture 2 - March 29
When parsing HW problems, look out for “sequence” and “distinct”
Remember $0! = 1$
K Permutation:
The number of $k$ element sequences of distinct symbols from a universe of $n$ symbols is: $P(n, k) = n (n - 1) \dots (n - k + 1) = \frac{n!}{(n - k)!}$
Said as “P n K”, “n permute k”, or “n pick k”
Permutation vs subsets: does order matter?
K Combination:
The number of $k$ element subsets from a set of $n$ symbols is: $C(n, k) = \frac{P(n, k)}{k!} = \frac{n!}{k!(n - k)!}$
Said as “n choose k” typically
Also written as: $\binom{n}{k}$
We can think of this as finding the number where ordering matters, then dividing by the number of possible orders for each subset
We can go back and forth between combinations and permutations by dividing and multiplying by the number of possible orderings within a subset (typically $k!$ for subsets of size $k$); see the sketch below
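A minimal sketch of that relationship using Python's built-in `math.perm` and `math.comb`, with arbitrary example values of $n$ and $k$:

```python
# Sketch relating permutations and combinations (n and k are just example values).
import math

n, k = 7, 3
p = math.perm(n, k)   # n! / (n - k)! = 210 ordered length-k sequences
c = math.comb(n, k)   # n! / (k! (n - k)!) = 35 unordered k-element subsets

# Dividing out the k! orderings of each subset turns permutations into combinations.
assert p == c * math.factorial(k)
print(p, c)  # 210 35
```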
Path counting is really just combinations
Overcounting is cool! (We just have to correct for it)
e.g. anagrams of “SEATTLE”: we count all permutations of the 7 letters, then divide by the product of the factorials of the counts of each repeated letter
Here E and T each appear twice, so we divide by $2! \cdot 2!$
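Worked out, that gives $\frac{7!}{2! \cdot 2!} = \frac{5040}{4} = 1260$ anagrams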
Complementary Counting:
We count the complement of the set we want (everything that wouldn’t be valid), then subtract that from the cardinality of the universe!
Binomial Theorem: $(x + y)^n = \sum_{i=0}^{n} \binom{n}{i} x^{n - i} y^{i}$
Each term is a product of $x$’s and $y$’s, and $\binom{n}{i}$ determines how many of each term there are
Useful when we need to plug in specific numbers for $x$ and $y$
Principle of Inclusion-Exclusion: when we want the size of a set which is the OR of multiple conditions, we take the size of the union of the sets that satisfy each individual condition, correcting for outcomes counted more than once
Sum rule for non-disjoint sets: $\mid A \cup B \mid = \mid A \mid + \mid B \mid - \mid A \cap B \mid$
Sum rule for three non-disjoint sets follows similar logic, subtracting the intersection of each pair of sets, then adding the intersection of all sets
Typically only use for two or three conditions (maybe four)
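Written out for three sets, that pattern is: $\mid A \cup B \cup C \mid = \mid A \mid + \mid B \mid + \mid C \mid - \mid A \cap B \mid - \mid A \cap C \mid - \mid B \cap C \mid + \mid A \cap B \cap C \mid$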
Lecture 4 - April 3
Pigeonhole Principle
If there are more items than spots, we know that some spot has more than one item
Strong Pigeonhole: If we have $n$ pigeons and $k$ pigeonholes, there is at least one pigeonhole with at least $\frac{n}{k}$ pigeons, rounded up
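A quick numeric check of the strong version (numbers chosen just for illustration): $n = 25$ pigeons in $k = 6$ pigeonholes means some hole holds at least $\lceil \frac{25}{6} \rceil = 5$ pigeons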
Placing dividers (stars and bars):
If we need to count the number of ways to form a group out of several types of interchangeable elements, use dividers
For groups of size $n$ drawn from $k$ types: $\binom{n+k-1}{k-1}$
Can think about it as a binary string permutation problem
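A sketch checking the dividers formula against brute-force enumeration, with arbitrary small values of $n$ and $k$:

```python
# Sketch checking stars and bars against brute force (small illustrative n, k).
import math
from itertools import product

n, k = 6, 3  # groups of size 6 drawn from 3 types

# Brute force: count solutions to x1 + x2 + x3 = n with each xi >= 0.
brute = sum(1 for xs in product(range(n + 1), repeat=k) if sum(xs) == n)

formula = math.comb(n + k - 1, k - 1)
assert brute == formula  # both are 28 for n = 6, k = 3
print(formula)
```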
Lecture 5 - April 5
Probability is a way of quantifying our uncertainty
Sample space, $\Omega$, is all the possible outcomes of an experiment
Single coin flip: $\Omega = \lbrace H, T \rbrace$
An event, $E \subseteq \Omega$, is a subset of all possible outcomes
A probability, $\mathbb{P}: \Omega \rightarrow [0, 1]$, is a function that maps each element of $\Omega$ to its likelihood of occurring
Other notation: $Pr[\omega]$ or $P(\omega)$
All probabilities must be between 0 and 1 and the sum of the probability for each element in the sample space must sum to 1
Probability of an event should be the sum of the probabilities of the elements in the event
A probability space is a pair of sample space and probability function
Uniform Probability Space: all outcomes have the same likelihood
Events are “mutually exclusive” if they cannot happen simultaneously
Axioms:
Non-negativity: $\mathbb{P}(\omega) \geq 0$ for all $\omega \in \Omega$
Law of Total Probability: $\mathbb{P}(E) = \sum_{\forall i}\mathbb{P}(E \mid A_i)\, \mathbb{P}(A_i)$, where the $A_i$ partition the sample space
The probability of an event is the sum, over the pieces of the partition, of the probability of the event given that piece, multiplied by the probability of that piece
A partition of the sample space is a family of subsets that are pairwise disjoint and together cover the entire sample space
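A minimal sketch of this computation with made-up numbers for a three-piece partition:

```python
# Law of total probability with made-up numbers: a partition A1, A2, A3 of the
# sample space, plus P(E | Ai) for each piece.
p_A = [0.5, 0.3, 0.2]             # P(A1), P(A2), P(A3); must sum to 1
p_E_given_A = [0.10, 0.40, 0.80]  # P(E | A1), P(E | A2), P(E | A3)

p_E = sum(pe * pa for pe, pa in zip(p_E_given_A, p_A))
print(p_E)  # 0.33
```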
Humans are very bad at understanding very large or small numbers - past a certain amount we ignore magnitudes
It is good to think of tests as an “update” to our priors, not a revelation of truth
When we condition on an event $B$, we still have a valid probability space: the sample space becomes $B$ and the probability measure becomes $\mathbb{P}(\omega \mid B)$
Conditioning on multiple events means conditioning on their intersection
Lecture 8 - April 12
(Statistical) independence is when the probabilities of two things don’t depend on each other:
$\mathbb{P}(A \cap B) = \mathbb{P}(A) \cdot \mathbb{P}(B)$
We can see where this comes from: $\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} \rightarrow \mathbb{P}(A \cap B) = \mathbb{P}(A \mid B) \cdot \mathbb{P}(B)$, so when $\mathbb{P}(B) > 0$, independence is the same as $\mathbb{P}(A \mid B) = \mathbb{P}(A)$
Conditional independence is independence of two events given another event: $\mathbb{P}(A \cap B \mid C) = \mathbb{P}(A \mid C) \cdot \mathbb{P}(B \mid C)$
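A concrete check of the independence definition on an example of my own (two fair dice, not taken from the lecture):

```python
# Check independence of A = "first die is even" and B = "sum is 7" on two fair dice.
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def prob(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] % 2 == 0
B = lambda w: w[0] + w[1] == 7

p_A, p_B = prob(A), prob(B)
p_AB = prob(lambda w: A(w) and B(w))
print(p_A, p_B, p_AB, p_AB == p_A * p_B)  # 1/2 1/6 1/12 True
```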
Lecture 9 - April 14
Medical terms for tests:
$\mathbb{P}(D)$ is “prevalence”
$\mathbb{P}(T \mid D)$ is “sensitivity”
$\mathbb{P}(\bar{T} \mid \bar{D})$ is “specificity”
ONE WEIRD TRICK:
Think of these problems with a large, concrete population (large enough that each possible outcome shows up at least once)
Intuition Trick: Bayes’ Factor:
When you test positive, multiply prior by the Bayes’ Factor: $\frac{\text{sensitivity}}{\text{false positive rate}} = \frac{1 - FNR}{FPR}$
Also called the “likelihood ratio”
It is an estimate of how much I should update the prior by
The approximation works best when the prior is quite low
When there are overwhelming size differences between groups, even a small response error rate can drown out the small groups
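A sketch comparing exact Bayes' rule with the Bayes'-factor shortcut above, using made-up values for prevalence, sensitivity, and specificity:

```python
# Compare the exact posterior with the Bayes'-factor shortcut (hypothetical numbers).
prior = 0.001          # P(D), prevalence (made up)
sensitivity = 0.99     # P(T | D)
specificity = 0.95     # P(not T | not D), so FPR = 0.05

fpr = 1 - specificity

# Exact posterior via Bayes' rule / law of total probability.
p_T = sensitivity * prior + fpr * (1 - prior)
posterior = sensitivity * prior / p_T

# Shortcut: multiply the prior by the Bayes factor (good when the prior is small).
bayes_factor = sensitivity / fpr
approx = prior * bayes_factor

print(round(posterior, 4), round(approx, 4))  # ~0.0194 vs 0.0198
```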
Lecture 10 - April 17
We often implicitly define the sample space:
This commonly occurs when the size of the sample space is infinite
When working with sample spaces with infinite size:
We can use infinite sums of probabilities (and closed forms)
We can use complement events
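As a standard illustration of the first point (an example added here, not spelled out above): flipping a coin with heads probability $p$ until the first head has infinitely many outcomes, but the probabilities form a geometric series with a closed form, $\sum_{k=1}^{\infty} (1 - p)^{k - 1} p = p \cdot \frac{1}{1 - (1 - p)} = 1$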
Random Variables:
It is any function that has domain $\Omega$ and outputs a real number, i.e., $X: \Omega \rightarrow \mathbb{R}$
$X(\omega)$ is a summary of $\omega$
It doesn’t change the problem, but it can simplify it!
One sample space can have many random variables
We always use capital letters as random variables
We commonly use lowercase variables as the values the random variable could take on
Random variables are a function
We do not use typical function notation, instead something like: $X = 2$
The support (the range) is the set of values $X$ can take on
The event the random variable takes on a value: $\lbrace \omega : X(\omega) = x \rbrace$
The probability of that event is: $\mathbb{P}(X = x)$
The function that gives us $\mathbb{P}(X = x)$ is the Probability Mass Function, or PMF
This is written as: $p_X(x)$ or $f_X(x)$
Think of the PMF as a piecewise function where values outside the support are zero because it must take in all real numbers as an input
The Cumulative Distribution Function (CDF) gives the probability $X \leq x$
Written as: $F_X(x) = \mathbb{P}(X \leq x)$
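A small sketch building the PMF and CDF for an example random variable, $X =$ the sum of two fair dice (an example chosen here for illustration):

```python
# PMF and CDF for X = sum of two fair six-sided dice.
from fractions import Fraction
from itertools import product
from collections import Counter

outcomes = list(product(range(1, 7), repeat=2))   # the sample space Omega
counts = Counter(a + b for a, b in outcomes)      # X(omega) = a + b

def pmf(x):
    # p_X(x) = P(X = x); zero outside the support {2, ..., 12}
    return Fraction(counts.get(x, 0), len(outcomes))

def cdf(x):
    # F_X(x) = P(X <= x), defined for any real x (floor handles non-integers)
    return sum(pmf(k) for k in range(2, int(x) + 1)) if x >= 2 else Fraction(0)

print(pmf(7))  # 1/6
print(cdf(3))  # 1/12  (P(X = 2) + P(X = 3) = 1/36 + 2/36)
```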
Lecture 11 - April 19
CDF will always be defined over all real numbers (we must support all of them)
We will often use floor functions when we only want to “support” integers
Typically it is zero below the support and one above the support
The “expectation” of a random variable $X$ is: $\mathbb{E}[X] = \sum_k k \cdot \mathbb{P}(X = k)$
The “weighted average” of values $X$ could take on
Weighted by the probability we actually see the value
Think about it as the expectation of a drive in football:
$X$ is the possible scores $0, 2, 3, 6, 7$
We multiply each score by the likelihood that it happens, then sum all of these for our expected score
The expectation is not necessarily the most common outcome (it may not even be a possible value)
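A sketch of that drive example with made-up probabilities for each outcome (the notes don't specify them):

```python
# Expected points per drive, with hypothetical probabilities for each outcome.
scores = [0, 2, 3, 6, 7]
probs = [0.60, 0.02, 0.15, 0.03, 0.20]  # made up; must sum to 1

assert abs(sum(probs) - 1) < 1e-9
expected = sum(k * p for k, p in zip(scores, probs))
print(expected)  # 2.07 expected points per drive with these numbers
```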
The expectation of the sum of two random variables is equal to the sum of the expectations for each variable
Pairwise independence: every pair of events in the set is independent
Mutual independence: for every subset of the events, the probability of the intersection of all of them is equal to the product of the individual probabilities of those events
Stronger than pairwise independence
Two random variables $X$ and $Y$ are independent if for all $k$ and $l$ (in the supports), $\mathbb{P}(X = k, Y = l) = \mathbb{P}(X = k) \cdot \mathbb{P}(Y = l)$
Note that commas are often used instead of $\cap$ for random variables
Pairwise independence of events gives the intuition here
Mutual independence of random variables: $X_1, X_2, \dots, X_n$ are mutually independent if for all $x_1, x_2, \dots, x_n$, $\mathbb{P}(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = \mathbb{P}(X_1 = x_1) \cdot \mathbb{P}(X_2 = x_2) \cdot \dots \cdot \mathbb{P}(X_n = x_n)$
We don’t need to check all subsets, but we do need to check all possible values in the ranges
While many formulas are written over all values, including values outside the range (support) of a random variable, those outside values have probability zero, so they usually don’t need to be checked or included
Lecture 12 - April 21
Linearity of expectation: for any two random variables $X$ and $Y$, $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
This extends to more than two random variables
Simple summation proof
$X$ and $Y$ do not need to be independent
Constants are also fine, matching algebra intuition: $\mathbb{E}[aX + b] = a \cdot \mathbb{E}[X] + b$
Indicator random variables are 1 if the event occurs and zero if it does not
$\mathbf{1}[A]$
The expectation of an indicator variable is the probability the event occurs: $\mathbb{E}[\mathbf{1}[A]] = \mathbb{P}(A)$
How to compute complicated expectations:
Decompose the random variable into the sum of simple random variables
Apply linearity of expectation
Compute the simple expectations
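A sketch of this recipe on a classic example of my own choosing (expected number of fixed points of a random permutation, not an example from the notes above):

```python
# Decompose-into-indicators recipe: expected number of fixed points of a
# uniformly random permutation of n items.
import math
from itertools import permutations

n = 5

# By linearity: X = sum of indicators 1[item i stays in place], and each
# indicator has expectation 1/n, so E[X] = n * (1/n) = 1.
by_linearity = n * (1 / n)

# Brute-force check over all n! permutations.
total_fixed = sum(sum(1 for i, v in enumerate(p) if i == v)
                  for p in permutations(range(n)))
by_brute_force = total_fixed / math.factorial(n)

print(by_linearity, by_brute_force)  # 1.0 1.0
```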
Lecture 13 - April 24
Variance is another one-number summary of a random variable; it measures how spread out its values are around the expectation
We typically square the deviations instead of using absolute values (think norms)