# CSE 312

## Lecture 1 - March 27

- We will cover probability, combinatorics, and some applications
- Counting is helpful because:
- It helps with algo analysis
- It is a building block for probability

- A set’s cardinality (size) is the number of distinct elements in it
*Sum Rule:*if we want to choose an element between two sets with no overlapping elements, we have the sum of the cardinalities number of options*Product Rule:*when we have a sequential process of choosing elements, the number of possibilities is the sum of the number of options*at each step*- This is equal to the size of the Cartesian Product of the sets
- Note that is works independent of which are the elements being chosen from, as long as the number of options is always the same at a step

- What is the size of the power set of $S$?
- $2^{\mid S \mid}$
- Like the product rule, as each element can be in or out of a subset

## Lecture 2 - March 29

- When parsing HW problems, look out for “sequence” and “distinct”
- Remember $0! = 1$
**K Permutation:**- The number of $k$ element sequences of distinct symbols from a universe of $n$ symbols is: $P(n, k) = n (n - 1) \dots (n - k + 1) = \frac{n!}{(n - k)!}$
- Said as “P n K”, “n permute k”, or “n pick k”

- Permutation vs subsets: does order matter?
**K Combination:**- The number of $k$ element subsets from a set of $n$ symbols is: $C(n, k) = \frac{P(n, k)}{k!} = \frac{n!}{k!(n - k)!}$
- Said as “n choose k” typically
- Also written as: $\binom{n}{k}$
- We can think of this as finding the number where ordering matters, then dividing by the number of possible orders for each subset

- We can go back and forth from combinations to permutations by dividing and multiplying by the number of possible orders within a subset (typically $\mid S \mid$)
- Path counting is really just combinations
- Overcounting is cool! (We just have to correct for it)
- e.g. anagrams of “SEATTLE”, we find all permutations, then divide by the product of the factorials of the number of each element
- divide by $2! \cdot 2!$

- e.g. anagrams of “SEATTLE”, we find all permutations, then divide by the product of the factorials of the number of each element
**Complementary Counting:**- We count all the
*complements*of a set (everything that*wouldn’t*be valid), then subtract that from the cardinality of the universe!

- We count all the

## Lecture 3 - April 1

**Multinomial coefficient:**$\binom{c}{a, b} \equiv \frac{c!}{a! \cdot b!}$- Counting two ways can be a good way to learn more and understand!
**Symmetry of combinations:**$\binom{n}{k} = \binom{n}{n - k}$- Can prove using algebra
- Can prove using the “team-picking” idea: choosing who is on a team also indirectly chooses who is not on a team, and vice-versa

**Pascal’s Rule:**$\binom{n}{k} = \binom{n - 1}{k - 1} + \binom{n - 1}{k}$- Again, we can prove using algebra
- We can also prove using the idea of the summation of all teams that I’m on and teams that I’m not on
- “Focus on one element”

**Binomial Theorem:**$(x + y)^n = \sum_{i = 0}^{n} \binom{n}{i} x^i y^{n-i}$- We have $xy$’s and $\binom{n}{i}$ determines how many there are
- Useful when we need to plug in specific numbers for x and y

**Principle of Inclusion-Exclusion:**when we are trying to get the size of a set which is the OR of multiple conditions, we can take the size of the union of the sets which satisfy at least one condition- Sum rule for non-disjoint sets: $\mid A \cup B \mid = \mid A \mid + \mid B \mid - \mid A \cap B \mid$
- Sum rule for three non-disjoint sets follows similar logic, subtracting the intersection of each pair of sets, then adding the intersection of all sets
- Typically only use for two or three conditions (maybe four)

## Lecture 4 - April 3

- Pigeonhole Principle
- If there are more items then spots, we know that one spot has more than one item
- Strong Pigeonhole: If we have $n$ pigeons and $k$ pigeonholes, there is at least one pigeonhole with $\frac{k}{n}$ pigeons,
*rounded up*

- Placing dividers (stars and bars):
- If we need to find the number of different groups comprised of different types of elements, use dividers
- For n size groups of k types: $\binom{n+k-1}{k-1}$
- Can think about it as a binary string permutation problem

## Lecture 5 - April 5

- Probability is a way of quantifying our uncertainty
- Sample space, $\Omega$, is all the possible outcomes of an experiment
- Single coin flip: $\Omega = \lbrace H, T \rbrace$

- An event, $E \subseteq \Omega$, is a subset of all possible outcomes
- A probability, $\mathbb{P}: \Omega \rightarrow [0, 1]$, is a function that maps an element of $\Omega$ to it’s likelihood of occurring
- Other notation: $Pr[\omega]$ or $P(\omega)$
- All probabilities must be between 0 and 1 and the sum of the probability for each element in the sample space must sum to 1
- Probability of an event should be the sum of the probabilities of the elements in the event

- A probability space is a pair of sample space and probability function
- Uniform Probability Space: all events have the same likelihood
- Events are “mutually exclusive” if they cannot happen simultaneously
- Axioms:
- Non-negativity: $\mathbb{P}(x) \geq 0$ for all $x$
- Normalization: $\sum_{x \in \Omega} \mathbb{P}(x) = 1$
- Countable additivity: If $E$ and $F$ are mutually exclusive, then: $\mathbb{P}(E \cup F) = \mathbb{P}(E) + \mathbb{P}(F)$

- Facts derived from axioms:
- Complementation: $\mathbb{P}(\bar{E}) = 1 - \mathbb{P}(E)$
- Monotonicity: if $E \subseteq F$, then $\mathbb{P}(E) \leq \mathbb{P}(F)$
- Inclusion-exclusion: $\mathbb{P}(E \cup F) = \mathbb{P}(E) + \mathbb{P}(F) - \mathbb{P}(E \cap F)$

## Lecture 6 - April 7

- Often our sample space will contain excess information. This won’t make our answer incorrect, but can lead to unnecessary work
- We use conditional probability when we have partial information
- It is a way to “restrict” the sample space
- $\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}$
- This allows us to update our probabilities
- “The probability of A given B is equal to the probability of A and B happening, divided by the probability of B”

## Lecture 7 - April 10

- Bayes’ Rule allows us to use conditional probabilities
- Bayes Rule:
- $\mathbb{P}(A \mid B) = \frac{\mathbb{P}(B \mid A) \mathbb{P}(A)}{\mathbb{P}(B)}$

- The law of total probability:
- $\mathbb{P}(S) = \mathbb{P}(S \mid G) \cdot \mathbb{P}(G) + \mathbb{P}(S \mid \bar{G}) \cdot \mathbb{P}(\bar{G})$
- More generally: $\sum_{\forall i}\mathbb{P}(E \mid A_i) \mathbb{P}(A_i)$
- The probability of an event happening is equal to the sum of the probability of the event happening given another event happening, multiplied by the probability of the other event

- A partition of the sample space is a family of subsets where each partition is distinct and they combine to be the entire sample space
- Humans are very bad at understanding very large or small numbers - past a certain amount we ignore magnitudes
- It is good to think of tests as an “update” to our priors, not a revelation of truth
- When we condition on an event, we still have probability spaces: $B$ and probability measures: $\mathbb{P}(\omega \mid B)$
- Do
*not*condition on multiple events, only the intersection of them

## Lecture 8 - April 12

- (Statistical) independence is when the probabilities of two things don’t depend on each other:
- $\mathbb{P}(A \cap B) = \mathbb{P}(A) \cdot \mathbb{P}(B)$
- “Conditioning doesn’t make a difference”

- Chain Rule:
- $\mathbb{P}(A_1 \cap A_2 \cap \dots \cap A_n) = \mathbb{P}(A_n \mid A_1 \cap \dots \cap A_{n-1}) \cdot \mathbb{P}(A_{n-1} \mid A_1 \cap \dots \cap A_{n-2}) \dots \mathbb{P}(A_2 \mid A_1) \cdot \mathbb{P}(A_1)$
- We can find this from: $\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} \rightarrow \mathbb{P}(A \cap B) = \mathbb{P}(A \mid B) \cdot \mathbb{P}(B)$

- Conditional independence is independence of two events given another event

## Lecture 9 - April 14

- Medical terms for tests:
- $\mathbb{P}(D)$ is “prevalence”
- $\mathbb{P}(T \mid D)$ is “sensitivity”
- $\mathbb{P}(\bar{T} \mid \bar{D})$ is “specificity”

**ONE WEIRD TRICK:**- Think of these problems with a large population (where at least one for each chance)

- Intuition Trick: Bayes’ Factor:
- When you test positive, multiply prior by the Bayes’ Factor: $\frac{\text{sensitivity}}{\text{false positive rate}} = \frac{1 - FNR}{FPR}$
- Also called the “likelihood ratio”
- It is an estimate of how much I should update the prior by
- Better when the prior is quite low

- When there are overwhelming differences between groups, response error can drown out small groups

## Lecture 10 - April 17

- We often implicitly define the sample space:
- This commonly occurs when the size of the sample space is infinite

- When working with sample spaces with infinite size:
- We can use infinite sums of probabilities (and closed forms)
- We can use complement events

- Random Variables:
- It is any function that has domain $\Omega$ and outputs a real number
- $X(\omega)$ is a summary of $\omega$
- It doesn’t
*change*the problem, but it can simplify it!

- One sample space can have many random variables
- We always use capital letters as random variables
- We commonly use lowercase variables as the values the random variable could take on
- Random variables are a function
- We do not use typical function notation, instead something like: $X = 2$

- The
**support**(the range) is the set of values $X$ can take on - The event the random variable takes on a value: $\lbrace \omega : X(\omega) = x \rbrace$
- The probability of that event is: $\mathbb{P}(X = x)$

- The function that gives us $\mathbb{P}(X = x)$ is the Probability Mass Function, or PMF
- This is written as: $p_X(x)$ or $f_X(x)$
- Think of the PMF as a piecewise function where values outside the support are zero because it must take in all real numbers as an input

- The Cumulative Distribution Function (CDF) gives the probability $X \leq x$
- Written as: $F_X(x) = \mathbb{P}(X \leq x)$

## Lecture 11 - April 19

- CDF will always be defined over all real numbers (we must support all of them)
- We will often use floor functions when we only want to “support” integers
- Typically has zero below the support and one above the support

- The “expectation” of a random variable $X$ is: $\mathbb{E}[X] = \sum_k k \cdot
\mathbb{P}(X = k)$
- The “weighted average” of values $X$ could take on
- Weighted by the probability we actually see the value
- Think about it as the expectation of a drive in football:
- $X$ is the possible scores $0, 2, 3, 6, 7$
- We multiply each score by the likelihood that it happens, sum all of these for our expected score

- Not the most common outcome

- The expectation of the sum of two random variables is equal to the sum of the expectations for each variable
- Pairwise independece: for each pair in the set, they are independent
- Mutual independence: for each subset of events, the probability of the
intersection of all of them is equal to the product of all individual
probabilities of the events
- Stronger than pairwise independence

- Two random variables $X$ and $Y$ are independent if for all $k$ and $l$ (in
the supports), $\mathbb{P}(X = k, Y = l) = \mathbb{P}(X = k) \cdot
\mathbb{P}(Y = l)$
- Note that commas are often used instead of $\cap$ for random variables

- Pairwise independence is the intuition
- Mutual independence of random variables: $X_1, X_2, \dots, X_n$ are mutually
independent if for all $x_1, x_2, \dots, x_n$ $\mathbb{P}(X_1 = x_1, X_2 =
x_2, \dots, X_n = x_n) = \mathbb{P}(X_1 = x_1) \cdot \mathbb{P}(X_2 = x_2)
\cdot \dots \cdot \mathbb{P}(X_n = x_n)$
- We
*don’t*need to check all subsets, but*do*need to check all possible values in the ranges

- We
- While many equations might want all values (outside) the range of a random variable to be checked or included, because the probability of it is zero they often don’t need to be

## Lecture 12 - April 21

- Linearity of expectation: for any two random variables $X$ and $Y$,
$\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
- This extends to more than two random variables
- Simple summation proof
- $X$ and $Y$ do
**not**need to be independent - Constants are also fine (works like algebra intuition!)

- Indicator random variables are 1 if the event occurs and zero if it does not
- $\mathbf{1}[A]$
- The expectation of the indicator variable is the probability the event occurs

- How to compute complicated expectations:
- Decompose the random variable into the sum of simple random variables
- Apply linearity of expecation
- Compute the simple expectations

## Lecture 13 - April 24

- Variance is another one-number summary of a random variable
- We typically square values instead of using absolute values (think norms)
- Variance: $Var(X) = \sum_{\omega} \mathbb{P}(\omega) \cdot (X(\omega) - \mathbb{E}[X])^2$
- How “extreme” or “spread out” values are
- $= \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
- We should first find $\mathbb{E}[X]$

- If $X$ and $Y$ are independent:
- $Var(X + Y) = Var(X) + Var(Y)$
- $\mathbb{E}[X \cdot Y] = \mathbb{E}[X] \cdot \mathbb{E}[Y]$

- Squaring an indicator variable does nothing to it
- Squaring random variables:
*we only square the value, not probability*

- Squaring random variables:
- $Var(X + c) = Var(X)$
- Think of this as just shifting the distribution

- $Var(aX) = a^2 Var(X)$
- Stretching or compressing random variable

## Lecture 14 - April 26

- Introduces the random variable zoo: a list of facts about random variables (and distributions)
- Use a reference sheet or wikipedia!

- Memoryless - future outcomes aren’t influenced by past outcomes
- “Independent and identially distributed”: “iid”
- Bernoulli: one trial with a probability $p$ of success
- Binomial: how many success in $n$ independent trials with probability $p$ of success?
- Geometric: how many independent trials until the first success?
- Uniform: integer in some range with each value equally likely
- We need smallest and largest possible value for the range

- Negative Binomial: how many independent trials with probability $p$ until we have $r$ successes?

## Lecture 15 - April 28

- Poisson Distribution: we know the average over an interval of time, we can’t use each individual possible source
- (Kinda) requires each event to be independent
- We use Poisson because we don’t have good ideas of what the random variables are
- This is a model - not perfect (most real-world events are somewhat dependent), but it is useful!
- The PMF involves the Taylor Series for $e^x$
- It is a way of using the limit as the number of experiments approach infinity, but the mean stays constant

- Hypergeometric: drawing without probability from an urn
- Continuous random variables:
- We need continuous probability spaces and continuous random variables to represent uncountably-infinite sample spaces

- We use the probability density function for continuous random variables:
- It is a way of comparing probability of being near different events
- $f_x(k) \geq 0$
- $\int_{-\infty}^{\infty} f_x(k) dk = 1$
- We use this because we need the PDF to work for events

- Integrating is analogous to summation - continuous vs discrete values:
- $\mathbb{P}(a \leq X \leq B) = c$
- $\int_{a}^{b} f_x(k) dk = c$

- Impossible events still have probability 0, but some probability 0 events are still possible for continuous probability spaces

## Lecture 16 - May 1

- Continuous random variables require Probability Density Function:
- Main difference is that we use events vs values
- Every single value is equal to 0 (typically)
- The PDF is the number that when integrated over gives the probability of an event
- Comparing $f_X(k)$ and $f_X(l)$ gives relative chances of $X$ being near $k$ vs $l$
- $\mathbb{P}(a \leq X \leq b) = \int_{a}^{b} f_X(z)dz $
- Sometimes the density will be greater than 1

- Cumulative density function is analogous when using continuous vs discrete random variables:
- $F_X(k) = \mathbb{P}(X \leq k) = \int_{- \infty}^{k} f_X(z)dz$

- CDF to PDF by taking the derivative of CDF
- Undos the integral
- Vice-versa works as well

- General pattern: summation for discrete random variables becomes integration for continuous
- $\mathbb{E}[X] = \int_{- \infty}^{\infty} X(z) \cdot f_X(z) dz$
- Expectation of a function of a random variable:
- $\mathbb{E}[g(X)] = \int_{- \infty}^{\infty} g(X(z)) \cdot f_X(z)dz$
- “Law of the Unconcious Statistician” ~math nerds

- Linearity of Expectations still works!
- Variance: $Var(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \int_{- \infty}^{\infty} f_x(k)(X(k) - \mathbb{E}[X])^2 dk$
- Expectation of a uniform random varaible between $a$ and $b$ is the same, $\frac{a+b}{2}$
- Expectation of the uniform random variable squared is: $\frac{a^2 + ab + b^2}{3}$
- Variance of the variable is: $\frac{(b - a)^2}{12}$

## Lecture 17 - May 5

- Exponential random variable: like geometric random variable, but continuous time
- “Waiting doesn’t make the event happen sooner” - meomoryless
- $f_X(k) = \lambda e^{- \lambda k}$
- $\mathbb{E}[X] = \frac{1}{\lambda}$
- Same as taking a Poisson random variable and asking “How long until the next event?”
- $F_X(t) = \mathbb{P}(X \leq t) = 1 - e^{- \lambda t}$
- Expectation is: $\frac{1}{\lambda}$
- Variance is: $\frac{1}{\lambda^2}$

- Gaussian (normal) random variable:
- Mean: $\mu$
- Variance: $\sigma^2$
- $f_X(x) = \frac{1}{\sigma \sqrt{2 \pi}} \cdot e^{- \frac{(x - \mu)^2}{2 \sigma^2}}$
- $F_X(k)$ has no nice closed form: use the table
- $\mathbb{E}[X] = \mu$
- This is the bell curve
- Scaling or adding to a normal variable results in another normal variable

- To normalize normal variable $X$: $Y = \frac{X - \mu}{\sigma}$
- We normalize because the CDF for normal random variables can be super rough
- We convert to a “standard normal”, round the “z-score” to the hundredths, look up on the table

## Lecture 18 - May 8

- The sum of any independent random variables approaches the normal distribution
- Let $X_1, X_2, \dots , X_n$ i.i.d. random variables with mean $\mu$ and variance $\sigma ^2$
- Let $Y_n = \frac{X_1 + X_2 + \dots + X_n - n \mu}{\sigma \sqrt{n}}$
- Then as $n \rightarrow \infty$, $Y_n$ converges to the CDF of $\mathbb{N}(0, 1)$
- Only equal in the limit!

- Gaussians often occur in the real world because the random variable is a combination of many independent factors
- We can often use the Gaussian CDF in practice instead of complicated independent variables
- Note that $\Phi$ represents the CDF of the Gaussian with mean 0 and variance 1
- The z-table only has positive values: we use $\Phi(-x) = 1 - \Phi(x)$
- When we have a discrete variable we’re approximating with a continuous random variable, we must do a continuity correction:
- We find all values that would
*round*to the correct discrete value - Other corrections are typically not worth the effort

- We find all values that would
- CLT Usage Outline:
- Write down the event you’re interested in, in terms of the sum of random variables
- Apply continiuty correction if RVs are discrete
- Normalize RB to have mean 0 and standard deviation 1
- Replace RV with $\mathbb{N}(0,1)$
- Write event in terms of $\Phi$
- Look up in table

- Polling: $\mathbb{P}(\vert X - \mathbb{E}[X] \vert \geq) s) \leq \epsilon$
- How often $\epsilon$ is our polling outside an acceptable margin $s$

## Lecture 19 - May 10

- If two RVs are independent, the probability they both equal some values is equal to their individual products
- We can use LTP to talk about only one variable (when they’re dependent)
- This is the marginal distribution, as we marginalized a RV
- The “marginal” for $X$ is where we marginalize all other RVs

- Joint expectation is still the sum of the probabilities times the value
- This is often written as a function
- Conditional expectations are also intuitively defined
- LOE still work!

- Law of Total Expectation: basically the same as LTP, but with expectations
- $\mathbb{E} = \sum_{i=1}^n \mathbb{E}[X \vert A_i] \mathbb{E}(A_i)$

- Covariance: how much $X$ and $Y$ change
*together*- $\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E})(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X] \mathbb{E}[Y]$
- We often only care about if $\text{Cov}$ is positive or negative

- $\text{Variance}(X + Y) = \text{Variance}(X) + \text{Variance}(Y) + 2 \text{Cov}(X, Y)$

## Lecture 20 - May 12

- Two methods of polling: sampling with vs without replacement
- We will do the math for sampling with replacement even if this isn’t used in practice

- We don’t do continuity corrections when we’re polling
- Accuracy of polling isn’t determined by the total population amount, only the number of people polled
- Works for idealized polling scenarios with large-ish populations
- This is due to the way we use the two methods of polling

- The “margin of error” is an intuitive way of measuring variance
- “If I performed this poll repeatedly, 95% of the time the correct value will be within +/- the margin of error”

- Find the number of people necessary to guarantee a margin of error
- Handling $\sqrt{p(1-p)}$, assume worst case scenario for $p = 0.5$
- We use the z-table in reverse, find the smallest input that gives the correct output

## Lecture 21 - May 15

- We need a way to calculate the bounds with certainty
- Tail bound: “bounds the size of the tail”
- Markov Inequality:
- Let $X$ be a random variable supported only on non-negative numbers:
- $\mathbb{P}(X \geq t) \leq \frac{\mathbb{E}[X]}{t}$
- $\mathbb{P}(X \geq k \mathbb{E}[X]) \leq \frac{1}{k}$

- Non-negative random variable and $k$ or $t$ is positive

- Let $X$ be a random variable supported only on non-negative numbers: