CSE 312

Lecture 1 - March 27

  • We will cover probability, combinatorics, and some applications
  • Counting is helpful because:
    • It helps with algo analysis
    • It is a building block for probability
  • A set’s cardinality (size) is the number of distinct elements in it
  • Sum Rule: if we choose one element from either of two sets with no overlapping elements, the number of options is the sum of the two cardinalities
  • Product Rule: when we have a sequential process of choosing elements, the number of possibilities is the product of the number of options at each step
    • This is equal to the size of the Cartesian Product of the sets
    • Note that this works regardless of which elements are available at each step, as long as the number of options at a step is always the same
  • What is the size of the power set of $S$?
    • $2^{\mid S \mid}$
    • Like the product rule, as each element can be in or out of a subset
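
As a small sanity check of the power set claim, the sketch below enumerates every subset of a three-element set and compares the count with $2^{\mid S \mid}$. The set `S` and the helper `power_set` are made up for illustration.

```python
from itertools import combinations, product


def power_set(items):
    """Return every subset of `items` as a tuple."""
    items = list(items)
    subsets = []
    for size in range(len(items) + 1):
        subsets.extend(combinations(items, size))
    return subsets


S = ["a", "b", "c"]  # hypothetical example set
subsets = power_set(S)

# Product rule view: one independent in/out decision per element.
in_out_choices = list(product([False, True], repeat=len(S)))

print(len(subsets))         # 8
print(len(in_out_choices))  # 8
print(2 ** len(S))          # 8
```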

Lecture 2 - March 29

  • When parsing HW problems, look out for “sequence” and “distinct”
  • Remember $0! = 1$
  • K Permutation:
    • The number of $k$ element sequences of distinct symbols from a universe of $n$ symbols is: $P(n, k) = n (n - 1) \dots (n - k + 1) = \frac{n!}{(n - k)!}$
    • Said as “P n K”, “n permute k”, or “n pick k”
  • Permutation vs subsets: does order matter?
  • K Combination:
    • The number of $k$ element subsets from a set of $n$ symbols is: $C(n, k) = \frac{P(n, k)}{k!} = \frac{n!}{k!(n - k)!}$
    • Said as “n choose k” typically
    • Also written as: $\binom{n}{k}$
    • We can think of this as finding the number where ordering matters, then dividing by the number of possible orders for each subset
  • We can go back and forth from combinations to permutations by dividing and multiplying by the number of possible orders within a subset ($k!$ for a subset of size $k$)
  • Path counting is really just combinations
  • Overcounting is cool! (We just have to correct for it)
    • e.g. anagrams of “SEATTLE”, we find all permutations, then divide by the product of the factorials of the number of each element
      • divide by $2! \cdot 2!$
  • Complementary Counting:
    • We count all the complements of a set (everything that wouldn’t be valid), then subtract that from the cardinality of the universe!
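
The counting formulas from this lecture can be checked numerically. This is a minimal sketch using Python's standard library; the values `n = 5`, `k = 3` and the helper `count_anagrams` are chosen only for illustration.

```python
from collections import Counter
from itertools import permutations
from math import comb, factorial, perm

n, k = 5, 3  # hypothetical sizes

# k-permutations: ordered sequences of k distinct symbols out of n.
assert perm(n, k) == factorial(n) // factorial(n - k)

# k-combinations: count ordered sequences, then divide out the k! orderings.
assert comb(n, k) == perm(n, k) // factorial(k)


def count_anagrams(word):
    """All orderings of the letters, divided by the factorial of each repeat count."""
    total = factorial(len(word))
    for repeats in Counter(word).values():
        total //= factorial(repeats)
    return total


seattle = count_anagrams("SEATTLE")               # 7! / (2! * 2!)
brute_force = len(set(permutations("SEATTLE")))   # overcount, then deduplicate

print(seattle, brute_force)  # 1260 1260
```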

Lecture 3 - April 1

  • Multinomial coefficient: $\binom{c}{a, b} \equiv \frac{c!}{a! \cdot b!}$, where $a + b = c$
  • Counting two ways can be a good way to learn more and understand!
  • Symmetry of combinations: $\binom{n}{k} = \binom{n}{n - k}$
    • Can prove using algebra
    • Can prove using the “team-picking” idea: choosing who is on a team also indirectly chooses who is not on a team, and vice-versa
  • Pascal’s Rule: $\binom{n}{k} = \binom{n - 1}{k - 1} + \binom{n - 1}{k}$
    • Again, we can prove using algebra
    • We can also prove using the idea of the summation of all teams that I’m on and teams that I’m not on
      • “Focus on one element”
  • Binomial Theorem: $(x + y)^n = \sum_{i = 0}^{n} \binom{n}{i} x^i y^{n-i}$
    • Each term is a product of $x$’s and $y$’s, and $\binom{n}{i}$ counts how many ways $i$ of the $n$ factors can contribute an $x$
    • Useful when we need to plug in specific numbers for x and y
  • Principle of Inclusion-Exclusion: when we want the size of a set defined by the OR of multiple conditions, we compute the size of the union of the sets satisfying each condition, correcting for elements that are counted more than once
    • Sum rule for non-disjoint sets: $\mid A \cup B \mid = \mid A \mid + \mid B \mid - \mid A \cap B \mid$
    • Sum rule for three non-disjoint sets follows similar logic, subtracting the intersection of each pair of sets, then adding the intersection of all sets
    • Typically only use for two or three conditions (maybe four)
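
The identities above are easy to spot-check numerically. The sketch below verifies symmetry, Pascal's Rule, and the Binomial Theorem for one value of $n$, then checks inclusion-exclusion on a made-up example (integers from 1 to 100 divisible by 2, 3, or 5). All concrete numbers are illustrative assumptions.

```python
from math import comb

n = 6  # hypothetical size

for k in range(1, n):
    assert comb(n, k) == comb(n, n - k)                       # symmetry
    assert comb(n, k) == comb(n - 1, k - 1) + comb(n - 1, k)  # Pascal's Rule

# Binomial Theorem with specific numbers plugged in for x and y.
x, y = 2, 3
binomial_sum = sum(comb(n, i) * x**i * y ** (n - i) for i in range(n + 1))
assert binomial_sum == (x + y) ** n

# Inclusion-exclusion on sets of integers in 1..100.
universe = range(1, 101)
A = {m for m in universe if m % 2 == 0}
B = {m for m in universe if m % 3 == 0}
C = {m for m in universe if m % 5 == 0}

two_sets = len(A) + len(B) - len(A & B)
three_sets = (
    len(A) + len(B) + len(C)
    - len(A & B) - len(A & C) - len(B & C)
    + len(A & B & C)
)

print(two_sets, len(A | B))        # 67 67
print(three_sets, len(A | B | C))  # 74 74
```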

Lecture 4 - April 3

  • Pigeonhole Principle
    • If there are more items than spots, we know that at least one spot has more than one item
    • Strong Pigeonhole: If we have $n$ pigeons and $k$ pigeonholes, there is at least one pigeonhole with at least $\frac{n}{k}$ pigeons, rounded up
  • Placing dividers (stars and bars):
    • If we need to find the number of different groups comprised of different types of elements, use dividers
    • For n size groups of k types: $\binom{n+k-1}{k-1}$
    • Can think about it as a binary string permutation problem
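
To see the dividers formula in action, this sketch compares $\binom{n+k-1}{k-1}$ against a brute-force count of all ways to split $n$ identical items among $k$ types. The function names and the values `n = 5`, `k = 3` are assumptions for illustration.

```python
from itertools import product
from math import comb


def stars_and_bars(n, k):
    """Number of size-n groups drawn from k types (n stars, k - 1 bars)."""
    return comb(n + k - 1, k - 1)


def brute_force(n, k):
    """Count tuples of k non-negative integers that sum to n."""
    return sum(
        1 for counts in product(range(n + 1), repeat=k) if sum(counts) == n
    )


n, k = 5, 3
print(stars_and_bars(n, k), brute_force(n, k))  # 21 21
```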

Lecture 5 - April 5

  • Probability is a way of quantifying our uncertainty
  • Sample space, $\Omega$, is all the possible outcomes of an experiment
    • Single coin flip: $\Omega = \lbrace H, T \rbrace$
  • An event, $E \subseteq \Omega$, is a subset of all possible outcomes
  • A probability, $\mathbb{P}: \Omega \rightarrow [0, 1]$, is a function that maps an element of $\Omega$ to its likelihood of occurring
    • Other notation: $Pr[\omega]$ or $P(\omega)$
    • All probabilities must be between 0 and 1, and the probabilities of all elements in the sample space must sum to 1
    • Probability of an event should be the sum of the probabilities of the elements in the event
  • A probability space is a pair of sample space and probability function
  • Uniform Probability Space: all outcomes have the same likelihood
  • Events are “mutually exclusive” if they cannot happen simultaneously
  • Axioms:
    • Non-negativity: $\mathbb{P}(x) \geq 0$ for all $x$
    • Normalization: $\sum_{x \in \Omega} \mathbb{P}(x) = 1$
    • Countable additivity: If $E$ and $F$ are mutually exclusive, then: $\mathbb{P}(E \cup F) = \mathbb{P}(E) + \mathbb{P}(F)$
  • Facts derived from axioms:
    • Complementation: $\mathbb{P}(\bar{E}) = 1 - \mathbb{P}(E)$
    • Monotonicity: if $E \subseteq F$, then $\mathbb{P}(E) \leq \mathbb{P}(F)$
    • Inclusion-exclusion: $\mathbb{P}(E \cup F) = \mathbb{P}(E) + \mathbb{P}(F) - \mathbb{P}(E \cap F)$

Lecture 6 - April 7

  • Often our sample space will contain excess information. This won’t make our answer incorrect, but can lead to unnecessary work
  • We use conditional probability when we have partial information
    • It is a way to “restrict” the sample space
    • $\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}$
    • This allows us to update our probabilities
    • “The probability of A given B is equal to the probability of A and B happening, divided by the probability of B”
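
A minimal sketch of the definition, assuming a uniform probability space of two fair dice. The events `A` (the sum is 8) and `B` (the first die is even) are made up for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs of two fair six-sided dice, all equally likely.
omega = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))


def A(w):
    return w[0] + w[1] == 8


def B(w):
    return w[0] % 2 == 0


p_A = prob(A)
p_A_given_B = prob(lambda w: A(w) and B(w)) / prob(B)

print(p_A)          # 5/36
print(p_A_given_B)  # 1/6
```

Conditioning on `B` restricts the sample space to the 18 outcomes where the first die is even; 3 of those have a sum of 8.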

Lecture 7 - April 10

  • Bayes’ Rule allows us to reverse the direction of conditioning: it gives $\mathbb{P}(A \mid B)$ in terms of $\mathbb{P}(B \mid A)$
  • Bayes’ Rule:
    • $\mathbb{P}(A \mid B) = \frac{\mathbb{P}(B \mid A) \mathbb{P}(A)}{\mathbb{P}(B)}$
  • The law of total probability:
    • $\mathbb{P}(S) = \mathbb{P}(S \mid G) \cdot \mathbb{P}(G) + \mathbb{P}(S \mid \bar{G}) \cdot \mathbb{P}(\bar{G})$
    • More generally, if $A_1, A_2, \dots$ partition the sample space: $\mathbb{P}(E) = \sum_{i}\mathbb{P}(E \mid A_i) \mathbb{P}(A_i)$
    • The probability of an event is the sum, over each part of the partition, of the probability of the event given that part, multiplied by the probability of that part
  • A partition of the sample space is a family of subsets that are pairwise disjoint and whose union is the entire sample space
  • Humans are very bad at understanding very large or small numbers - past a certain amount we ignore magnitudes
  • It is good to think of tests as an “update” to our priors, not a revelation of truth
  • When we condition on an event $B$, we still have a valid probability space: the sample space is $B$ and the probability measure is $\mathbb{P}(\omega \mid B)$
  • Do not condition on multiple events, only the intersection of them

Lecture 8 - April 12

  • (Statistical) independence is when the probabilities of two things don’t depend on each other:
    • $\mathbb{P}(A \cap B) = \mathbb{P}(A) \cdot \mathbb{P}(B)$
    • “Conditioning doesn’t make a difference”
  • Chain Rule:
    • $\mathbb{P}(A_1 \cap A_2 \cap \dots \cap A_n) = \mathbb{P}(A_n \mid A_1 \cap \dots \cap A_{n-1}) \cdot \mathbb{P}(A_{n-1} \mid A_1 \cap \dots \cap A_{n-2}) \dots \mathbb{P}(A_2 \mid A_1) \cdot \mathbb{P}(A_1)$
    • We can find this from: $\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} \rightarrow \mathbb{P}(A \cap B) = \mathbb{P}(A \mid B) \cdot \mathbb{P}(B)$
  • Conditional independence is independence of two events given another event
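
The independence definition can be tested directly by enumeration. This sketch assumes two fair dice; the three events are hypothetical examples, one pair that is independent and one that is not.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # two fair dice


def prob(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))


def is_independent(e, f):
    """Check P(E and F) == P(E) * P(F) exactly."""
    both = prob(lambda w: e(w) and f(w))
    return both == prob(e) * prob(f)


def first_even(w):
    return w[0] % 2 == 0


def sum_is_7(w):
    return w[0] + w[1] == 7


def sum_is_8(w):
    return w[0] + w[1] == 8


print(is_independent(first_even, sum_is_7))  # True
print(is_independent(first_even, sum_is_8))  # False
```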

Lecture 9 - April 14

  • Medical terms for tests:
    • $\mathbb{P}(D)$ is “prevalence”
    • $\mathbb{P}(T \mid D)$ is “sensitivity”
    • $\mathbb{P}(\bar{T} \mid \bar{D})$ is “specificity”
    • Think of these problems in terms of a large population (large enough that every case contains at least one person)
  • Intuition Trick: Bayes’ Factor:
    • When you test positive, multiply prior by the Bayes’ Factor: $\frac{\text{sensitivity}}{\text{false positive rate}} = \frac{1 - FNR}{FPR}$
    • Also called the “likelihood ratio”
    • It is an estimate of how much I should update the prior by
    • Better when the prior is quite low
  • When there are overwhelming differences between groups, response error can drown out small groups
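
The sketch below works one test example with Bayes' Rule and the law of total probability, then compares the exact answer with the Bayes' Factor shortcut. The prevalence, sensitivity, and specificity values are invented for illustration.

```python
from fractions import Fraction

# Hypothetical test characteristics.
prevalence = Fraction(1, 100)    # P(D)
sensitivity = Fraction(95, 100)  # P(T | D)
specificity = Fraction(90, 100)  # P(not T | not D)
false_positive_rate = 1 - specificity

# Law of total probability: P(T) = P(T | D) P(D) + P(T | not D) P(not D)
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Bayes' Rule: P(D | T)
posterior = sensitivity * prevalence / p_positive

# Intuition trick: prior times the Bayes' Factor.
bayes_factor = sensitivity / false_positive_rate
estimate = prevalence * bayes_factor

print(float(posterior))  # about 0.0876
print(float(estimate))   # 0.095
```

With a prior this low the shortcut lands close to the exact posterior, which matches the note that the estimate is better when the prior is quite low.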

Lecture 10 - April 17

  • We often implicitly define the sample space:
    • This commonly occurs when the size of the sample space is infinite
  • When working with sample spaces with infinite size:
    • We can use infinite sums of probabilities (and closed forms)
    • We can use complement events
  • Random Variables:
    • It is any function that has domain $\Omega$ and outputs a real number
    • $X(\omega)$ is a summary of $\omega$
    • It doesn’t change the problem, but it can simplify it!
  • One sample space can have many random variables
  • We always use capital letters as random variables
  • We commonly use lowercase variables as the values the random variable could take on
  • Random variables are a function
    • We do not use typical function notation, instead something like: $X = 2$
  • The support (the range) is the set of values $X$ can take on
  • The event the random variable takes on a value: $\lbrace \omega : X(\omega) = x \rbrace$
    • The probability of that event is: $\mathbb{P}(X = x)$
  • The function that gives us $\mathbb{P}(X = x)$ is the Probability Mass Function, or PMF
    • This is written as: $p_X(x)$ or $f_X(x)$
    • Think of the PMF as a piecewise function where values outside the support are zero because it must take in all real numbers as an input
  • The Cumulative Distribution Function (CDF) gives the probability $X \leq x$
    • Written as: $F_X(x) = \mathbb{P}(X \leq x)$
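
As a concrete instance, take $X$ to be the sum of two fair dice (an example chosen here, not from the lecture). The sketch builds the PMF and CDF so that both accept any real number as input.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))


def X(w):
    """Random variable: a function from outcomes to real numbers."""
    return w[0] + w[1]


counts = Counter(X(w) for w in omega)  # the keys form the support of X


def pmf(x):
    """P(X = x); zero for any x outside the support."""
    return Fraction(counts.get(x, 0), len(omega))


def cdf(x):
    """P(X <= x), defined for every real x."""
    return sum((pmf(k) for k in counts if k <= x), Fraction(0))


print(pmf(7), pmf(2.5))            # 1/6 0
print(cdf(1.5), cdf(4), cdf(100))  # 0 1/6 1
```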

Lecture 11 - April 19

  • CDF will always be defined over all real numbers (we must support all of them)
    • We will often use floor functions when we only want to “support” integers
    • Typically has zero below the support and one above the support
  • The “expectation” of a random variable $X$ is: $\mathbb{E}[X] = \sum_k k \cdot \mathbb{P}(X = k)$
    • The “weighted average” of values $X$ could take on
    • Weighted by the probability we actually see the value
    • Think about it as the expectation of a drive in football:
      • $X$ is the possible scores $0, 2, 3, 6, 7$
      • We multiply each score by the likelihood that it happens, sum all of these for our expected score
    • Not the most common outcome
  • The expectation of the sum of two random variables is equal to the sum of the expectations for each variable
  • Pairwise independence: for each pair in the set, they are independent
  • Mutual independence: for each subset of events, the probability of the intersection of all of them is equal to the product of all individual probabilities of the events
    • Stronger than pairwise independence
  • Two random variables $X$ and $Y$ are independent if for all $k$ and $l$ (in the supports), $\mathbb{P}(X = k, Y = l) = \mathbb{P}(X = k) \cdot \mathbb{P}(Y = l)$
    • Note that commas are often used instead of $\cap$ for random variables
  • Pairwise independence is the intuition
  • Mutual independence of random variables: $X_1, X_2, \dots, X_n$ are mutually independent if for all $x_1, x_2, \dots, x_n$ $\mathbb{P}(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = \mathbb{P}(X_1 = x_1) \cdot \mathbb{P}(X_2 = x_2) \cdot \dots \cdot \mathbb{P}(X_n = x_n)$
    • We don’t need to check all subsets, but do need to check all possible values in the ranges
  • While many equations might want all values (outside) the range of a random variable to be checked or included, because the probability of it is zero they often don’t need to be

Lecture 12 - April 21

  • Linearity of expectation: for any two random variables $X$ and $Y$, $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$
    • This extends to more than two random variables
    • Simple summation proof
    • $X$ and $Y$ do not need to be independent
    • Constants are also fine (works like algebra intuition!)
  • Indicator random variables are 1 if the event occurs and zero if it does not
    • $\mathbf{1}[A]$
    • The expectation of the indicator variable is the probability the event occurs
  • How to compute complicated expectations:
    • Decompose the random variable into the sum of simple random variables
    • Apply linearity of expectation
    • Compute the simple expectations
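
The three-step recipe can be illustrated on a standard example that is not from the lecture: the expected number of fixed points in a uniformly random permutation. The indicators are dependent, yet linearity still applies. `n = 4` is an arbitrary small size so that brute force is cheap.

```python
from fractions import Fraction
from itertools import permutations

n = 4
perms = list(permutations(range(n)))  # uniform sample space


def fixed_points(p):
    """X = number of positions i with p[i] == i."""
    return sum(1 for i, v in enumerate(p) if i == v)


# Direct computation: average X over the whole sample space.
direct = Fraction(sum(fixed_points(p) for p in perms), len(perms))

# Decompose X into indicators X_i = 1[p[i] == i]; E[X_i] = P(p[i] == i).
indicator_expectations = [
    Fraction(sum(1 for p in perms if p[i] == i), len(perms)) for i in range(n)
]
via_linearity = sum(indicator_expectations)

print(direct, via_linearity)  # 1 1
```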

Lecture 13 - April 24

  • Variance is another one-number summary of a random variable
  • We typically square values instead of using absolute values (think norms)
  • Variance: $Var(X) = \sum_{\omega} \mathbb{P}(\omega) \cdot (X(\omega) - \mathbb{E}[X])^2$
    • How “extreme” or “spread out” values are
    • $= \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$
    • We should first find $\mathbb{E}[X]$
  • If $X$ and $Y$ are independent:
    • $Var(X + Y) = Var(X) + Var(Y)$
    • $\mathbb{E}[X \cdot Y] = \mathbb{E}[X] \cdot \mathbb{E}[Y]$
  • Squaring an indicator variable does nothing to it
    • Squaring random variables: we only square the value, not probability
  • $Var(X + c) = Var(X)$
    • Think of this as just shifting the distribution
  • $Var(aX) = a^2 Var(X)$
    • Stretching or compressing random variable
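
A short check of the variance formulas on a fair six-sided die (an example chosen for illustration). It computes the variance from the definition and from the shortcut, then confirms the shift and scale rules for $Y = 2X + 3$.

```python
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}  # fair die


def expectation(pmf, g=lambda x: x):
    """E[g(X)]: only the value is transformed, not the probability."""
    return sum(g(x) * p for x, p in pmf.items())


def variance(pmf):
    mean = expectation(pmf)
    return expectation(pmf, lambda x: (x - mean) ** 2)


mean = expectation(pmf)
var_definition = variance(pmf)
var_shortcut = expectation(pmf, lambda x: x**2) - mean**2

# Y = 2X + 3: the shift by 3 changes nothing, the scale by 2 multiplies by 4.
pmf_y = {2 * x + 3: p for x, p in pmf.items()}
var_y = variance(pmf_y)

print(mean, var_definition, var_shortcut)  # 7/2 35/12 35/12
print(var_y)                               # 35/3
```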

Lecture 14 - April 26

  • Introduces the random variable zoo: a list of facts about random variables (and distributions)
    • Use a reference sheet or wikipedia!
  • Memoryless - future outcomes aren’t influenced by past outcomes
  • “Independent and identically distributed”: “iid”
  • Bernoulli: one trial with a probability $p$ of success
  • Binomial: how many successes in $n$ independent trials with probability $p$ of success?
  • Geometric: how many independent trials until the first success?
  • Uniform: integer in some range with each value equally likely
    • We need smallest and largest possible value for the range
  • Negative Binomial: how many independent trials with probability $p$ until we have $r$ successes?
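
The notes do not write out the PMFs, so the sketch below assumes the standard textbook forms for the binomial and geometric distributions and checks two zoo facts: the binomial mean is $np$ and the geometric mean is $\frac{1}{p}$. The parameter values are arbitrary.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def geometric_pmf(k, p):
    """P(first success happens on trial k), for k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p


n, p = 10, Fraction(3, 10)
binomial_total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
binomial_mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

# The geometric support is infinite; truncating at 500 terms is an
# assumption that is more than enough for p = 0.3.
geometric_mean = sum(k * geometric_pmf(k, 0.3) for k in range(1, 501))

print(binomial_total, binomial_mean)  # 1 3
print(round(geometric_mean, 6))       # 3.333333
```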

Lecture 15 - April 28

  • Poisson Distribution: we know the average over an interval of time, we can’t use each individual possible source
    • (Kinda) requires each event to be independent
    • We use Poisson because we don’t have good ideas of what the random variables are
    • This is a model - not perfect (most real-world events are somewhat dependent), but it is useful!
    • The PMF involves the Taylor Series for $e^x$
    • It is a way of using the limit as the number of experiments approaches infinity, but the mean stays constant
  • Hypergeometric: drawing without replacement from an urn
  • Continuous random variables:
    • We need continuous probability spaces and continuous random variables to represent uncountably-infinite sample spaces
  • We use the probability density function for continuous random variables:
    • It is a way of comparing probability of being near different events
    • $f_X(k) \geq 0$
    • $\int_{-\infty}^{\infty} f_X(k) dk = 1$
    • We use this because we need the PDF to work for events
  • Integrating is analogous to summation - continuous vs discrete values:
    • $\mathbb{P}(a \leq X \leq b) = c$
    • $\int_{a}^{b} f_X(k) dk = c$
  • Impossible events still have probability 0, but some probability 0 events are still possible for continuous probability spaces
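
The limit view of the Poisson distribution can be observed numerically. This sketch assumes the standard PMFs (binomial, and Poisson as $\frac{\lambda^k e^{-\lambda}}{k!}$) and holds the mean $\lambda = np$ fixed while $n$ grows. The values of `lam` and `k` are arbitrary.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)


lam, k = 3.0, 2
target = poisson_pmf(k, lam)

# More experiments, each less likely to succeed, same mean lam = n * p.
errors = [
    abs(binomial_pmf(k, n, lam / n) - target) for n in (10, 100, 1000, 10000)
]

for n, err in zip((10, 100, 1000, 10000), errors):
    print(n, err)  # the gap shrinks as n grows
```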

Lecture 16 - May 1

  • Continuous random variables require Probability Density Function:
    • Main difference is that we use events vs values
    • The probability of any single value is 0 (typically)
    • The PDF is the function that, when integrated over an event, gives the probability of that event
    • Comparing $f_X(k)$ and $f_X(l)$ gives relative chances of $X$ being near $k$ vs $l$
    • $\mathbb{P}(a \leq X \leq b) = \int_{a}^{b} f_X(z)dz $
    • Sometimes the density will be greater than 1
  • Cumulative distribution function is analogous when using continuous vs discrete random variables:
    • $F_X(k) = \mathbb{P}(X \leq k) = \int_{- \infty}^{k} f_X(z)dz$
  • CDF to PDF by taking the derivative of CDF
    • Undoes the integral
    • Vice-versa works as well
  • General pattern: summation for discrete random variables becomes integration for continuous
  • $\mathbb{E}[X] = \int_{- \infty}^{\infty} z \cdot f_X(z) dz$
  • Expectation of a function of a random variable:
    • $\mathbb{E}[g(X)] = \int_{- \infty}^{\infty} g(z) \cdot f_X(z)dz$
    • “Law of the Unconscious Statistician” ~math nerds
  • Linearity of Expectations still works!
  • Variance: $Var(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \int_{- \infty}^{\infty} f_X(k)(k - \mathbb{E}[X])^2 dk$
  • Expectation of a uniform random variable between $a$ and $b$ is the same as in the discrete case, $\frac{a+b}{2}$
  • Expectation of the uniform random variable squared is: $\frac{a^2 + ab + b^2}{3}$
  • Variance of the variable is: $\frac{(b - a)^2}{12}$
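
The three uniform formulas can be confirmed by numerical integration. This sketch assumes the standard uniform density $\frac{1}{b - a}$ on $[a, b]$ and uses a simple midpoint Riemann sum; the endpoints `a = 2`, `b = 5` and the helper `integrate` are illustrative choices.

```python
def integrate(f, lo, hi, steps=100_000):
    """Midpoint Riemann sum of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


a, b = 2.0, 5.0


def density(z):
    # The density is zero outside [a, b], so we only integrate over [a, b].
    return 1 / (b - a)


total = integrate(density, a, b)
mean = integrate(lambda z: z * density(z), a, b)
second_moment = integrate(lambda z: z**2 * density(z), a, b)
variance = second_moment - mean**2

print(round(total, 6))          # 1.0
print(round(mean, 6))           # 3.5, matches (a + b) / 2
print(round(second_moment, 6))  # 13.0, matches (a^2 + ab + b^2) / 3
print(round(variance, 6))       # 0.75, matches (b - a)^2 / 12
```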

Lecture 17 - May 5

  • Exponential random variable: like geometric random variable, but continuous time
    • “Waiting doesn’t make the event happen sooner” - memoryless
    • $f_X(k) = \lambda e^{- \lambda k}$ for $k \geq 0$
    • $\mathbb{E}[X] = \frac{1}{\lambda}$
    • Same as taking a Poisson random variable and asking “How long until the next event?”
    • $F_X(t) = \mathbb{P}(X \leq t) = 1 - e^{- \lambda t}$
    • Expectation is: $\frac{1}{\lambda}$
    • Variance is: $\frac{1}{\lambda^2}$
  • Gaussian (normal) random variable:
    • Mean: $\mu$
    • Variance: $\sigma^2$
    • $f_X(x) = \frac{1}{\sigma \sqrt{2 \pi}} \cdot e^{- \frac{(x - \mu)^2}{2 \sigma^2}}$
    • $F_X(k)$ has no nice closed form: use the table
    • $\mathbb{E}[X] = \mu$
    • This is the bell curve
    • Scaling or adding to a normal variable results in another normal variable
  • To normalize normal variable $X$: $Y = \frac{X - \mu}{\sigma}$
    • We normalize because the CDF for normal random variables can be super rough
    • We convert to a “standard normal”, round the “z-score” to the hundredths, look up on the table
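
Instead of a printed table, the standard normal CDF can be evaluated with the error function, using the identity $\Phi(z) = \frac{1}{2}\left(1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right)$. The sketch below standardizes first and then looks up the value; the mean and standard deviation in the example are made up.

```python
from math import erf, sqrt


def phi(z):
    """CDF of the standard normal (mean 0, variance 1)."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal X: standardize, then use phi."""
    z = (x - mu) / sigma
    return phi(z)


# Hypothetical example: X is normal with mean 70 and standard deviation 10.
print(round(normal_cdf(80, 70, 10), 4))  # 0.8413, the table entry for z = 1.00
print(round(phi(1.96), 4))               # 0.975
```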

Lecture 18 - May 8

  • Central Limit Theorem (CLT): the normalized sum of many i.i.d. random variables approaches the normal distribution
    • Let $X_1, X_2, \dots , X_n$ be i.i.d. random variables with mean $\mu$ and variance $\sigma ^2$
    • Let $Y_n = \frac{X_1 + X_2 + \dots + X_n - n \mu}{\sigma \sqrt{n}}$
    • Then as $n \rightarrow \infty$, the CDF of $Y_n$ converges to the CDF of $\mathcal{N}(0, 1)$
    • Only equal in the limit!
  • Gaussians often occur in the real world because the random variable is a combination of many independent factors
  • We can often use the Gaussian CDF in practice instead of complicated independent variables
  • Note that $\Phi$ represents the CDF of the Gaussian with mean 0 and variance 1
  • The z-table only has positive values: we use $\Phi(-x) = 1 - \Phi(x)$
  • When we have a discrete variable we’re approximating with a continuous random variable, we must do a continuity correction:
    • We find all values that would round to the correct discrete value
    • Other corrections are typically not worth the effort
  • CLT Usage Outline:
    1. Write down the event you’re interested in, in terms of the sum of random variables
    2. Apply continuity correction if RVs are discrete
    3. Normalize RV to have mean 0 and standard deviation 1
    4. Replace RV with $\mathcal{N}(0,1)$
    5. Write event in terms of $\Phi$
    6. Look up in table
  • Polling: $\mathbb{P}(\vert X - \mathbb{E}[X] \vert \geq s) \leq \epsilon$
    • How often (at most $\epsilon$ of the time) our poll result falls outside an acceptable margin $s$
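
The CLT usage outline can be followed end to end on a small example. This sketch assumes $X$ is the number of heads in 100 fair coin flips (a sum of i.i.d. indicators with mean $np$ and variance $np(1-p)$) and estimates $\mathbb{P}(X \leq 55)$ with and without the continuity correction. The scenario and numbers are invented for illustration.

```python
from math import comb, erf, sqrt


def phi(z):
    """CDF of the standard normal, in place of a table lookup."""
    return 0.5 * (1 + erf(z / sqrt(2)))


n, p = 100, 0.5
mu = n * p                       # mean of the sum
sigma = sqrt(n * p * (1 - p))    # standard deviation of the sum

# Exact answer from the binomial PMF (possible here because p = 1/2).
exact = sum(comb(n, k) for k in range(0, 56)) / 2**n

# Continuity correction: every value below 55.5 rounds to 55 or less.
with_correction = phi((55.5 - mu) / sigma)
without_correction = phi((55 - mu) / sigma)

print(round(exact, 4))               # about 0.8644
print(round(with_correction, 4))     # about 0.8643
print(round(without_correction, 4))  # about 0.8413
```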

Lecture 19 - May 10

  • If two RVs are independent, the probability they both equal some values is equal to the product of their individual probabilities
  • We can use LTP to talk about only one variable (when they’re dependent)
    • This is the marginal distribution, as we marginalized a RV
    • The “marginal” for $X$ is where we marginalize all other RVs
  • Joint expectation is still the sum of the probabilities times the value
    • This is often written as a function
    • Conditional expectations are also intuitively defined
    • LOE still works!
  • Law of Total Expectation: basically the same as LTP, but with expectations
    • $\mathbb{E}[X] = \sum_{i=1}^n \mathbb{E}[X \vert A_i] \mathbb{P}(A_i)$
  • Covariance: how much $X$ and $Y$ change together
    • $\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X] \mathbb{E}[Y]$
    • We often only care about if $\text{Cov}$ is positive or negative
  • $\text{Variance}(X + Y) = \text{Variance}(X) + \text{Variance}(Y) + 2 \text{Cov}(X, Y)$
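
A minimal check of the covariance and variance-of-a-sum formulas, using two dependent random variables on a two-dice sample space: $X$ is the first die and $Y$ is the sum of both dice. These choices are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # two fair dice
p = Fraction(1, len(omega))                   # uniform probability


def E(f):
    """Expectation of a random variable given as a function on outcomes."""
    return sum(f(w) * p for w in omega)


def var(f):
    return E(lambda w: f(w) ** 2) - E(f) ** 2


def X(w):
    return w[0]


def Y(w):
    return w[0] + w[1]  # depends on X


cov = E(lambda w: X(w) * Y(w)) - E(X) * E(Y)
lhs = var(lambda w: X(w) + Y(w))
rhs = var(X) + var(Y) + 2 * cov

print(cov)       # 35/12, positive: X and Y tend to move together
print(lhs, rhs)  # 175/12 175/12
```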

Lecture 20 - May 12

  • Two methods of polling: sampling with vs without replacement
    • We will do the math for sampling with replacement even if this isn’t used in practice
  • We don’t do continuity corrections when we’re polling
  • Accuracy of polling isn’t determined by the total population amount, only the number of people polled
    • Works for idealized polling scenarios with large-ish populations
    • This is due to the way we use the two methods of polling
  • The “margin of error” is an intuitive way of measuring variance
    • “If I performed this poll repeatedly, 95% of the time the correct value will be within +/- the margin of error”
  • Find the number of people necessary to guarantee a margin of error
    • Handling $\sqrt{p(1-p)}$, assume worst case scenario for $p = 0.5$
    • We use the z-table in reverse, find the smallest input that gives the correct output
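
The sample-size calculation can be sketched as follows. It assumes the usual CLT-based relationship, margin $= z \cdot \sqrt{\frac{p(1-p)}{n}}$, which the notes do not state explicitly, and uses `statistics.NormalDist().inv_cdf` as the "z-table in reverse". The function name `people_needed` and the example margins are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def people_needed(margin, confidence=0.95, p=0.5):
    """Smallest n with z * sqrt(p * (1 - p) / n) <= margin.

    p = 0.5 is the worst case because it maximizes sqrt(p * (1 - p)).
    """
    # Two-sided interval: leave (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sqrt(p * (1 - p)) / margin) ** 2)


print(people_needed(0.05))  # 385
print(people_needed(0.03))  # 1068
```

Note that the population size never appears in the calculation, which is consistent with the point above that accuracy depends only on the number of people polled.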

Lecture 21 - May 15

  • We need a way to calculate the bounds with certainty
  • Tail bound: “bounds the size of the tail”
  • Markov Inequality:
    • Let $X$ be a random variable supported only on non-negative numbers:
      • $\mathbb{P}(X \geq t) \leq \frac{\mathbb{E}[X]}{t}$
      • $\mathbb{P}(X \geq k \mathbb{E}[X]) \leq \frac{1}{k}$
    • Non-negative random variable and $k$ or $t$ is positive
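
To see how loose or tight the Markov bound is, this sketch compares the true tail $\mathbb{P}(X \geq t)$ with $\frac{\mathbb{E}[X]}{t}$ for a fair six-sided die, a non-negative random variable chosen for illustration.

```python
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}  # fair die, non-negative
mean = sum(k * p for k, p in pmf.items())       # E[X] = 7/2


def tail(t):
    """Exact P(X >= t)."""
    return sum(p for k, p in pmf.items() if k >= t)


# (t, exact tail, Markov bound) for each positive threshold t.
checks = [(t, tail(t), mean / t) for t in range(1, 7)]

for t, exact, bound in checks:
    print(t, exact, bound, exact <= bound)
```

The bound always holds but is often far from the true tail; for small $t$ it exceeds 1 and says nothing.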