Week 6: Universality of the Uniform, Normal, Expo, and Moments

0. Logistical Info

  • Section date: 10/25
  • Associated lectures: 10/17, 10/19
  • Associated pset: Pset 6, due 10/27
  • Office hours on 10/25 from 7-9pm at Quincy Dining Hall
  • Please reach out if you wanted to sign up for a midterm debrief but missed the chance
  • Remember to fill out the attendance form

0.1 Summary + Practice Problem PDFs

Summary + Practice Problems PDF

Practice Problem Solutions PDF

1. Universality of the Uniform

Recall that the standard uniform, $U \sim \mathrm{Unif}(0, 1)$, has support $(0, 1)$ with PDF $1$ in the support.

Universality of the Uniform (UoU): If $F$ is a valid CDF that is continuous and strictly increasing over the support, then

  1. Let $U \sim \mathrm{Unif}(0, 1)$. Then $F^{-1} (U)$ is a random variable with CDF $F$.
  2. Let $X$ have CDF $F$. Then $F(X) \sim \mathrm{Unif}(0,1)$.

The first result extends to discrete random variables as well, provided $F^{-1}$ is interpreted as the generalized inverse (quantile function) $F^{-1}(u) = \min\{x : F(x) \ge u\}$. The second result only works for continuous random variables.

This result is quite useful for simulation: if you have access to draws from a Uniform distribution, you can transform them into draws from any distribution with a known inverse CDF.
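As a concrete sketch of part 1, the snippet below turns standard uniform draws into $\mathrm{Expo}(\lambda)$ draws via the inverse CDF $F^{-1}(u) = -\ln(1-u)/\lambda$; the rate $\lambda = 2$ and sample size are arbitrary choices for illustration.

```python
import math
import random

random.seed(0)
lam = 2.0  # rate parameter; an arbitrary choice for this sketch

def expo_inverse_cdf(u, lam):
    # For Expo(lam): F(x) = 1 - e^{-lam * x}, so F^{-1}(u) = -ln(1 - u) / lam
    return -math.log(1.0 - u) / lam

# Transform Unif(0, 1) draws into Expo(lam) draws
draws = [expo_inverse_cdf(random.random(), lam) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
print(round(sample_mean, 2))  # should be close to 1/lam = 0.5
```

A histogram of `draws` would match the $\mathrm{Expo}(2)$ PDF; the sample mean landing near $1/\lambda$ is a quick sanity check.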

We can prove UoU with the tools we’ve learned in class. For continuous random variables with $F$ as described in the theorem,

  1. For $x \in \mathbb{R}$, \begin{align*} P(F^{-1}(U) \le x) = P(F(F^{-1}(U)) \le F(x)) = P(U \le F(x)) = F(x). \end{align*} So $F^{-1}(U)$ has CDF $F$. Applying the strictly increasing function $F$ to both sides preserves the inequality, and we used the CDF of $U$ in the last step, since $F(x) \in [0, 1]$.
  2. For $u \in (0, 1)$, \begin{align*} P(F(X) \le u) &= P\left(F^{-1}(F(X)) \le F^{-1}(u)\right)\\ &= P(X \le F^{-1}(u)) = F(F^{-1}(u)) = u, \end{align*} so $F(X) \sim \mathrm{Unif}(0, 1)$, since this is the CDF of a standard uniform.

2. Normal distribution

2.1 Standard Normal

$Z \sim \mathcal{N}(0, 1)$ is a standard Normal random variable with support $\mathbb R$. We notate the CDF as $\Phi$ and PDF as $\phi$.

  • (Symmetry) The standard Normal is symmetric about $0$. In math, for $x \in \mathbb R$, $\phi(x) = \phi(-x)$.
    • This also implies that $\Phi(x) = 1 - \Phi(-x)$.
    • So $\Phi(0) = 0.5$.
    • For $Z \sim \mathcal{N}(0, 1)$, $-Z \sim \mathcal{N}(0, 1)$ as well.
  • (Empirical rule/68-95-99.7 rule) \begin{align*} P(-1 < Z < 1) &\approx 0.68,\\ P(-2 < Z < 2) &\approx 0.95,\\ P(-3 < Z < 3) &\approx 0.997. \end{align*}

In this class, you can give exact answers in terms of $\Phi$ and $\phi$. On psets, you should also use a calculator/programming language/the empirical rule to get numerical approximations of $\Phi$.
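For numerical values, you don't need a statistics package: the standard Normal CDF can be written in terms of the error function, $\Phi(x) = \frac{1}{2}\left(1 + \mathrm{erf}(x/\sqrt{2})\right)$. A small Python sketch:

```python
import math

def Phi(x):
    # Standard Normal CDF via the error function:
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2
    return (1.0 + math.erf(x / math.sqrt(2.0))) / 2.0

print(Phi(0.0))                         # 0.5, by symmetry
print(round(Phi(1.0) - Phi(-1.0), 3))  # ~0.683, matching the empirical rule
print(round(Phi(2.0) - Phi(-2.0), 3))  # ~0.954
```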

2.2 Normal

$X \sim \mathcal{N}(\mu, \sigma^2)$ (with $\mu \in \mathbb R, \sigma > 0$) is a Normal random variable with mean $\mu$ and variance $\sigma^2$, and also has support $\mathbb R$.

  • (Location-scale) For $Z \sim \mathcal{N}(0, 1)$, $\mu + \sigma Z \sim \mathcal{N}(\mu, \sigma^2)$.

    More generally, for $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$, $\mu_2 + \sigma_2 X \sim \mathcal{N}(\mu_2 + \mu_1 \sigma_2, \sigma_1^2 \sigma_2^2)$.

  • (Standardization) For $X \sim \mathcal{N}(\mu, \sigma^2)$, $\frac{X-\mu}{\sigma} \sim \mathcal{N}(0, 1)$. We often use this to get results in terms of $\Phi$: \begin{align*} P(X < x) = P\left(\frac{X-\mu}{\sigma} < \frac{x-\mu}{\sigma}\right) = \Phi\left(\frac{x-\mu}{\sigma}\right). \end{align*}

  • (Empirical rule) For $X \sim \mathcal{N}(\mu, \sigma^2)$, \begin{align*} P(\mu-\sigma < X < \mu+\sigma) &\approx 0.68\\ P(\mu-2\sigma < X < \mu+2\sigma) &\approx 0.95\\ P(\mu-3\sigma < X < \mu+3\sigma) &\approx 0.997 \end{align*}

  • (Sum of independent Normals) Let $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$ with $X, Y$ independent. Then \begin{align*} X + Y &\sim \mathcal{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2),\\ X - Y &\sim \mathcal{N}(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2). \end{align*}

(Variance when subtracting) Note that we always add the variances above! This is a general rule: for any independent random variables $X$ and $Y$, \begin{align*} Var(X+Y) = Var(X - Y) = Var(X) + Var(Y). \end{align*} This is consistent with the fact that $Var(-Y) = (-1)^2 Var(Y) = Var(Y)$.
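The add-the-variances rule is easy to check by simulation. A hedged sketch, with the parameters $X \sim \mathcal{N}(1, 2^2)$ and $Y \sim \mathcal{N}(3, 1.5^2)$ chosen arbitrarily:

```python
import random
import statistics

random.seed(0)
n = 200_000
# Independent draws: X ~ N(1, 2^2), Y ~ N(3, 1.5^2)
xs = [random.gauss(1.0, 2.0) for _ in range(n)]
ys = [random.gauss(3.0, 1.5) for _ in range(n)]

var_sum = statistics.pvariance([x + y for x, y in zip(xs, ys)])
var_diff = statistics.pvariance([x - y for x, y in zip(xs, ys)])
# Both should be close to Var(X) + Var(Y) = 4 + 2.25 = 6.25
print(round(var_sum, 1), round(var_diff, 1))
```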

3. Exponential distribution

$X \sim \mathrm{Expo}(\lambda)$ is an Exponential random variable with support $(0, \infty)$, mean $\frac{1}{\lambda}$, and variance $\frac{1}{\lambda^2}$. $\lambda$ is called the rate parameter.

  • (Memorylessness) For $X \sim \mathrm{Expo}(\lambda)$ and any $s, t > 0$, the memoryless property of the Exponential distribution states the following (equivalent) results: \begin{align*} P(X > s + t \vert X > s) &= P(X > t)\\ (X - s \vert X > s) &\sim \mathrm{Expo}(\lambda). \end{align*} Note in particular that the distribution of $X-s \vert X>s$ does not depend on the value of $s$.

    The Exponential distribution is the only continuous distribution with this property. Additionally, the Geometric distribution is the only memoryless discrete distribution with support $\{0, 1, 2, \ldots\}$.

For most results we talk about, you can't put a random variable in the place of a constant. You might recall from last week's problem set that the sum of $N$ independent $\mathrm{Pois}(\lambda)$ r.v.s, with $N$ random, is not distributed $\mathrm{Pois}(N\lambda)$. However, with memorylessness, you can put a random variable in the place of the $s$ above: for a random variable $Y$ independent of $X$, $P(X > t + Y | X > Y) = P(X > t)$ and $(X-Y|X > Y) \sim \mathrm{Expo}(\lambda)$ still hold.

Proof

We can prove this using LOTP and the constant version of memorylessness. We'll assume $Y$ is discrete here, but the continuous case is analogous (swap sums for integrals and PMFs for PDFs). \begin{align*} P(X > t+Y | X > Y) &= \sum_{y} P(X > t+y | X > Y, Y = y) P(Y=y)\\ &= \sum_{y} P(X > t+y | X > y, Y = y) P(Y=y). \end{align*}

We'll take a brief sidebar to show that $P(X > t + y | X > y, Y = y) = P(X > t+y | X>y)$. (You can jump from the former to the latter using the independence of $X$ and $Y$, since the extra condition is a function of $X$ alone, but we'll be explicit here.) We will use the definition of conditional probability, the fact that $X>t+y$ implies $X>y$, and the independence of $X$ and $Y$. \begin{align*} P(X > t+y | X > y, Y = y) &= \frac{P(X > t+y, X > y, Y =y)}{P(X > y, Y = y)}\\ &= \frac{P(X > t+y, Y =y)}{P(X > y, Y =y)}\\ &= \frac{P(X > t+y) P(Y = y)}{P(X>y)P(Y=y)}\\ &= \frac{P(X>t+y)}{P(X>y)}\\ &= \frac{P(X>t+y, X>y)}{P(X>y)}\\ &= P(X>t+y | X>y). \end{align*}

With this information, \begin{align*} P(X > t+Y | X >Y) &= \sum_y P(X>t+y|X>y, Y=y) P(Y=y)\\ &= \sum_y P(X>t+y|X>y) P(Y=y)\\ &= \sum_y P(X>t) P(Y=y)\\ &= P(X>t) \sum_y P(Y=y)\\ &= P(X>t)(1) = P(X>t), \end{align*} where we use memorylessness to say $P(X>t+y | X>y) = P(X>t)$.
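The random-threshold version of memorylessness can also be checked numerically. In this sketch, the choices $\lambda = 1$, $t = 1$, and $Y \sim \mathrm{Unif}(0, 2)$ are arbitrary:

```python
import math
import random

random.seed(0)
lam, t, trials = 1.0, 1.0, 400_000

# X ~ Expo(lam) with an independent random threshold Y ~ Unif(0, 2)
pairs = [(random.expovariate(lam), random.uniform(0.0, 2.0)) for _ in range(trials)]

# Estimate P(X > t + Y | X > Y) and compare with P(X > t) = e^{-lam * t}
survivors = [(x, y) for x, y in pairs if x > y]
cond = sum(x > t + y for x, y in survivors) / len(survivors)
print(round(cond, 2), round(math.exp(-lam * t), 2))  # the two should agree
```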

  • (Example of Memorylessness) Suppose you’re waiting for a bus that will arrive in $X \sim \mathrm{Expo}(\lambda)$ minutes. If you wait for the bus for 10 minutes and it has not arrived, then the remaining time that you have to wait is still distributed $\mathrm{Expo}(\lambda)$: $X - 10 | X > 10 \sim \mathrm{Expo}(\lambda)$. So no matter how long you wait, the remaining time for you to wait has the same distribution.
  • (Minimum of Expos) The minimum of $n$ i.i.d. $\mathrm{Expo}(\lambda)$ random variables is distributed $\mathrm{Expo}(n\lambda)$. In notation, for $X_1, \ldots, X_n \overset{i.i.d.}{\sim} \mathrm{Expo}(\lambda)$, $\min(X_1, \ldots, X_n) \sim \mathrm{Expo}(n\lambda)$.
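The minimum-of-Expos result is another easy simulation check; here $\lambda = 1$ and $n = 5$ are arbitrary choices:

```python
import random

random.seed(0)
lam, n_vars, trials = 1.0, 5, 100_000

# The min of n i.i.d. Expo(lam) draws should behave like Expo(n * lam)
mins = [min(random.expovariate(lam) for _ in range(n_vars)) for _ in range(trials)]
mean_min = sum(mins) / trials
print(round(mean_min, 3))  # should be close to 1/(n_vars * lam) = 0.2
```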

Maximum of Expos

The maximum of $n$ i.i.d. Exponential random variables does not follow an Exponential distribution.

Finding the distribution of minimums/maximums

The results above can be found in the book, but they also illustrate a general template for finding the distributions of minimums and maximums.

Let $X_1, \ldots, X_n$ be any random variables. Then the events $\{\min(X_1, \ldots, X_n) > x\}$ and $(X_1 > x) \cap (X_2 > x) \cap \cdots \cap (X_n > x)$ are equivalent. To convince yourself of this, think about what this means in words: the minimum of a set of numbers is greater than $x$ if and only if each one of the numbers is greater than $x$.

To find the CDF of $\min(X_1, \ldots, X_n)$, a common workflow is \begin{align*} P(\min(X_1, \ldots, X_n) \le x) &= 1 - P(\min(X_1, \ldots, X_n) > x) = 1 - P(X_1 > x, X_2 > x, \ldots, X_n > x). \end{align*} If $X_1, \ldots, X_n$ are independent, then \begin{align*} P(X_1 > x, X_2 > x, \ldots, X_n > x) &= P(X_1 > x) P(X_2 > x) \cdots P(X_n > x). \end{align*} If $X_1, \ldots, X_n$ are also identically distributed, we conclude with \begin{align*} P(X_1 > x) P(X_2 > x) \cdots P(X_n > x) &= (P(X_1 > x))^n. \end{align*}

For maximums, we follow a similar workflow, except instead using the fact that $$\{\max(X_1, \ldots, X_n) \le x\} = \bigcap_{i=1}^n \{X_i \le x\}.$$
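The template can be verified numerically: for i.i.d. $\mathrm{Expo}(\lambda)$ r.v.s, it gives $P(\max \le x) = (1 - e^{-\lambda x})^n$. The parameters below ($\lambda = 1$, $n = 3$, $x = 1$) are arbitrary:

```python
import math
import random

random.seed(0)
lam, n_vars, trials, x = 1.0, 3, 200_000, 1.0

maxes = [max(random.expovariate(lam) for _ in range(n_vars)) for _ in range(trials)]
empirical = sum(m <= x for m in maxes) / trials
theoretical = (1.0 - math.exp(-lam * x)) ** n_vars  # (F(x))^n with the Expo CDF
print(round(empirical, 2), round(theoretical, 2))  # the two should agree
```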

4. Moments/Moment Generating Functions

For a random variable $X$, the $\mathbf{n^{th}}$ moment is $E(X^n)$.

Moment Generating Function

For a random variable $X$, the moment generating function (MGF) is $M_X(t) = E(e^{tX})$ for $t \in \mathbb{R}$. If the MGF exists (i.e., is finite on an interval around $0$), then \begin{align*} M_X(0) &= 1,\\ M_X^{(n)}(0) = \frac{d^n}{dt^n} M_X(t) \Big|_{t=0} &= E(X^n). \end{align*} You should sanity-check that $M_X(0) = 1$ whenever you calculate an MGF.
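As a sanity check on these properties, you can estimate an MGF by simulation: average $e^{tX}$ over draws, confirm the value $1$ at $t = 0$, and approximate $M_X'(0) = E(X)$ with a central difference. The sketch below uses $X \sim \mathrm{Expo}(2)$, where $E(X) = 1/2$ (an arbitrary choice):

```python
import math
import random

random.seed(0)
lam, trials, h = 2.0, 200_000, 1e-3
xs = [random.expovariate(lam) for _ in range(trials)]

def mgf(t):
    # Empirical MGF: sample average of e^{tX}
    return sum(math.exp(t * x) for x in xs) / trials

print(mgf(0.0))  # exactly 1: e^{0*X} = 1 for every draw
first_moment = (mgf(h) - mgf(-h)) / (2 * h)  # central difference ~ M_X'(0) = E(X)
print(round(first_moment, 2))  # should be close to 1/lam = 0.5
```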
