{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "727d7bb6",
      "metadata": {
        "id": "727d7bb6"
      },
      "source": [
        "# # Exam exercise: Probalilities, CLT and boxplots\n",
        "\n",
        "It is highly recommended that you answer the exam using Rmarkdown\n",
        "(you can simply use the exam Rmarkdown file as a starting point).\n",
        "\n",
        "# Part I: Estimating probabilities\n",
        "\n",
        "Remember to load packages first:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a499a167",
      "metadata": {
        "id": "a499a167"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "import matplotlib.pyplot as plt\n",
        "import scipy.stats as stats"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "4e3378b8",
      "metadata": {
        "id": "4e3378b8"
      },
      "source": [
        "## EU climate data\n",
        "\n",
        "In a recent survey from Eurobarometer you can extract data for\n",
        "response to the following question:\n",
        "\n",
        "*Do you consider climate change to be the single most serious problem facing the world as a whole?*\n",
        "\n",
        "The data are divided according to whether the respondent comes from\n",
        "Denmark or not."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "413313a4",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "413313a4",
        "outputId": "0d815287-cc3b-42b5-f208-173a606523ec"
      },
      "outputs": [],
      "source": [
        "data = [[309, 4918],\n",
        "        [1010, 25830]]\n",
        "rows = [\"Denmark\", \"Rest of EU\"]\n",
        "cols = [\"Yes\", \"No\"]\n",
        "climate = pd.DataFrame(data, index=rows, columns=cols)\n",
        "print(climate)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "7414b749",
      "metadata": {
        "id": "7414b749"
      },
      "source": [
        "-   Estimate the probability of answering \"Yes\" to the question.\n",
        "\n",
        "-   Make a 95% confidence interval for the probability of answering \"Yes\".\n",
        "\n",
        "-   Estimate the probability of answering \"Yes\" given that you come\n",
        "    from Denmark.\n",
        "\n",
        "-   What would the true population probabilities satisfy if `origin`\n",
        "    and `answer` were\n",
        "    statistically independent? Based on your results do you think they\n",
        "    are independent?"
      ]
    },
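    {
      "cell_type": "markdown",
      "id": "b3d9e101",
      "metadata": {
        "id": "b3d9e101"
      },
      "source": [
        "As a minimal sketch of the kind of calculation the questions above call for, the following estimates a proportion, a large-sample (Wald) 95% confidence interval $\\hat{p} \\pm z_{0.975}\\sqrt{\\hat{p}(1-\\hat{p})/n}$, and a conditional proportion. The table `tab` and its group labels are made-up illustration values, not the survey data, and the choice of the Wald interval is an assumption.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import scipy.stats as stats\n",
        "\n",
        "# Hypothetical 2x2 table for illustration only (not the Eurobarometer data)\n",
        "tab = pd.DataFrame([[30, 70], [90, 310]],\n",
        "                   index=['Group A', 'Group B'], columns=['Yes', 'No'])\n",
        "\n",
        "n = tab.values.sum()                    # total number of respondents\n",
        "p_hat = tab['Yes'].sum() / n            # estimated probability of 'Yes'\n",
        "\n",
        "se = np.sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error of p_hat\n",
        "z = stats.norm.ppf(0.975)               # about 1.96 for a 95% interval\n",
        "ci = (p_hat - z * se, p_hat + z * se)\n",
        "\n",
        "# Conditional estimate: 'Yes' among the respondents in Group A only\n",
        "p_yes_given_a = tab.loc['Group A', 'Yes'] / tab.loc['Group A'].sum()\n",
        "\n",
        "print(f'p_hat = {p_hat:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})')\n",
        "print(f'P(Yes | Group A) = {p_yes_given_a:.3f}')\n",
        "```\n",
        "\n",
        "If the two classifications were independent, the conditional and the overall probability would agree in the population; in a sample they will differ somewhat by chance alone."
      ]
    },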
    {
      "cell_type": "markdown",
      "id": "fa8dde8b",
      "metadata": {
        "id": "fa8dde8b"
      },
      "source": [
        "# Part II: Sampling distributions and the central limit theorem\n",
        "\n",
        "This is a purely theoretical exercise where we investigate the random\n",
        "distribution of samples from a known population.\n",
        "\n",
        "## House prices in Denmark\n",
        "\n",
        "The Danish real estate agency HOME has a database containing approximately\n",
        "80,000 house prices for one-family houses under DKK 10 million for the period\n",
        "2004-2016. The house prices (without all the additional information such as\n",
        "house size, address etc.) are available as a **R** data file `Home.RData` on the\n",
        "course webpage. Since the format is intended for **R** rather than Python, we will\n",
        "need a couple of packages to read the data into Python. One of them is not installed into Google Colab, so we need to install this first (If you are using a different platform than Google Colab, you need to install the packages, the way you usually do this)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "21qebS-xACwR",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "21qebS-xACwR",
        "outputId": "fda22901-8474-4e1f-9373-d50887e70d42"
      },
      "outputs": [],
      "source": [
        "!pip install pyreadr\n",
        "\n",
        "import pyreadr\n",
        "import requests"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "565138e8",
      "metadata": {
        "id": "565138e8"
      },
      "outputs": [],
      "source": [
        "# URL to the RData file\n",
        "url = \"https://asta.math.aau.dk/datasets?file=Home.RData\"\n",
        "\n",
        "# Download the file\n",
        "file_path = \"Home.RData\"\n",
        "r = requests.get(url)\n",
        "with open(file_path, \"wb\") as f:\n",
        "    f.write(r.content)\n",
        "\n",
        "# Load the RData file\n",
        "result = pyreadr.read_r(file_path)\n",
        "price = np.array(result['price']).flatten()   # Note: 'price' is the only variable in the dataset, so we give this a convenient form and a short name"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "401c8cad",
      "metadata": {
        "id": "401c8cad"
      },
      "source": [
        "Make a histogram of all the house prices inserted in a new code chunk\n",
        "(try to do experiments with the number of bins):\n",
        "\n",
        "- Explain how a histogram is constructed.\n",
        "- Does this histogram look like a normal distribution?\n",
        "\n",
        "In this database (our population) the mean price and the\n",
        "standard deviation is is given by the following:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "ac074f00",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ac074f00",
        "outputId": "97947c8b-f2a0-40dc-df47-7d14a787ac4f"
      },
      "outputs": [],
      "source": [
        "print(price.mean())\n",
        "print(price.std())"
      ]
    },
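    {
      "cell_type": "markdown",
      "id": "b3d9e102",
      "metadata": {
        "id": "b3d9e102"
      },
      "source": [
        "Returning to how a histogram is constructed: the range of the data is split into bins, and the bar over each bin shows how many observations fall in it (or, on the density scale, the relative frequency divided by the bin width). A minimal sketch with made-up numbers, using `np.histogram`, which computes the same counts that `plt.hist` draws:\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "# Hypothetical data for illustration only (not the house prices)\n",
        "values = np.array([1.2, 1.9, 2.1, 2.4, 2.5, 3.0, 3.3, 4.8])\n",
        "\n",
        "# Split the range [1, 5] into 4 equal-width bins and count the values in each\n",
        "counts, bin_edges = np.histogram(values, bins=4, range=(1.0, 5.0))\n",
        "\n",
        "# Density scale: relative frequency divided by bin width, so the bar areas sum to 1\n",
        "widths = np.diff(bin_edges)\n",
        "density = counts / (counts.sum() * widths)\n",
        "\n",
        "print(bin_edges)  # bin boundaries\n",
        "print(counts)     # number of observations per bin\n",
        "print(density)    # bar heights when density=True\n",
        "```\n",
        "\n",
        "Changing `bins` changes the bin width, which is why the same data can look smoother or more ragged depending on the number of bins."
      ]
    },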
    {
      "cell_type": "markdown",
      "id": "581e2b7a",
      "metadata": {
        "id": "581e2b7a"
      },
      "source": [
        "In many cases access to such databases is restrictive and in the following we\n",
        "imagine that we are only allowed access to a random sample of 40 prices and the\n",
        "mean of this sample will be denoted `y_bar`.\n",
        "\n",
        "Before obtaining this sample we will use the Central Limit Theorem (CLT) to\n",
        "predict the distribution of `y_bar`:\n",
        "\n",
        "- What is the expected value of `y_bar`?\n",
        "\n",
        "- What is the standard deviation of `y_bar` (also called the standard error)?\n",
        "\n",
        "- What is the approximate distribution of `y_bar`?\n",
        "\n",
        "Now make a random sample of 40 house prices and calculate the sample mean:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "1e8aa547",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "1e8aa547",
        "outputId": "111ec705-6b9f-4ea8-9c89-1365239ce1e9"
      },
      "outputs": [],
      "source": [
        "y = np.random.choice(price, 40, replace=False)  # sample 40 without replacement\n",
        "mean_y = np.mean(y)\n",
        "print(mean_y)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "dc1fadeb",
      "metadata": {
        "id": "dc1fadeb"
      },
      "source": [
        "Repeat this command a few times. Is each mean price close to what you expected?\n",
        "\n",
        "Use the following code to repeat the sampling 500 times and save each mean value in the\n",
        "array `y_bar`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "555e0ba6",
      "metadata": {
        "id": "555e0ba6"
      },
      "outputs": [],
      "source": [
        "y_bar = np.array([np.mean(np.random.choice(price, 40, replace=False)) for _ in range(500)])"
      ]
    },
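    {
      "cell_type": "markdown",
      "id": "b3d9e103",
      "metadata": {
        "id": "b3d9e103"
      },
      "source": [
        "For comparison, the CLT says that the mean of a random sample of size $n$ is approximately normal with mean $\\mu$ and standard deviation $\\sigma/\\sqrt{n}$. The sketch below checks this by simulation on a made-up, skewed population; the exponential population, the seed and the names `population`, `se_clt` and `means` are assumptions for illustration, not part of the exercise data.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "rng = np.random.default_rng(1)  # seeded generator so the sketch is reproducible\n",
        "\n",
        "# Hypothetical skewed population (illustration only, not the house prices)\n",
        "population = rng.exponential(scale=2.0, size=100_000)\n",
        "mu = population.mean()\n",
        "sigma = population.std()        # population standard deviation (ddof=0)\n",
        "\n",
        "n = 40\n",
        "se_clt = sigma / np.sqrt(n)     # CLT prediction for the sd of the sample mean\n",
        "\n",
        "# Simulate 500 sample means, each from a sample of size n without replacement\n",
        "means = np.array([rng.choice(population, n, replace=False).mean()\n",
        "                  for _ in range(500)])\n",
        "\n",
        "print('predicted mean and standard error:', mu, se_clt)\n",
        "print('simulated mean and standard deviation:', means.mean(), means.std(ddof=1))\n",
        "```\n",
        "\n",
        "The simulated values should land close to the predictions, but not exactly on them, since 500 repetitions still leave some simulation noise."
      ]
    },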
    {
      "cell_type": "markdown",
      "id": "6602ff8c",
      "metadata": {
        "id": "6602ff8c"
      },
      "source": [
        "Calculate the mean and standard deviation of the values in `y_bar`.\n",
        "\n",
        "- How do they match with what you expected?\n",
        "\n",
        "- Make a histogram of the values in `y_bar` and add the density curve for the\n",
        "approximate distribution you predicted previously using `gf_dist`.\n",
        "For example if you predicted a normal distribution with\n",
        "mean 2 and standard deviation 0.25:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f624d766",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 430
        },
        "id": "f624d766",
        "outputId": "125cc287-4721-4da3-fa48-ccf8dd74108b"
      },
      "outputs": [],
      "source": [
        "# Change these values based on your data\n",
        "mean_val = 2\n",
        "sd_val = 0.25\n",
        "\n",
        "# Plot histogram (and save the range of x-values (bin_edges) for plotting the normal curve below)\n",
        "count, bins_edges, _ = plt.hist(y_bar, density=True, alpha=0.6, color='blue', edgecolor='black')\n",
        "\n",
        "# Overlay normal distribution curve\n",
        "x = np.linspace(min(bins_edges), max(bins_edges), 1000)\n",
        "plt.plot(x, stats.norm.pdf(x, loc=mean_val, scale=sd_val), color='red', linewidth=2)\n",
        "\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "632dbb77",
      "metadata": {
        "id": "632dbb77"
      },
      "source": [
        "- Make a boxplot of `y_bar` and explain how a boxplot is constructed."
      ]
    },
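    {
      "cell_type": "markdown",
      "id": "b3d9e104",
      "metadata": {
        "id": "b3d9e104"
      },
      "source": [
        "The ingredients of a boxplot can be computed by hand, which may help when explaining its construction. The sketch below uses made-up numbers and the default linear-interpolation quartiles of `np.percentile`; other software may use slightly different quartile conventions.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "# Hypothetical data for illustration only\n",
        "values = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])\n",
        "\n",
        "q1, median, q3 = np.percentile(values, [25, 50, 75])  # the box and its centre line\n",
        "iqr = q3 - q1                                         # interquartile range\n",
        "\n",
        "# Fences of the 1.5 * IQR criterion; points outside them are drawn as outliers\n",
        "lower_fence = q1 - 1.5 * iqr\n",
        "upper_fence = q3 + 1.5 * iqr\n",
        "outliers = values[(values < lower_fence) | (values > upper_fence)]\n",
        "\n",
        "# Whiskers extend to the most extreme observations that are not outliers\n",
        "inside = values[(values >= lower_fence) & (values <= upper_fence)]\n",
        "whiskers = (inside.min(), inside.max())\n",
        "\n",
        "print('quartiles:', q1, median, q3)\n",
        "print('fences:', lower_fence, upper_fence)\n",
        "print('whiskers:', whiskers, 'outliers:', outliers)\n",
        "```\n",
        "\n",
        "Note that the whiskers stop at actual observations, so they are usually shorter than the fences themselves."
      ]
    },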
    {
      "cell_type": "markdown",
      "id": "a1020217",
      "metadata": {
        "id": "a1020217"
      },
      "source": [
        "# Part III: Theoretical boxplot for a normal distribution\n",
        "\n",
        "Finally, consider the theoretical boxplot of a general normal distribution with\n",
        "mean $\\mu$ and standard deviation $\\sigma$, and find the probability of being an\n",
        "outlier according to the 1.5 $\\cdot$ IQR criterion:\n",
        "\n",
        "- First find the $z$-score of the lower/upper quartile. I.e. the value of $z$ such that\n",
        "  $\\mu \\pm z\\sigma$ is the lower/upper quartile.\n",
        "\n",
        "- Use this to find the IQR (expressed in terms of $\\sigma$).\n",
        "\n",
        "- Now find the $z$-score of the maximal extent of the whisker. I.e. the value of $z$ such that\n",
        "  $\\mu \\pm z\\sigma$ is the endpoint of lower/upper whisker.\n",
        "\n",
        "- Find the probability of being an outlier.\n"
      ]
    }
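    ,
    {
      "cell_type": "markdown",
      "id": "b3d9e105",
      "metadata": {
        "id": "b3d9e105"
      },
      "source": [
        "The two tools needed for these steps are the normal quantile function and the normal tail probability. As a hedged sketch, the block below shows both in `scipy.stats` for a made-up probability level of 0.90 and made-up values of `mu` and `sigma`; it deliberately does not use the quartiles asked about above.\n",
        "\n",
        "```python\n",
        "import scipy.stats as stats\n",
        "\n",
        "mu, sigma = 10.0, 2.0   # hypothetical mean and standard deviation\n",
        "p = 0.90                # hypothetical probability level, for illustration\n",
        "\n",
        "z = stats.norm.ppf(p)   # z-score with P(Z <= z) = p for a standard normal Z\n",
        "x = mu + z * sigma      # the corresponding quantile of the normal with mean mu, sd sigma\n",
        "\n",
        "# Tail probability beyond x, computed in two equivalent ways\n",
        "tail_standard = stats.norm.sf(z)\n",
        "tail_general = stats.norm.sf(x, loc=mu, scale=sigma)\n",
        "\n",
        "print(z, x, tail_standard, tail_general)\n",
        "```\n",
        "\n",
        "Because the tail probability depends only on the $z$-score, the answer to the outlier question is the same for every choice of $\\mu$ and $\\sigma$."
      ]
    }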
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.12"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}