{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f540ca19",
   "metadata": {},
   "source": [
    "# Exam exercise: Logistic regression analysis of Berkely admission data\n",
    "\n",
    "First load the packages:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a154e390",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import statsmodels.api as sm\n",
    "import statsmodels.formula.api as smf"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13bd111c",
   "metadata": {},
   "source": [
    "The following table shows the total number of admitted and rejected\n",
    "applicants to the six largest departments at University of Berkeley in 1973.\n",
    "\n",
    "|       | Admitted| Rejected|\n",
    "|:------|--------:|--------:|\n",
    "|Male   |     1198|     1493|\n",
    "|Female |      557|     1278|\n",
    "\n",
    "Use a $\\chi^2$-test to check whether the admission statistics for\n",
    "Berkeley show any sign of gender discrimination. To enter the table\n",
    "in Python you can do:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27d765cd",
   "metadata": {},
   "outputs": [],
   "source": [
    "admit = pd.DataFrame(\n",
    "    [[1198, 1493],\n",
    "     [ 557, 1278]],\n",
    "    index=[\"Male\", \"Female\"],\n",
    "    columns=[\"Admitted\", \"Rejected\"]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11479fc5",
   "metadata": {},
   "source": [
    "Your analysis should as a minimum contain **arguments** that support: \n",
    "\n",
    "- Statement of hypotheses\n",
    "- Calculation of expected frequencies\n",
    "- Calculation of test statistic\n",
    "- Calculation and interpretation of p-value."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0e08f277",
   "metadata": {},
   "source": [
    "A more detailed data set with the admissions for each department is \n",
    "available on the course web page. The variables are:\n",
    "\n",
    "- `Gender` (male/female)\n",
    "- `Dept` (department A, B, C, D, E, F)\n",
    "- `Admit` (frequency of admitted for each combination)\n",
    "- `Reject` (frequency of rejected for each combination)\n",
    "\n",
    "Load the data into Python:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d42c830c",
   "metadata": {},
   "outputs": [],
   "source": [
    "admission = pd.read_csv(\"http://asta.math.aau.dk/dan/static/datasets?file=admission.dat\",sep='\\s+',header=0)\n",
    "admission"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c4105ad",
   "metadata": {},
   "source": [
    "In order to do logistic regression for this kind of data, the response is the columns `Admit` and `Reject` (which means that we model the probability of admit) :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bcd7b287",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create proportion response and total counts\n",
    "admission['Total'] = admission['Admit'] + admission['Reject']\n",
    "admission['Prop'] = admission['Admit'] / admission['Total']\n",
    "\n",
    "# fit model\n",
    "m0 = smf.glm('Prop ~ Gender + Dept',\n",
    "             data=admission,\n",
    "             family=sm.families.Binomial(),\n",
    "             freq_weights=admission['Total'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e64877f7",
   "metadata": {},
   "source": [
    "The glm-object `m0` is a logistic model with main effects of `Gender`\n",
    "and `Department`.\n",
    "\n",
    "- Investigate whether there is any effect of these predictors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e658f53d",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(m0.fit().summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "31230a8e",
   "metadata": {},
   "source": [
    "Looking at the summary of `m0`:\n",
    "\n",
    "- Is there a significant gender difference?\n",
    "- What is the interpretation of the numbers in the `Dept[T.B]`-row?\n",
    "\n",
    "We add the standardized residuals to `admission`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c90019b",
   "metadata": {},
   "outputs": [],
   "source": [
    "n = admission['Total'].to_numpy()\n",
    "y = admission['Admit'].to_numpy()\n",
    "mu = m0.fit().fittedvalues * n\n",
    "h = m0.fit().get_influence().hat_matrix_diag\n",
    "rstandard = (y - mu) / np.sqrt(n * m0.fit().fittedvalues * (1 - m0.fit().fittedvalues)) / np.sqrt(1 - h)\n",
    "admission['stdRes'] = rstandard\n",
    "print(admission)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "971f72ff",
   "metadata": {},
   "source": [
    "- Looking at the standardized residuals, which department deviates\n",
    "  heavily from the model?\n",
    "- What gender is discrimated in this department?\n",
    "  \n",
    "Next you should fit the model with the interaction `Gender*Dept`.\n",
    "\n",
    "- Explain what interaction means in the current context.\n",
    "- Is there a significant interaction?\n",
    "- In the light of your analysis, explain the reason for your\n",
    "  answer to the previous question."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}