Interpreting Data using Statistical Models with Python
Aug 1, 2019 • 19 Minute Read
Introduction
Statistics provide answers to many important questions about the underlying patterns in data. Statistical models help to concisely summarize and make inferences about the relationships between variables. Predictive modeling is often incomplete without an understanding of these relationships.
In this guide, the reader will learn how to fit and analyze statistical models on quantitative (linear regression) and qualitative (logistic regression) target variables. We will be using the Statsmodels library for statistical modeling. We will begin by importing the libraries that we will be using.
Loading the Required Libraries and Modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import statsmodels.api as sm
Data
Data for Linear Regression
For building linear regression models, we will be using a fictitious dataset of loan applicants containing 600 observations and 10 variables. Out of the ten variables, we will be using the following six:
- Dependents: Number of dependents of the applicant.
- Is_graduate: Whether the applicant is a graduate ("1") or not ("0").
- Loan_amount: Loan amount (in USD) for which the application was submitted.
- Term_months: Tenure of the loan (in months).
- Age: The applicant's age in years.
- Income: Annual income of the applicant (in USD). This is the dependent variable.
Data for Logistic Regression
For building logistic regression models, we will be using the diabetes dataset which contains 768 observations and 9 variables, as described below:
- pregnancies: Number of times pregnant.
- glucose: Plasma glucose concentration.
- diastolic: Diastolic blood pressure (mm Hg).
- triceps: Skinfold thickness (mm).
- insulin: 2-Hour serum insulin (mu U/ml).
- bmi: Body mass index (weight in kg / (height in m)²).
- dpf: Diabetes pedigree function.
- age: Age in years.
- diabetes: '1' represents the presence of diabetes while '0' represents the absence of it. This is the target variable.
Linear Regression
Linear Regression models predict a continuous label. The goal is to produce a model that represents the 'best fit' to some observed data, according to an evaluation criterion we choose. Good examples are predicting the price of a house, the sales of a retail store, or the life expectancy of an individual. Linear Regression models assume a linear relationship between the independent and the dependent variables.
Let us start by loading the data. The first line of code reads in the data as a pandas dataframe, while the second line prints the shape of the data. The third line prints the first five observations of the data. We will try to predict 'Income' based on the other variables.
# Load data
df = pd.read_csv("data_smodel.csv")
print(df.shape)
df.head(5)
Output:
(600, 10)
| | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age | Sex |
|--- |---------------- |------------ |------------- |-------- |------------- |------------- |-------------- |----------------- |----- |----- |
| 0 | 0 | 0 | 0 | 362700 | 44500 | 384 | 0 | 0 | 55 | 1 |
| 1 | 0 | 3 | 0 | 244000 | 70000 | 384 | 0 | 0 | 30 | 1 |
| 2 | 1 | 0 | 0 | 286500 | 99000 | 384 | 0 | 0 | 32 | 1 |
| 3 | 0 | 0 | 1 | 285100 | 55000 | 384 | 0 | 0 | 68 | 1 |
| 4 | 0 | 0 | 1 | 320000 | 58000 | 384 | 0 | 0 | 53 | 1 |
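Before fitting any model, it can help to eyeball the relationship between the response and a candidate predictor. The snippet below is a minimal sketch using the matplotlib import from earlier; the plot itself is not shown in this guide.

# Scatter plot of the candidate predictor against the response
plt.scatter(df["Loan_amount"], df["Income"], alpha=0.5)
plt.xlabel("Loan_amount")
plt.ylabel("Income")
plt.show()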
Linear Regression - Univariate
We will start with a simple linear regression model with only one covariate, 'Loan_amount', predicting 'Income'. The lines of code below fit the univariate linear regression model and print a summary of the result.
model_lin = sm.OLS.from_formula("Income ~ Loan_amount", data=df)
result_lin = model_lin.fit()
result_lin.summary()
Output:
| Dep. Variable: | Income | R-squared: | 0.587 |
|------------------- |------------------ |--------------------- |----------- |
| Model: | OLS | Adj. R-squared: | 0.587 |
| Method: | Least Squares | F-statistic: | 851.4 |
| Date: | Fri, 26 Jul 2019 | Prob (F-statistic): | 4.60e-117 |
| Time: | 22:02:50 | Log-Likelihood: | -8670.3 |
| No. Observations: | 600 | AIC: | 1.734e+04 |
| Df Residuals: | 598 | BIC: | 1.735e+04 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>|t| | [0.025 | 0.975] |
|------------- |----------- |---------- |-------- |------- |---------- |---------- |
| Intercept | 4.618e+05 | 2.05e+04 | 22.576 | 0.000 | 4.22e+05 | 5.02e+05 |
| Loan_amount | 0.7528 | 0.026 | 29.180 | 0.000 | 0.702 | 0.803 |
| Omnibus: | 459.463 | Durbin-Watson: | 1.955 |
|---------------- |--------- |------------------- |----------- |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 10171.070 |
| Skew: | 3.186 | Prob(JB): | 0.00 |
| Kurtosis: | 22.137 | Cond. No. | 8.69e+05 |
Interpretation of the Model Coefficient and the P-value
The central section of the output, where the header begins with coef, is important for model interpretation. The fitted model implies that, when comparing two applicants whose 'Loan_amount' differs by one unit, the applicant with the higher 'Loan_amount' will, on average, have a 0.75 unit higher 'Income'. This difference is statistically significant because the p-value, shown under the column labeled ***P>|t|***, is less than the significance level of 0.05. This means there is strong evidence of a linear association between the variables 'Income' and 'Loan_amount'.
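The coefficient, p-value, and confidence interval shown in the table can also be pulled directly from the results object, which is handy when you only need a few numbers rather than the full summary. A minimal sketch, assuming the result_lin object fitted above:

# Pull individual pieces of the fitted univariate model
print(result_lin.params["Loan_amount"])           # estimated coefficient (~0.75)
print(result_lin.pvalues["Loan_amount"])          # p-value for the coefficient
print(result_lin.conf_int().loc["Loan_amount"])   # 95% confidence interval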
Interpretation of the R-squared Value
The other parameter to test the efficacy of the model is the R-squared value, which represents the percentage of variation in the dependent variable (Income) that is explained by the independent variable (Loan_amount). The higher the value, the more of the variation the model explains, with the highest possible value being one. In our case, the R-squared value of 0.587 means that 58.7% of the variation in the variable 'Income' is explained by the variable 'Loan_amount'.
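The R-squared value, along with its adjusted counterpart, is also available as an attribute of the fitted results. A short sketch, again assuming the result_lin object from above:

# R-squared and adjusted R-squared of the fitted model
print(result_lin.rsquared)       # ~0.587
print(result_lin.rsquared_adj)   # adjusted for the number of predictors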
Interpretation of the Correlation Coefficient
The Pearson correlation coefficient is another indicator of the extent and strength of the linear relationship between the two variables. The lines of code below calculate and print the correlation coefficient, which comes out to be 0.766. This indicates a strong positive correlation between the two variables, with the highest possible value being one.
cc = df[["Income", "Loan_amount"]].corr()
print(cc)
Output:
Income Loan_amount
Income 1.00000 0.76644
Loan_amount 0.76644 1.00000
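If you also want a p-value for the correlation itself, scipy.stats.pearsonr returns both the coefficient and a two-sided p-value. This is a small sketch using SciPy, which is not imported elsewhere in this guide:

# Pearson correlation with an accompanying p-value
from scipy import stats

r, p_value = stats.pearsonr(df["Income"], df["Loan_amount"])
print(r, p_value)   # r should match the 0.766 computed above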
Linear Regression - Multivariate
In the previous section, we covered simple linear regression using one variable. However, in real-world cases we will usually deal with multiple variables. This is called multivariate regression. In our case, we will build the multivariate statistical model using five independent variables.
The lines of code below fit the multivariate linear regression model and print the result summary. It is to be noted that the syntax Income ~ Loan_amount + Age + Term_months + Dependents + Is_graduate does not mean that these five variables are literally added together. Instead, it only means that these variables are included in the model as predictors of the variable 'Income'.
model_lin = sm.OLS.from_formula("Income ~ Loan_amount + Age + Term_months + Dependents + Is_graduate", data=df)
result_lin = model_lin.fit()
result_lin.summary()
Output:
| Dep. Variable: | Income | R-squared: | 0.595 |
|------------------- |------------------ |--------------------- |----------- |
| Model: | OLS | Adj. R-squared: | 0.592 |
| Method: | Least Squares | F-statistic: | 174.7 |
| Date: | Fri, 26 Jul 2019 | Prob (F-statistic): | 3.95e-114 |
| Time: | 22:04:27 | Log-Likelihood: | -8664.6 |
| No. Observations: | 600 | AIC: | 1.734e+04 |
| Df Residuals: | 594 | BIC: | 1.737e+04 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>|t| | [0.025 | 0.975] |
|------------- |----------- |---------- |-------- |------- |----------- |---------- |
| Intercept | 2.68e+05 | 1.32e+05 | 2.029 | 0.043 | 8575.090 | 5.27e+05 |
| Loan_amount | 0.7489 | 0.026 | 28.567 | 0.000 | 0.697 | 0.800 |
| Age | -856.1704 | 1265.989 | -0.676 | 0.499 | -3342.530 | 1630.189 |
| Term_months | 338.6069 | 295.449 | 1.146 | 0.252 | -241.644 | 918.858 |
| Dependents | 8437.9050 | 1.84e+04 | 0.460 | 0.646 | -2.76e+04 | 4.45e+04 |
| Is_graduate | 1.365e+05 | 4.56e+04 | 2.995 | 0.003 | 4.7e+04 | 2.26e+05 |
| Omnibus: | 460.035 | Durbin-Watson: | 1.998 |
|---------------- |--------- |------------------- |----------- |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 10641.667 |
| Skew: | 3.173 | Prob(JB): | 0.00 |
| Kurtosis: | 22.631 | Cond. No. | 5.66e+06 |
Interpretation of the Model Coefficient and the P-value
The output above shows that, when the other variables remain constant, if we compare two applicants whose 'Loan_amount' differs by one unit, the applicant with the higher 'Loan_amount' will, on average, have a 0.75 unit higher 'Income'.

Using the P>|t| result, we can infer that the variables 'Loan_amount' and 'Is_graduate' are the two statistically significant variables, as their p-values are less than 0.05.
Whenever a categorical variable is used as a covariate in a regression model, one level of the variable is omitted and is automatically given a coefficient of zero. This level is called the reference level of the covariate. In the model above, 'Is_graduate' is a categorical variable, and only the coefficient for 'Graduate' applicants is included in the regression output, while 'Not Graduate' is the reference level.
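In this dataset 'Is_graduate' is already stored as 0/1, so the formula treats it as numeric. Wrapping it in C() makes the categorical treatment, and hence the reference level, explicit. A minimal sketch on the same dataframe; the names model_cat and result_cat are just for illustration:

# Treat Is_graduate explicitly as a categorical covariate
model_cat = sm.OLS.from_formula("Income ~ Loan_amount + C(Is_graduate)", data=df)
result_cat = model_cat.fit()
print(result_cat.params)   # shows 'C(Is_graduate)[T.1]'; level 0 is the reference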
Interpretation of the R-squared Value
The R-squared value marginally increased from 0.587 to 0.595, which means that now 59.5% of the variation in 'Income' is explained by the five independent variables, as compared to 58.7% earlier. The marginal increase could be because of the inclusion of the 'Is_graduate' variable that is also statistically significant.
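Whether the extra predictors are worth keeping can also be tested formally with an F-test on the nested models. The sketch below refits both models under distinct names (result_simple and result_full are illustrative names, since the guide reuses result_lin for both fits):

# Refit both models under separate names so they can be compared
result_simple = sm.OLS.from_formula("Income ~ Loan_amount", data=df).fit()
result_full = sm.OLS.from_formula(
    "Income ~ Loan_amount + Age + Term_months + Dependents + Is_graduate", data=df).fit()

# F-test of the full model against the restricted (univariate) model
f_stat, p_value, df_diff = result_full.compare_f_test(result_simple)
print(f_stat, p_value, df_diff)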
Logistic Regression
Logistic Regression is a type of generalized linear model which is used for classification problems. The goal is to predict a categorical outcome, such as predicting whether a customer will churn or not, or whether a bank loan will default or not.
In this guide, we will be building statistical models for predicting a binary outcome, meaning an outcome that can take only two distinct values. Let us start by loading the data. The first line of code reads in the data as a pandas dataframe, while the second line prints the shape of the data. The third line prints the first five observations. We will try to predict 'diabetes' based on the other variables.
df2 = pd.read_csv("diabetes.csv")
print(df2.shape)
df2.head(5)
Output:
(768, 9)
| | pregnancies | glucose | diastolic | triceps | insulin | bmi | dpf | age | diabetes |
|--- |------------- |--------- |----------- |--------- |--------- |------ |------- |----- |---------- |
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
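Since the target here is binary, a quick check of how many observations fall in each class gives useful context for the models that follow. A small sketch on the same dataframe:

# Count how many applicants do and do not have diabetes
print(df2["diabetes"].value_counts())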
Logistic Regression - Univariate
We will start with a basic logistic regression model with only one covariate, 'age', predicting 'diabetes'. The lines of code below fit the univariate logistic regression model and print the result summary.
model = sm.GLM.from_formula("diabetes ~ age", family=sm.families.Binomial(), data=df2)
result = model.fit()
result.summary()
Output:
| Dep. Variable: | diabetes | No. Observations: | 768 |
|----------------- |------------------ |------------------- |--------- |
| Model: | GLM | Df Residuals: | 766 |
| Model Family: | Binomial | Df Model: | 1 |
| Link Function: | logit | Scale: | 1.0 |
| Method: | IRLS | Log-Likelihood: | -475.36 |
| Date: | Fri, 26 Jul 2019 | Deviance: | 950.72 |
| Time: | 22:08:35 | Pearson chi2: | 761. |
| No. Iterations: | 4 | | |
| | coef | std err | z | P>|z| | [0.025 | 0.975] |
|----------- |--------- |------- |-------- |------- |-------- |-------- |
| Intercept | -2.0475 | 0.239 | -8.572 | 0.000 | -2.516 | -1.579 |
| age | 0.0420 | 0.007 | 6.380 | 0.000 | 0.029 | 0.055 |
Using the P>|z| result above, we can conclude that the variable 'age' is an important predictor of 'diabetes', as its p-value is less than 0.05.
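The fitted GLM can also turn ages into predicted probabilities of diabetes, which is often easier to communicate than a coefficient. A sketch assuming the result object fitted above; new_ages and the ages themselves are arbitrary illustration values:

# Predicted probability of diabetes at a few illustrative ages
new_ages = pd.DataFrame({"age": [25, 45, 65]})
print(result.predict(new_ages))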
Logistic Regression - Multivariate
As with linear regression, we can also include multiple variables in a logistic regression model. Below we fit a logistic regression for 'diabetes' using all the other variables.
model = sm.GLM.from_formula("diabetes ~ age + pregnancies + glucose + triceps + diastolic + insulin + bmi + dpf", family=sm.families.Binomial(), data=df2)
result = model.fit()
result.summary()
Output:
| Dep. Variable: | diabetes | No. Observations: | 768 |
|----------------- |------------------ |------------------- |--------- |
| Model: | GLM | Df Residuals: | 759 |
| Model Family: | Binomial | Df Model: | 8 |
| Link Function: | logit | Scale: | 1.0 |
| Method: | IRLS | Log-Likelihood: | -361.72 |
| Date: | Fri, 26 Jul 2019 | Deviance: | 723.45 |
| Time: | 22:08:49 | Pearson chi2: | 836. |
| No. Iterations: | 5 | | |
| | coef | std err | z | P>|z| | [0.025 | 0.975] |
|------------- |--------- |------- |--------- |------- |-------- |-------- |
| Intercept | -8.4047 | 0.717 | -11.728 | 0.000 | -9.809 | -7.000 |
| age | 0.0149 | 0.009 | 1.593 | 0.111 | -0.003 | 0.033 |
| pregnancies | 0.1232 | 0.032 | 3.840 | 0.000 | 0.060 | 0.186 |
| glucose | 0.0352 | 0.004 | 9.481 | 0.000 | 0.028 | 0.042 |
| triceps | 0.0006 | 0.007 | 0.090 | 0.929 | -0.013 | 0.014 |
| diastolic | -0.0133 | 0.005 | -2.540 | 0.011 | -0.024 | -0.003 |
| insulin | -0.0012 | 0.001 | -1.322 | 0.186 | -0.003 | 0.001 |
| bmi | 0.0897 | 0.015 | 5.945 | 0.000 | 0.060 | 0.119 |
| dpf | 0.9452 | 0.299 | 3.160 | 0.002 | 0.359 | 1.531 |
The output above shows that adding other variables to the model leads to a big shift in the 'age' parameter: its p-value rose above the significance level of 0.05. Such shifts can happen when variables are added to or removed from a statistical model.

Looking at the p-values, the variables 'age', 'triceps', and 'insulin' seem to be insignificant predictors. All the other variables have p-values smaller than 0.05 and are therefore significant.
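The same conclusion can be read programmatically from the results object. A small sketch, assuming the multivariate result fitted above (pvals is an illustrative name; note that the Intercept also appears in the output):

# Predictors with p-values below the 0.05 significance level
pvals = result.pvalues
print(pvals[pvals < 0.05])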
The interpretation of logistic models differs in that the coefficients are understood on the logit (log-odds) scale. In simple terms, for the output above, the log odds of 'diabetes' increase by 0.09 for each unit of 'bmi', by 0.035 for each unit of 'glucose', and so on.
As with linear regression, the roles of 'bmi' and 'glucose' in the logistic regression model are additive, but here the additivity is on the scale of log odds, not odds or probabilities.
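Exponentiating the coefficients moves them from the log-odds scale to odds ratios, which are often easier to interpret. A short sketch using the numpy import from earlier (odds_ratios is an illustrative name):

# Odds ratios: exponentiate each log-odds coefficient
odds_ratios = np.exp(result.params)
print(odds_ratios)   # e.g. exp(0.0897) ~ 1.09 for each additional unit of bmi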
Conclusion
In this guide, you have learned about interpreting data using statistical models. You learned how to use the Statsmodels library to build linear and logistic models, both univariate and multivariate, and how to interpret the model output to infer relationships and determine the significant predictor variables.
To learn more about data preparation and building machine learning models using Python's 'scikit-learn' library, please refer to the following guides:
- Scikit Machine Learning
- Linear, Lasso, and Ridge Regression with scikit-learn
- Non-Linear Regression Trees with scikit-learn
- Machine Learning with Neural Networks Using scikit-learn
- Validating Machine Learning Models with scikit-learn
- Ensemble Modeling with scikit-learn
- Preparing Data for Modeling with scikit-learn
- Interpreting Data Using Descriptive Statistics with Python