TowardsMachineLearning

ANOVA Test | COVID-19 India

Introduction :-

A fact is a simple statement that everyone believes. It’s innocent, unless found guilty. A hypothesis is a novel suggestion that no one wants to believe. It’s guilty, until found effective.

Edward Teller

During the COVID-19 pandemic, researchers all over the globe are trying to develop a vaccine or a cure for the coronavirus, while doctors are doing their best to treat infected patients. Now consider a scenario where doctors have 4 medical treatments available to cure patients. Once we have the test results, one approach is to assume that the treatment which took the least time to cure the patients is the best among them.

But what if some of these patients had already partially recovered, or some other medication was already working on them? That is quite possible, since doctors try every medicine that might help a bedridden patient.

In order to make a confident and reliable decision, we will need evidence to support our approach. This is where the concept of ANOVA comes into play.

In this article, I’ll introduce you to the ANOVA test and its different types, which are used to make better decisions. The icing on the cake? I’ll demonstrate each type of ANOVA test in Python to show how it works on COVID-19 data. So let’s get going!

Note:- You must know the basics of statistics to understand this topic. Knowledge of t-tests and hypothesis testing would be an additional benefit.

What is ANOVA test:-

An Analysis of Variance test, or ANOVA, can be thought of as a generalization of the t-test to more than 2 groups. The independent t-test is used to compare the means of a condition between 2 groups. ANOVA is used when one wants to compare the means of a condition across more than 2 groups.

ANOVA tests whether there is a difference in the means somewhere in the model (that is, whether there is an overall effect), but it does not tell one where the difference is, if there is one. To find out where the difference between the groups lies, one has to conduct post-hoc tests, which are also covered in this article.

Null Hypothesis:- There is no significant difference among the groups.

Alternate Hypothesis:- There is a significant difference among the groups.

Basically, the ANOVA test compares two kinds of variation: the variation between the sample means and the variation within each of the samples. The result, the F statistic (also called the F-ratio), is the ratio of these two quantities and allows multiple groups of data to be analysed together.

The formula for the one-way ANOVA test statistic can be written as follows-

F = MSB / MSW

where MSB = SSB / (k - 1) is the mean square between groups, MSW = SSW / (N - k) is the mean square within groups, SSB and SSW are the between-group and within-group sums of squares, k is the number of groups and N is the total number of observations.
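
The F-ratio can be worked out by hand in a few lines. The following is a minimal sketch, assuming made-up recovery times (in days) for three treatment groups; the numbers and variable names are purely illustrative and do not come from the COVID-19 data.

```python
import numpy as np

# Hypothetical recovery times (in days) for three treatment groups.
groups = [
    np.array([10.0, 12.0, 11.0, 13.0]),
    np.array([14.0, 15.0, 13.0, 16.0]),
    np.array([9.0, 8.0, 10.0, 9.0]),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of the group means around the grand mean.
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of each observation around its own group mean.
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

msb = ssb / (k - 1)          # mean square between groups
msw = ssw / (n_total - k)    # mean square within groups
f_stat = msb / msw

print(f"SSB = {ssb:.2f}, SSW = {ssw:.2f}, F = {f_stat:.2f}")
```

A large F means the group means are spread out much more than the noise within the groups would explain.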

When you lay out the ANOVA table, all of the above components can be seen in it, as below-

Anova Table

In general, if the p-value associated with the F statistic is smaller than 0.05, the null hypothesis is rejected and the alternative hypothesis is supported. If the null hypothesis is rejected, one concludes that not all of the group means are equal.

Note:- If no real difference exists between the tested groups (that is, the null hypothesis holds), the ANOVA's F-ratio statistic will be close to 1.

Assumptions of ANOVA Test:-

  1. The observations are obtained independently and randomly from the population defined by the factor levels.
  2. The data of each factor level are normally distributed.
  3. Independence of cases: the sample cases should be independent of each other.
  4. Homogeneity of variance: Homogeneity means that the variance among the groups should be approximately equal.

The assumption of homogeneity of variance can be tested using Levene’s test or the Brown-Forsythe test. Normality of the distribution of the scores can be checked using histograms, Q-Q plots, the values of skewness and kurtosis, or tests such as Shapiro-Wilk or Kolmogorov-Smirnov. Whether the independence assumption holds can be determined from the design of the study.

It is important to note that ANOVA is not robust to violations of the independence assumption. That is to say, even if you violate the assumptions of homogeneity or normality, you can conduct the test and basically trust the findings; however, the results of the ANOVA are invalid if the independence assumption is violated. In general, with violations of homogeneity the analysis is considered robust if you have equal-sized groups. With violations of normality, continuing with the ANOVA is generally OK if you have a large sample size.

Types of ANOVA Test:-

  1. One-Way ANOVA:- A one-way ANOVA has just one independent variable.
    • For example, difference in Corona cases can be assessed by Country, and Country can have 2, 20, or more different categories to compare.
  2. Two-Way ANOVA:- A two-way ANOVA (also called a factorial ANOVA) refers to an ANOVA using two independent variables.
    • Expanding the example above, a two-way ANOVA can examine differences in Corona cases (the dependent variable) by Age group (independent variable 1) and Gender (independent variable 2). A two-way ANOVA can also be used to examine the interaction between the two independent variables. Interactions indicate that differences are not uniform across all categories of the independent variables.
    • For example, the Old age group may have more Corona cases overall than the Young age group, but this difference could be greater (or smaller) in Asian countries than in European countries, if region is taken as the second independent variable.
  3. N-Way ANOVA:- A researcher can also use more than two independent variables, and this is an n-way ANOVA (with n being the number of independent variables you have).
    • For example, potential differences in Corona cases can be examined by Country, Gender, Age group, Ethnicity, etc., simultaneously.
    • An n-way ANOVA should not be confused with a MANOVA: an ANOVA has a single dependent variable and gives you a single (univariate) F value, while a MANOVA analyses several dependent variables at once and gives you a multivariate F value.

With Replication VS Without Replication:-

You may frequently hear the terms “with replication” and “without replication” in connection with the ANOVA test. Let’s understand what these mean-

  1. Two-way ANOVA with replication: Each combination of the two factors contains more than one observation, for example two groups whose members are each doing more than one thing.
    • For example, let’s say a vaccine has not been developed for COVID-19, and doctors are trying two different treatments to cure two groups of COVID-19 infected patients.
  2. Two-way ANOVA without replication: Each combination of the two factors contains only one observation. It’s used, for example, when you only have one group and you’re double-testing that same group.
    • For example, let’s say a vaccine has been developed for COVID-19, and researchers are testing one set of volunteers before and after they’ve been vaccinated to see whether it works or not.

Post ANOVA Test:-

When you conduct an ANOVA, you are attempting to determine if there is a statistically significant difference among the groups. If you find that there is a difference, you will then need to examine where the group differences lie.

So basically, post-hoc tests tell the researcher which groups are different from each other.

At this point you could run post-hoc tests, which examine mean differences between the groups. There are several multiple comparison tests that control the Type I error rate, including the Bonferroni, Scheffé, Dunnett, and Tukey tests.

Now let’s understand each type of ANOVA test with some real data and explore it using Python-

One way Anova Test:-

I’ve downloaded this data from an ongoing Kaggle competition. Please feel free to make use of it.

Click here to download the data set. In this test we’ll try to analyse the relation between the population density of a region or state and the number of corona cases. So we’ll map each state according to the density of the population residing in it.

So let’s start by importing all required libraries and data.

Load data from directory-

StatewiseTestingDetails contains information about the total positive & negative cases per day in each state, whereas population_india_census2011 contains the density of each state and other population-related information.

From the above snippet of code, we see that there are a few states that have 0 corona cases in a day. So let’s check out such states.

We see that Nagaland & Sikkim have 0 corona cases in a day, while Arunachal Pradesh & Mizoram have only 1 corona case in a day.

Impute Missing values :- We’ve noticed that there are also many missing values in the ‘Positive’ column. So let’s impute the missing values with the median of Positive for each state.
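
A per-state median imputation could look like the sketch below. It assumes a tiny made-up table; only the ‘Positive’ column name comes from the text above, while the ‘State’ column name and the values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the testing data; only 'State' and 'Positive' matter here.
testing = pd.DataFrame({
    "State": ["Kerala", "Kerala", "Kerala", "Goa", "Goa", "Goa"],
    "Positive": [10.0, np.nan, 30.0, 2.0, 4.0, np.nan],
})

# For every row, look up the median of the observed values of the same state
# (missing values are skipped when the median is computed).
state_median = testing.groupby("State")["Positive"].transform("median")

# Fill only the gaps; observed values are left untouched.
testing["Positive"] = testing["Positive"].fillna(state_median)

print(testing)
```

A state whose values are all missing would stay missing with this approach and would need separate handling.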

Now we can write a function to create density group buckets according to the density of each state, where Dense1 < Dense2 < Dense3 < Dense4.
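
One possible shape for such a bucketing function is sketched here. The cutoff values, the column names and the four sample states are assumptions made for illustration; only the Dense1 to Dense4 labels are taken from the text.

```python
import pandas as pd

def density_group(density, cutoffs=(300, 600, 1000)):
    """Map a population density (people per square km) to one of four ordered buckets.

    The cutoffs are illustrative assumptions, not values from the article.
    """
    low, mid, high = cutoffs
    if density <= low:
        return "Dense1"
    if density <= mid:
        return "Dense2"
    if density <= high:
        return "Dense3"
    return "Dense4"

# Made-up densities for four hypothetical states.
states = pd.DataFrame({
    "State": ["A", "B", "C", "D"],
    "Density": [120, 450, 800, 11000],
})
states["density_Group"] = states["Density"].apply(density_group)

print(states)
```

Keeping the cutoffs as a parameter makes it easy to try a different split, for example quartiles of the observed densities.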

Now map each state to its density group. Meanwhile, we can also export this data so that we can use it in the two-way ANOVA test later.

Let’s subset and rearrange our dataset so that we can use it for our ANOVA test.

One of the ANOVA test’s assumptions is that samples are randomly selected and approximately Gaussian. So let’s select 10 random samples from each factor level.
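
Drawing the same number of random rows from every group can be done with a grouped sample. This sketch uses synthetic data; the column names are assumptions, and `random_state` is fixed only so that the draw is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: 40 rows in each of four density groups.
df = pd.DataFrame({
    "density_Group": np.repeat(["Dense1", "Dense2", "Dense3", "Dense4"], 40),
    "Positive": rng.integers(1, 500, size=160),
})

# Draw 10 rows at random, without replacement, from every group.
sample_df = df.groupby("density_Group").sample(n=10, random_state=1)

print(sample_df["density_Group"].value_counts())
```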

Let’s plot the density distribution of the number of Corona cases to check how it is distributed across the different density groups.

We clearly see that the data doesn’t follow a Gaussian distribution.

There are different data transformation methods available to bring data closer to a Gaussian distribution. We’ll go ahead with the Box-Cox transformation.
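
As a rough illustration of what the transformation does, the sketch below applies `scipy.stats.boxcox` to synthetic right-skewed values (not the actual case counts) and compares the skewness before and after.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Right-skewed, strictly positive synthetic values (Box-Cox requires values > 0).
cases = rng.lognormal(mean=3.0, sigma=1.0, size=200)

# With no lambda supplied, boxcox estimates it by maximum likelihood and
# returns both the transformed data and the fitted lambda.
transformed, fitted_lambda = stats.boxcox(cases)

print("skewness before:", stats.skew(cases))
print("skewness after: ", stats.skew(transformed))
print("fitted lambda:  ", fitted_lambda)
```

Because Box-Cox is only defined for positive values, counts that contain zeros need a small shift (for example adding 1) before the transformation.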

Now let’s plot their distribution once again to check-

Approach 1. One-Way ANOVA Test using scipy.stats module:-

There are a couple of ways to perform the ANOVA test in Python. One is the stats.f_oneway() method from SciPy.

We see that the p-value is < 0.05. Hence we can reject the null hypothesis that there is no difference among the different density groups.
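
A call to `stats.f_oneway()` might look like the following sketch. The four groups are synthetic, with the last one deliberately shifted upward, so the numbers it prints are not the article's results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic (already transformed) values for four density groups;
# the fourth group is given a clearly higher mean on purpose.
dense1 = rng.normal(5.0, 1.0, size=10)
dense2 = rng.normal(5.0, 1.0, size=10)
dense3 = rng.normal(5.5, 1.0, size=10)
dense4 = rng.normal(9.0, 1.0, size=10)

# f_oneway takes one array per group and returns the F statistic and its p-value.
f_stat, p_value = stats.f_oneway(dense1, dense2, dense3, dense4)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: at least one group mean differs.")
else:
    print("Fail to reject H0.")
```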

Approach 2. One-Way ANOVA Test using ols Model:-

As we know, in regression we can regress the target variable on each input variable and check its influence. So we’ll follow the same approach that we follow in linear regression.

From the above output, we see that the p-value is less than 0.05. Hence we can reject the null hypothesis that there is no difference among the different density groups.

The F-statistic = 5.817 and the p-value = 0.002 indicate that there is an overall significant effect of density_Group on corona positive cases. However, we don’t yet know where the difference between the density groups lies.

So, based on the p-value, we can reject H0, which states that there is no significant difference in the number of corona cases across areas of different density.

Post Hoc Tests:-

When you conduct an ANOVA, you are attempting to determine if there is a statistically significant difference among the groups.

So what if you find statistical significance?

If you find that there is a difference, you will then need to examine where the group differences lie. So we’ll use the Tukey HSD test to identify where the difference lies.

The Tukey HSD test clearly shows that there’s a significant difference between Group1-Group3, Group1-Group4, Group2-Group3 and Group3-Group4.

So the above result from Tukey HSD suggests that, except for the groups mentioned, all other pairwise comparisons of the number of Corona cases fail to reject the null hypothesis, indicating no statistically significant differences.
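
A Tukey HSD run with statsmodels could be sketched as follows. The data is synthetic, with only Dense4 shifted away from the others, so the pairs flagged here are not the ones reported above.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(11)

# Synthetic data: Dense4 is shifted well away from the other three groups.
group_means = {"Dense1": 5.0, "Dense2": 5.0, "Dense3": 5.0, "Dense4": 12.0}
df = pd.DataFrame({
    "density_Group": np.repeat(list(group_means), 15),
    "Positive": np.concatenate(
        [rng.normal(m, 1.0, size=15) for m in group_means.values()]
    ),
})

# One row per pair of groups; reject=True marks a significant mean difference.
tukey = pairwise_tukeyhsd(endog=df["Positive"], groups=df["density_Group"], alpha=0.05)
print(tukey.summary())
```

In the summary, a pair with `reject` equal to True differs significantly at the chosen alpha, while False means the null hypothesis of equal means is not rejected for that pair.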

Assumption Checks/Model Diagnostics

Normal Distribution Assumption check:-

When working with linear regression and ANOVA models, the assumptions pertain to the residuals and not the variables themselves.

Method 1. Shapiro-Wilk test :-

From the above snippet of code, we see that the p-value is > 0.05 for all density groups. Hence we cannot reject normality, and we can treat them as following a Gaussian distribution.
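
A per-group Shapiro-Wilk check can be sketched like this. The data is synthetic and the column names are assumptions; the same call could equally be applied to the residuals of a fitted model.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(5)

# Synthetic data: 10 normally distributed values per density group.
df = pd.DataFrame({
    "density_Group": np.repeat(["Dense1", "Dense2", "Dense3", "Dense4"], 10),
    "Positive": rng.normal(5.0, 1.0, size=40),
})

# Run the Shapiro-Wilk test separately on each group; element [1] is the p-value.
shapiro_p = {
    name: stats.shapiro(values)[1]
    for name, values in df.groupby("density_Group")["Positive"]
}

for name, p in shapiro_p.items():
    verdict = "no evidence against normality" if p > 0.05 else "normality rejected"
    print(f"{name}: p = {p:.3f} ({verdict})")
```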

Method 2. Q-Q plot test :-

We can use a normal Q-Q plot to test this assumption.

From the above figure, we see that all data points lie close to the 45 degree line, and hence we can conclude that the data follows a normal distribution.
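
The numbers behind a normal Q-Q plot can be obtained with `scipy.stats.probplot`. This sketch uses synthetic values as a stand-in for the model residuals and prints the fitted line instead of drawing the figure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Stand-in for model residuals; in practice these would come from the fitted model.
residuals = rng.normal(0.0, 1.0, size=100)

# probplot pairs theoretical normal quantiles with the ordered sample values and
# fits a least-squares line through them; r is the correlation of that fit.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)

print(f"Q-Q line: slope = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.3f}")
```

Passing `plot=plt` (with `matplotlib.pyplot` imported as `plt`) to `probplot` draws the points and the fitted line; an r close to 1 corresponds to points hugging the line.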

Homogeneity of Variance Assumption check: –

The homogeneity of variance assumption should be checked for each level of the categorical variable. We can use Levene’s test to test for equal variances between groups.

We see that the p-value is > 0.05 for all density groups. Hence we cannot reject the hypothesis of equal variances, and we can treat the groups as having equal variances.
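
Levene’s test is available as `scipy.stats.levene`. The sketch below runs it on four synthetic groups drawn with the same standard deviation; the group values are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(21)

# Four synthetic groups with different means but the same standard deviation.
groups = [rng.normal(loc, 1.0, size=10) for loc in (5.0, 5.5, 6.0, 6.5)]

# H0 of Levene's test: all groups have equal variances.
stat, p_value = stats.levene(*groups)
print(f"Levene statistic = {stat:.3f}, p = {p_value:.3f}")
```

By default SciPy centres each group on its median, which is the Brown-Forsythe variant; passing `center="mean"` gives the original Levene test.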

Now let’s get started with the Two-Way ANOVA Test.

Two Way ANOVA Test:-

Again, using the same dataset, we’ll try to understand whether there’s any significant relationship between the density of a region or state, the age of people and the number of corona cases. So we’ll map each state according to the density of the population residing in it.

So let’s start by importing all required libraries and data.

Let’s understand the data and check whether there’s any ambiguity in it.

From the above snippet of code, we can see that there’s no record of any infected infant.

Check for missing values in our data.

We see that more than 91% and 80% of the entries are missing in the age & gender columns respectively. So we need to devise a method to impute them.

So I’ll impute age with the median value in each state, and gender with the mode (male or female) in each state. For that, I’ll calculate the median and the mode per state.
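
The per-state median and mode imputation could be sketched as below. The records and the column names (`state`, `age`, `gender`) are made up for illustration and may not match the actual dataset.

```python
import numpy as np
import pandas as pd

# Made-up patient records; column names are assumptions for illustration.
patients = pd.DataFrame({
    "state": ["Kerala"] * 4 + ["Goa"] * 4,
    "age": [30.0, np.nan, 50.0, np.nan, 20.0, 22.0, np.nan, 60.0],
    "gender": ["M", None, "M", "F", "F", "F", None, "M"],
})

# Age: fill gaps with the median age observed in the same state.
patients["age"] = patients["age"].fillna(
    patients.groupby("state")["age"].transform("median")
)

# Gender: fill gaps with the most frequent value observed in the same state.
state_mode = patients.groupby("state")["gender"].agg(lambda s: s.mode().iloc[0])
patients["gender"] = patients["gender"].fillna(patients["state"].map(state_mode))

print(patients)
```

The mode lookup assumes every state has at least one recorded gender; a state with none would need a fallback such as the overall mode. With 80 to 90% of the values missing, the imputed columns mostly reflect the state-level summary rather than individual patients, which is worth keeping in mind when reading the results.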

Now let’s merge the individualDetails & stateDensity dataframes to create an overall dataset.

Now we can create the age group buckets.
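
Age buckets can be built with `pandas.cut`. The bin edges below are illustrative assumptions; only the Young, Adult and Old labels are suggested by the rest of the article.

```python
import pandas as pd

ages = pd.Series([5, 17, 18, 35, 59, 60, 82], name="age")

# Bin edges are illustrative assumptions: [0, 18) Young, [18, 60) Adult, [60, 120) Old.
age_group = pd.cut(
    ages,
    bins=[0, 18, 60, 120],
    labels=["Young", "Adult", "Old"],
    right=False,  # intervals are closed on the left and open on the right
)

print(pd.concat([ages, age_group.rename("age_Group")], axis=1))
```

Ages outside the outermost edges (here, 120 or above) come back as missing, so the edges should cover the full range of the data.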

Now merge the data to get a dataset with each person mapped to their age group and their respective state density group.

Now check the distribution of the Count column and check whether there are outliers present in our data using a box plot.

We see that there are many outliers present in our data, and the distribution of the Count variable is not Gaussian either. So we’ll employ the Box-Cox transformation to handle the situation.

Now let’s employ the OLS model to check our Hypothesis.

From the above Q-Q plot, we can see that the residuals are almost normally distributed, if the points at the extreme ends are discounted. Hence we can conclude that the model satisfies the normality assumption of the ANOVA test.

Interpretation of results:-

Interpretation: The p-values obtained from the ANOVA analysis for density group, age group and their interaction are all statistically significant (P<0.05). We conclude that density_Group significantly affects the number of corona cases, age_Group significantly affects the number of corona cases, and the interaction of age_Group and density_Group significantly affects the number of corona cases.

Post Hoc Test:-

Now let’s identify which groups are statistically different. We’ll use the Tukey HSD method for it.

From the above Tukey HSD test results, we can clearly see that there’s a significant difference between Group1-Group3 and Group1-Group4 among the density groups, and between Young-Adult and Young-Old among the age groups.

So the above result from Tukey HSD suggests that, except for the groups mentioned, all other pairwise comparisons of the number of Corona cases fail to reject the null hypothesis, indicating no statistically significant differences.

End Notes: –

I’ve tried my best to explain the ANOVA test as simply as possible. Please feel free to post your queries in the comments; I’ll be more than happy to answer. You can clone my GitHub repository to download the whole code & data, click here!!

Article Credit:-

Name:  Praveen Kumar Anwla
6+ years of work experience
