# Statistics

__________

### P-Value

- Statistical significance is detected when the observed p-value from the test statistic is less than the level of significance (i.e. alpha).
- Meaning, there is a very low probability of observing by random chance something as extreme or more extreme than what was observed, under the assumption that the null hypothesis is true.

```
P-Value <  alpha (e.g. 0.05) = Reject null hypothesis;
P-Value >= alpha (e.g. 0.05) = Fail to reject null hypothesis;
```

![image](../assets/pvalue.png)

**Notes:**

- If we reject the null hypothesis, the result is statistically significant at the chosen alpha. This does not by itself mean the effect is large or practically important.

### Z-Test

- The one-sample z-test is used to test whether the mean of a population is greater than, less than, or not equal to a specific value.
- For large samples (50 or more observations), the z-test and the t-test give nearly identical results, because the t-distribution approaches the normal distribution as the sample size grows.

**Requirements for the Z-test:**

- The mean and standard deviation of the population distribution are known
- The mean of the sample is known
- The variance of the sample is assumed to be the same as that of the population
- The population is assumed to be normally distributed

![image](../assets/ztest.png)

### T-Test

- Statistical test that is used to compare means: a sample mean against a reference value (one-sample), or the means of two groups (two-sample).
- We perform a **One-Sample t-test** when we want to compare a sample mean with the population mean.
- The difference from the Z-Test is that the population variance is not known here.
- We use the sample standard deviation instead of the population standard deviation in this case.

**One-Sample T-Test:**

**Assumptions**

- Samples are drawn from a normally distributed population.
- The observations in the sample are independent of one another.
- When the original (population) distribution is not normal, the one-sample t-test is still valid with a large enough sample size.
- That is, the one-sample t-test is robust to the normality assumption when the sample size is large enough.

**Rules of Thumb in Evaluating Assumptions:**

- If sample sizes are the same and sufficiently large, the t-tools are valid since they are robust to the violation of normality.
- If the two populations have the same standard deviation, then the t-tests are valid given sufficient sample sizes.
- If the standard deviations are different and the sample sizes are different, then the t-tools are not valid and another procedure (such as Welch's t-test) should be used.

**Transformations**

- If assumptions are not met, consider transforming the data, for example with a logarithmic transformation.

### Choosing a Hypothesis Test

![image](../assets/hypothesis.png)

### Statistical Power

- A power analysis tells us how many samples we need to collect to reach a desired level of statistical power.
- It tells you how many trials you need in order to avoid incorrectly failing to reject a false null hypothesis (a Type II error).

```
Type II Error (beta) = fail to reject a false null hypothesis.
Power = 1 - beta
```

- `Power` is the probability of not making a Type II error.
- `Power` = the probability that we correctly reject the null hypothesis (e.g. small p-value), or "the probability of rejecting a null hypothesis when it is false".
- `Low Power` = when there is a lot of overlap between the two distributions and we have a small sample size, we have low power.
- When we have a lot of power, there is a higher probability that we will correctly reject the null hypothesis.

![image](../assets/power.png)

### ANOVA

- Statistical technique that is used to check if the means of two or more groups are significantly different from each other.

**Assumptions:**

- **Normality**:
    - As with the t-tests, ANOVA is robust to this assumption.
    - Extremely long-tailed distributions (outliers) or skewed distributions coupled with different sample sizes (especially when the sample sizes are small) present the only serious distributional problems.
- **Equal Standard Deviations:**
    - This assumption is crucial.
- The assumptions of independence within and across groups are critical.
    - If independence is lacking, a different analysis should be attempted.

### F-Test

- Hypothesis test used to check for evidence of unequal variances between groups.

### Central Limit Theorem

- The distribution of sample means (x̄) will, as the sample size increases, approach a normal distribution.
- The mean of the sample means is the population mean µ.
- The standard deviation of the distribution of sample means is sigma/sqrt(n).

### Confidence Interval

- A 95% confidence interval means that if we were to take 100 different samples and compute a 95% confidence interval for each sample, then approximately 95 of the 100 confidence intervals will contain the true mean value.

```
CI = mean ± Z * sigma / sqrt(n)
```

## Multicollinearity

- Occurs whenever an independent variable is highly correlated with one or more of the other independent variables in a multiple regression equation.
- `Multicollinearity` can also be detected with the help of tolerance and its reciprocal, called the variance inflation factor (VIF). A `VIF >= 10` is commonly treated as problematic.

**Problem:**

- An independent variable that is very highly correlated with one or more other independent variables will have a relatively large standard error.
- This implies that the partial regression coefficient is unstable and will vary greatly from one sample to the next.
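
**Example:**

To make the tolerance/VIF relationship concrete, here is a minimal sketch using only NumPy. It assumes the usual definition: regress each independent variable on the others, take tolerance = 1 - R², and VIF = 1 / tolerance. The function name `variance_inflation_factors` and the synthetic data are illustrative assumptions, not part of any library.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows are observations).

    Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
    regressing column j on all other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])    # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares
        resid = y - A @ coef
        tolerance = (resid @ resid) / ((y - y.mean()) ** 2).sum()  # 1 - R^2
        vifs.append(1.0 / tolerance)
    return np.array(vifs)

# Synthetic data (an assumption for illustration): x3 is nearly a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.05 * rng.normal(size=200)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)  # x1 and x3 should be far above 10, x2 close to 1
```

In this sketch the two nearly identical variables both get a large VIF, while the unrelated one stays near 1, which matches the `VIF >= 10` rule of thumb above.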