This time I thought I would write about “Inferential Statistics”.
This is a very interesting topic that every aspiring Data Scientist should know about.
The ability to estimate population parameters (say mean, standard deviation, variance, etc.) or to test hypotheses about population parameters using sample statistics is one of the main applications of statistics in improving business decision making. Whether estimating parameters or testing hypotheses about parameters, the inferential process consists of taking a random sample from a group or body (the population), analyzing data from the sample, and reaching conclusions about the population using sample data.
But before diving into the main concepts, let us understand some basic terminology:
- Parameter: a descriptive measure of the population is called a “parameter”. Parameters are generally denoted by Greek letters. Examples of parameters are the population mean (μ), population variance (σ²), and population standard deviation (σ).
- Statistic: a descriptive measure of the sample is called a “statistic”. Statistics are usually denoted by Roman letters. Examples of statistics are the sample mean (x̄), sample variance (s²), and sample standard deviation (s).
- z-score: a z-score represents the number of standard deviations a value (x) is above or below the mean of a set of numbers when the data is normally distributed. Using z-scores allows us to translate a value’s raw distance from the mean into units of standard deviations. Remember that z-scores are also used for standardization when the data are measured on different scales or units. For example, consider the following table:
| | SAT | ACT |
|---|---|---|
| Mean | 1500 | 21 |
| Standard deviation | 300 | 5 |
The above table shows the mean and standard deviation for total scores on the SAT and ACT examinations. The distributions of SAT and ACT scores are both normal.
Suppose Ann scored 1800 on her SAT and Tom scored 24 on his ACT. Who performed better?
Since the scores of Ann and Tom come from two different distributions, we cannot compare them as they are; we need some common ground for comparison. The z-score helps us here. Let us look at the formula of the z-score first:
z = (x − μ) / σ
z-score of Ann = (1800-1500)/300 = 1
z-score of Tom = (24-21)/5 = 0.6
Here we can see that Ann is 1 s.d. above the average on the SAT and Tom is 0.6 s.d. above the average on the ACT. Relative to the other test takers, Ann did better than Tom, so her score was better.
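The same comparison can be sketched in a few lines of Python. The helper name `z_score` is just an illustrative choice, and the numbers are the ones from the SAT/ACT table above.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# Figures taken from the SAT/ACT table
ann_z = z_score(1800, mean=1500, std_dev=300)  # Ann's SAT score
tom_z = z_score(24, mean=21, std_dev=5)        # Tom's ACT score

print(f"Ann: {ann_z:.1f}, Tom: {tom_z:.1f}")

# The larger z-score is the better result relative to the other test takers
better = "Ann" if ann_z > tom_z else "Tom"
print(better, "performed better relative to the other test takers")
```

Because both scores are now expressed in standard deviations, they sit on the same scale and can be compared directly.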
- Point estimate: a point estimate is a statistic (mean, median, s.d., etc.) from a sample that is used to estimate a population parameter. A point estimate is only as good as the representativeness of its sample. If other random samples are taken from the population, the point estimates derived from those random samples would vary. Because of this variation in sample statistics, estimating a population parameter with an interval estimate is often preferable to using a point estimate. That interval is called a confidence interval.
- Confidence Interval: it is a range of values within which the analyst can declare, with some confidence, that the population parameter lies. Using only a point estimate is like fishing in a murky lake with a spear, and using a confidence interval is like fishing with a net. We can throw a spear where we saw a fish, but we will probably miss. On the other hand, if we toss a net in that area, we have a good chance of catching the fish. Sometime later in your study you might hear the term “95% confidence interval”. What it actually means is that if you drew 100 random samples from a population and built a confidence interval from each sample, then about 95 of those intervals would contain the actual mean μ.
- Standard error (SE): the standard deviation associated with an estimate is called the standard error. It describes the typical error or uncertainty associated with the estimate. Computing SE for the sample mean: given n independent observations from a population with standard deviation σ, the standard error of the sample mean is equal to SE = σ/sqrt(n). A reliable method to ensure sample observations are independent is to conduct a simple random sample consisting of less than 10% of the population. However, there is one subtle issue in the above equation: the population standard deviation is usually unknown. But we can use the sample standard deviation (s) as a point estimate of σ when the sample size is at least 30 and the population distribution is not skewed.
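To make the standard error and the “95% of intervals” idea more concrete, here is a small simulation sketch in Python. Everything in it is an assumption for illustration: the population is taken to be normal with μ = 1500 and σ = 300 (borrowed from the SAT example), the sample size is 36, and σ is treated as known. It uses only the standard library.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

MU, SIGMA = 1500, 300  # hypothetical population parameters
N = 36                 # size of each random sample
TRIALS = 2000          # how many samples we draw

se = SIGMA / math.sqrt(N)                        # SE = sigma / sqrt(n)
z_crit = statistics.NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval

sample_means = []
covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = statistics.fmean(sample)
    sample_means.append(x_bar)
    # Build a 95% interval around this sample's mean and check if it contains mu
    if x_bar - z_crit * se <= MU <= x_bar + z_crit * se:
        covered += 1

print(f"theoretical SE: {se:.1f}")
print(f"spread of the sample means: {statistics.stdev(sample_means):.1f}")
print(f"share of intervals containing mu: {covered / TRIALS:.3f}")
```

The standard deviation of the simulated sample means should land close to σ/sqrt(n), and the share of intervals that contain μ should come out near 0.95, which is what the “95%” in a 95% confidence interval refers to.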
Now let us look at the following chart to understand which statistic to use in which scenario:
From the diagram we can see that when the population standard deviation is known, we can use the z statistic to estimate the confidence interval for the population mean; otherwise we can use the t statistic.
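For the known standard deviation case, a minimal Python sketch of a z-based interval might look like the one below. The function name `z_confidence_interval`, the sample mean of 1520, and n = 36 are made-up values for illustration; σ = 300 is borrowed from the SAT table. The t-based case would follow the same shape but needs a critical value from the t distribution, which the standard library does not provide.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the population mean when sigma is known."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * sigma / math.sqrt(n)  # z * SE
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: mean 1520 from 36 observations, sigma assumed known
low, high = z_confidence_interval(sample_mean=1520, sigma=300, n=36)
print(f"95% CI: ({low:.1f}, {high:.1f})")
```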
In this post I am sticking to the basics. In the next post, I will discuss in detail how to compute the confidence interval for the above-mentioned scenarios.
Till then, happy learning!
References : Applied Business Statistics by Ken Black