Many students new to statistics struggle to understand population parameters, confidence intervals, hypothesis testing, and statistical inference. I usually try to help them by drawing the above picture on a whiteboard and discussing the data their organization collects. I explain that the data they usually collect is a sample, and how that sample differs from the population from which it is taken.

The population is the complete set of the things we are interested in. It can be the SAT scores of all high school senior girls in the USA in 2019, or the number of field goals scored in all college football games this past fall. It can also be a certain dimensional characteristic on a part number your company has made for the past ten years, or the number of patients your organization has admitted since its inception. The population is characterized by what are called population parameters. These include the mean, μ (*mu*), and the standard deviation, σ (*sigma*).

A sample, on the other hand, is a subset of a population. Looking at the previous examples, a sample of SAT scores would be the scores of those senior girls in your local high school who took the exam. Other samples would be the number of field goals scored by your favorite college football team, the dimensional data you take during the day on the part number your company makes, or the number of patients admitted so far in 2020 by your organization. Samples are characterized by sample statistics, such as the sample mean, x-bar, and the sample standard deviation, s.

Taking sample data is relatively easy. You observe a process and measure the characteristic you’re interested in. Determining the population parameters directly, however, is impractical at best and often impossible. To overcome this dilemma, we calculate statistics from the sample data (x-bar, s, etc.) and use them to estimate the population parameters (μ, σ, etc.). Drawing conclusions about the characteristics of the population from a sample in this way is called statistical inference.
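To make this concrete, here is a minimal Python sketch of calculating sample statistics. The eight measurements are hypothetical values I made up for illustration, not data from a real process, and the variable names are my own.

```python
import statistics

# Hypothetical sample: eight measurements of a part dimension, in millimeters.
measurements = [10.02, 9.98, 10.05, 9.97, 10.01, 10.03, 9.99, 10.00]

# Sample mean (x-bar): our point estimate of the population mean, mu.
x_bar = statistics.mean(measurements)

# Sample standard deviation (s): uses the n - 1 divisor and serves as
# our estimate of the population standard deviation, sigma.
s = statistics.stdev(measurements)

print(f"x-bar = {x_bar:.4f}")
print(f"s     = {s:.4f}")
```

Note that `statistics.stdev` uses the n - 1 divisor, which is the usual choice when the data is a sample; `statistics.pstdev` divides by n and is meant for data that makes up the whole population.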

One of the difficulties of using samples to represent populations is the selection of the sample. We want the sample to truly represent the population so we can generalize our findings and claim that the population will perform like the sample. If the sample has the same characteristics as the population, we have a representative sample. If the characteristics of the sample are different, then any findings based on the sample could be biased and not generalizable to the population.

In many cases, it is almost impossible to obtain a truly representative sample, where every characteristic of the sample matches the corresponding population characteristic. In these cases a random sample is taken, where each member of the population has an equal chance of being selected.
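As a rough illustration, a simple random sample can be drawn in Python with the standard library. The population of 1,000 serial numbers, the sample size of 50, and the helper name `draw_simple_random_sample` are all assumptions I chose for this sketch.

```python
import random

def draw_simple_random_sample(population, n, seed=None):
    """Return n members drawn without replacement, each equally likely."""
    rng = random.Random(seed)  # a seed is optional; it only makes the draw repeatable
    return rng.sample(population, n)

# Hypothetical population: serial numbers 1 through 1000.
population = list(range(1, 1001))

# Draw 50 serial numbers; every member of the population has the same
# chance of ending up in the sample.
sample = draw_simple_random_sample(population, n=50, seed=2020)

print(sorted(sample)[:10])
```

The fixed seed is there only so the example gives the same sample every time it runs. In practice you would usually leave it out.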

A single statistic calculated from a sample, such as the mean or the sample standard deviation, is called a point estimate. In most circumstances, it is possible to supplement the point estimate with a statement about its uncertainty. That statement is called a confidence interval, and its two endpoints are the confidence limits. As an example, if the mean of a sample is x-bar, we can say that the mean of the population, μ, lies between two calculated values with some stated confidence level.
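Here is one way the two confidence limits might be calculated in Python. This is a sketch under stated assumptions: it uses the large-sample normal approximation, x-bar ± z·s/√n, which is only one of several methods, and the 40 measurements are simulated rather than real. The function name `mean_confidence_interval` is my own. For small samples, a critical value from the t-distribution is normally used instead and gives a somewhat wider interval.

```python
import math
import random
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence limits for the population mean.

    Assumes the normal approximation: x-bar +/- z * s / sqrt(n).
    """
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Hypothetical data: 40 simulated measurements (not real process data).
rng = random.Random(1)
sample = [rng.gauss(10.0, 0.03) for _ in range(40)]

lower, upper = mean_confidence_interval(sample, confidence=0.95)
print(f"x-bar = {statistics.mean(sample):.4f}")
print(f"95% confidence limits: {lower:.4f} to {upper:.4f}")
```

Asking for a higher confidence level, say 0.99, widens the interval, while a larger sample narrows it.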

I will discuss statistical inference further in future blog articles.