7. Data Analysis I
7.3 Collecting Data
Learning Objectives
By the end of this section it is expected that you will be able to:
- State whether data is quantitative or qualitative
- Describe the sampling methods: simple random sampling, systematic sampling and cluster sampling (random methods), and convenience sampling (a non-random method)
- Discuss potential problems that might arise when sampling from a population
Populations and Samples
In statistics, we generally want to study a population. You can think of a population as a collection of persons, things, or objects under study. It is often not feasible or possible to study the entire population. Instead we can select a sample. The idea of sampling is to select a portion (or subset) of the larger population and study that portion (the sample) to gain information about the population. Data are the result of sampling from a population.
Because it takes a lot of time and money to examine an entire population, sampling is a very practical technique. If you wished to compute the overall grade point average at your school, it would make sense to select a sample of students who attend the school. The data collected from the sample would be the students’ grade point averages. In elections, opinion poll samples of 1,000–2,000 people are taken. The opinion poll is supposed to represent the views of the people in the entire country.
Types of Data
Most data can be categorized as qualitative or quantitative.
Qualitative data are the result of categorizing or describing attributes of a population using our senses such as sight or touch. Hair color, blood type, ethnic group, the car model that a person drives, and the street a person lives on are examples of qualitative data. Qualitative data are generally described by words or letters. For instance, hair color might be black, dark brown, light brown, blonde, gray, or red. Blood type might be AB+, O-, or B+.
Quantitative data are always numbers. Quantitative data are the result of counting or measuring attributes of a population. Amount of money, pulse rate, weight, number of people living in your town, and number of students who take statistics are examples of quantitative data. Researchers often prefer to use quantitative data over qualitative data because it lends itself more easily to mathematical analysis. For example, it does not make sense to find an average hair color or median blood type.
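The difference can be illustrated with a short Python sketch. The pulse rates and hair colors below are made-up values, not data from this section, and `mean_is_defined` is a hypothetical helper written only for this illustration. An average can be computed for the quantitative list, while for the qualitative list only a summary such as the most common category (the mode) is meaningful.

```python
import statistics

# Hypothetical quantitative data: pulse rates in beats per minute
pulse_rates = [62, 71, 68, 80, 74]

# Hypothetical qualitative data: hair colors
hair_colors = ["black", "red", "black", "blonde", "gray"]

mean_pulse = statistics.mean(pulse_rates)    # arithmetic works on numbers
common_color = statistics.mode(hair_colors)  # most frequent category


def mean_is_defined(values):
    """Return True if an average can be computed for the values."""
    try:
        statistics.mean(values)
        return True
    except TypeError:
        # Words such as "black" cannot be added or divided
        return False


print(mean_pulse, common_color)
print(mean_is_defined(pulse_rates), mean_is_defined(hair_colors))
```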
EXAMPLE 1
Consider a high school math class and a sample of five students’ backpacks. Determine whether the data is quantitative or qualitative.
1. One data set is the number of books students carry in their backpacks. Two students carry three books, one student carries four books, one student carries two books, and one student carries one book.
2. For the sample of five backpacks you weigh the backpacks and contents. The weights (in kilograms) of their backpacks are 3.2, 5, 4.8, 5.1, 2.3.
3. For the sample of five students you record the colour of the backpacks. The backpacks are red, blue or black.
Solution:
- This is quantitative data.
- This is quantitative data.
- This is qualitative data.
TRY IT 1
Determine the correct data type (quantitative or qualitative).
- the number of pairs of shoes you own
- the colour of vehicle you drive
- the distance it is from your home to the nearest grocery store
- the number of classes you take per school year.
- the model of calculator you use
- weights of sumo wrestlers
- total number of correct answers on a quiz
- IQ scores
Show answer
Items a, c, d, f, g and h are quantitative; items b and e are qualitative.
It is often possible to assign both qualitative and quantitative measures to one set of data.
EXAMPLE 2
You go to the supermarket and purchase three cans of soup (350 ml tomato, 400 ml lentil, and 250 ml chicken noodle), four different kinds of vegetables (broccoli, cauliflower, spinach, and carrots), and two containers of ice cream (pistachio ice cream and vanilla ice cream).
Name the data sets that are qualitative.
Solution
The types of soups, vegetables and desserts are qualitative data because they are categorical. They are not measured or counted.
TRY IT 2
You go to the supermarket and purchase three cans of soup (350 ml tomato, 400 ml lentil, and 250 ml chicken noodle), four different kinds of vegetables (broccoli, cauliflower, spinach, and carrots), and two containers of ice cream (pistachio ice cream and vanilla ice cream).
Name the data sets that are quantitative.
Show answer
The three cans of soup, four kinds of vegetables and two ice creams are quantitative data because you count them. The volumes of the soups (350 ml, 400 ml and 250 ml) are quantitative because they are measured.
Sampling
Gathering information about an entire population often costs too much or is virtually impossible. Instead, we use a sample of the population. A sample should have the same characteristics as the population it is representing. There are several different methods of sampling. This section will describe four of the most common methods: three random methods and one non-random method. In each form of random sampling, each member of a population initially has an equal chance of being selected for the sample.
Simple Random Sampling
The easiest method to describe is called a simple random sample. Any group of ‘n’ individuals is equally likely to be chosen as any other group of ‘n’ individuals if the simple random sampling technique is used. In other words, each sample of the same size has an equal chance of being selected.
For example, suppose Lisa wants to form a four-person study group (herself and three other people) from her pre-calculus class, which has 31 members not including Lisa. To choose a simple random sample of size three from the other members of her class, Lisa could put all 31 names in a hat, shake the hat, close her eyes, and pick out three names. An alternative is for Lisa to alphabetically list the last names of the members of her class and number each with a two-digit number 01, 02, 03, 04, 05, 06, …, 31. Lisa can use a table of random numbers (found in many statistics books), a calculator, or a computer to generate random numbers.
EXAMPLE 3
How can Lisa determine three group mates from a numbered list of 31 students?
Solution
Lisa can generate random numbers from a calculator.
The calculator generates the first seven random numbers as follows: 0.943, 0.230, 0.046, 0.514, 0.405, 0.733, 0.983.
Lisa reads two-digit groups until she has chosen three class members. Each random number may only contribute one class member.
The first random number 0.943 is read as the numbers 94 and 43. Neither of these corresponds to the students’ assigned numbers (01 to 31).
The random number 0.230 is read as 23 and 30. Although both of these numbers correspond to a student, only the first number, 23, will be used. The first student will be number 23.
The random number 0.046 is read as 04 and 46 which corresponds to student 04. The second student will be student number 4.
The third student will correspond to the number 14 which is read from the random number 0.514 (since there is no student numbered 51).
The three names that correspond to the two-digit numbers 23, 04 and 14 will form Lisa’s group. If she needed to, Lisa could have generated more random numbers.
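The reading procedure used in this solution can be sketched in Python. The function name `pick_students` and its parameters are hypothetical, chosen only for this illustration, and the rule that a student already chosen is skipped is an added assumption (the solution above never meets a repeat).

```python
def pick_students(random_numbers, class_size, group_size):
    """Read two-digit groups from three-decimal random numbers.

    Each random number contributes at most one student: the first of its
    two-digit groups that matches an unused student number is taken.
    """
    chosen = []
    for value in random_numbers:
        digits = f"{value:.3f}"[2:]                      # 0.943 -> "943"
        candidates = [int(digits[:2]), int(digits[1:])]  # 94 and 43
        for number in candidates:
            # Assumption: a student who was already picked is skipped
            if 1 <= number <= class_size and number not in chosen:
                chosen.append(number)
                break                                    # one student per random number
        if len(chosen) == group_size:
            break
    return chosen


calculator_output = [0.943, 0.230, 0.046, 0.514, 0.405, 0.733, 0.983]
group = pick_students(calculator_output, class_size=31, group_size=3)
print(group)  # [23, 4, 14]
```

With a computer available, a call such as `random.sample(range(1, 32), 3)` from Python's standard library would draw three distinct student numbers directly.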
TRY IT 3
A fitness studio plans to purchase new equipment and wants to conduct a survey of its membership. There are over 700 members and the studio wishes to survey only a portion of this membership. Upon purchasing a membership, every member has been assigned a 3-digit membership number. Describe how the studio can use the membership numbers to select a simple random sample of 80 members.
Show answer
A random number generator is used to generate a list of three-digit numbers. Each random number that is generated will be compared with the membership numbers. If the number has been assigned to a member, then that member will be one of the survey group. If the random number has not been assigned, or the member has already been selected, then the next random number is considered. This continues until 80 members have been selected.
Systematic Sampling
Systematic sampling is where the first sample member from a larger population is selected according to a random starting point. Additional sample members are then selected based on a fixed interval. The interval is calculated by dividing the population size by the desired sample size. If the population consists of 500 members and the desired sample size is 50, then the interval would be 500/50 = 10. Every tenth member of the population would be part of the sample.
EXAMPLE 4
A high school counsellor is conducting a survey of the graduating class which consists of 1243 students. Describe how the counsellor can select a systematic sample of 50 students.
Solution
The counsellor can interview 50 students. The interval is calculated as 1243 students/50 = 24.86 which rounds up to 25. This determines the interval increment as 25 so every 25th student will be in the sample.
To obtain the sample, the counsellor accesses the alphabetical list of graduates and generates a random number. Suppose the number is 03. The counsellor will interview the 3rd student on the list followed by every 25th student on the list. This will yield a sample of students 3, 28, 53, 78, and so on until 50 names have been chosen.
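The arithmetic in this example can be checked with a minimal Python sketch. The function name `systematic_sample` is hypothetical, and the starting position 3 is simply the value supposed in the solution; in practice it would be generated at random.

```python
import math


def systematic_sample(population_size, sample_size, start):
    """Return the list positions chosen by systematic sampling."""
    # 1243 / 50 = 24.86, rounded up to an interval of 25
    interval = math.ceil(population_size / sample_size)
    return [start + k * interval for k in range(sample_size)]


positions = systematic_sample(population_size=1243, sample_size=50, start=3)
print(positions[:4])   # [3, 28, 53, 78]
print(positions[-1])   # 1228
```

Because the interval was rounded up, the 50th position is start + 49 × 25, so a starting position above 18 would run past the end of the 1243-name list. This is one reason to check the last position before committing to a start.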
TRY IT 4
A fitness studio plans to purchase new equipment and wants to conduct a survey of its membership. There are over 700 members and the studio wishes to survey only a portion of this membership. Upon purchasing a membership, every member has been assigned a 3-digit membership number. Describe how the studio can use the membership numbers to select a systematic sample of 80 members.
Show answer
Since 80 members are needed for the survey, the total number of members will be divided by 80. Assume there are 724 members, then 724/80 = 9.05, which rounds to an increment of 9. This determines the increment for the intervals. A 3-digit random number is generated to determine the first member in the survey group and every 9th member after that will be included in the survey group. If the first member has the number 546, then every 9th member counting from 546 will be chosen. When the end of the membership list is reached, the increments will continue counting from the beginning of the list until 80 members are selected.
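The wrap-around counting in this answer can be sketched in Python under two assumptions that the answer does not state: there are exactly 724 members, and 546 is treated as a position on the membership list rather than as a membership number. The function name `systematic_sample_wrapped` is hypothetical.

```python
def systematic_sample_wrapped(member_count, sample_size, start_position):
    """Pick every k-th list position, wrapping to the top of the list."""
    interval = round(member_count / sample_size)   # 724 / 80 = 9.05 -> 9
    positions = []
    for k in range(sample_size):
        # Positions are 1-based; the modulo wraps past the end of the list
        position = (start_position - 1 + k * interval) % member_count + 1
        positions.append(position)
    return positions


chosen = systematic_sample_wrapped(member_count=724, sample_size=80, start_position=546)
print(chosen[:3])      # [546, 555, 564]
print(chosen[19:22])   # [717, 2, 11]
```

Since 80 steps of 9 cover 720 positions, which is fewer than 724, the count wraps around the list once without selecting any member twice.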
Cluster Sampling
To choose a cluster sample, divide the population into clusters (groups) and then randomly select some of the clusters. Every member from each of the selected clusters will be in the cluster sample. This type of sampling works best in populations that can be grouped into distinct groups. In a 50 floor apartment building, each floor could represent a cluster. In a hockey league, each team could be a cluster.
EXAMPLE 5
A textbook publisher plans to conduct a survey of the faculty at a college campus. There are 23 departments at the college. Describe how the publisher can use the departments to select a cluster sample made up of four clusters.
Solution
Let each department represent one cluster. The publisher numbers the departments from one to twenty-three and randomly selects four numbers, which determine the four departments. Only these four departments will form the cluster sample, and all faculty within the four departments (clusters) will be surveyed.
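A minimal Python sketch of this selection follows. The faculty names are placeholders, the five-per-department size is invented for the illustration, and `cluster_sample` is a hypothetical helper. The fixed seed is there only so the sketch gives the same result each time it is run.

```python
import random


def cluster_sample(clusters, clusters_to_pick, rng):
    """Randomly pick whole clusters and return every member in them."""
    picked = rng.sample(sorted(clusters), clusters_to_pick)
    members = [person for name in picked for person in clusters[name]]
    return picked, members


# Hypothetical data: 23 departments, each with 5 placeholder faculty names
departments = {
    dept: [f"dept{dept}_faculty{i}" for i in range(1, 6)]
    for dept in range(1, 24)
}

rng = random.Random(7)   # fixed seed only so the sketch is repeatable
picked, surveyed = cluster_sample(departments, 4, rng)
print(picked)            # the four department numbers that were drawn
print(len(surveyed))     # every faculty member in those departments
```

The randomness applies to the departments, not to individual faculty: once a department is drawn, everyone in it is surveyed.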
TRY IT 5
A textbook publisher plans to conduct a survey of the students at a college campus. There are 45 program areas ranging from 18 to 40 students in each program. Describe how the publisher can use the program areas to select a cluster sample of at least 100 students.
Show answer
The publisher numbers the program areas from one to forty-five and generates random numbers. The first random number is used to determine the first program area (cluster). Additional random numbers are assigned to clusters until there are at least 100 students for the survey. Only the students in the selected programs (clusters) will be surveyed.
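One way to sketch this answer in Python is shown below. The enrolment figures are randomly generated stand-ins within the stated range of 18 to 40 students, and `clusters_until` is a hypothetical helper. Shuffling the program numbers and reading them in order is used here as an equivalent of generating random numbers and discarding repeats.

```python
import random


def clusters_until(sizes, minimum_total, rng):
    """Draw clusters in random order until enough students are covered."""
    order = list(sizes)
    rng.shuffle(order)                 # random order of program numbers
    picked, total = [], 0
    for program in order:
        if total >= minimum_total:
            break
        picked.append(program)
        total += sizes[program]
    return picked, total


rng = random.Random(45)  # fixed seed only so the sketch is repeatable
# Hypothetical enrolments: 45 programs with 18 to 40 students each
program_sizes = {p: rng.randint(18, 40) for p in range(1, 46)}
picked, total = clusters_until(program_sizes, 100, rng)
print(picked, total)
```

With 18 to 40 students per program, between three and six programs are always needed to reach 100 students.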
Cluster sampling can reduce the need for resources and may be more efficient. Disadvantages are that it can introduce biases or it may not represent the total population. In Example 5, perhaps the textbook publisher is seeking feedback on its textbooks. If one or more of the chosen clusters does not use textbooks, then the results may not be reliable.
Convenience Sampling
A type of sampling that is non-random is called convenience sampling. Convenience sampling involves using results that are readily available or convenient.
EXAMPLE 6
A computer software developer seeks to determine which of its new video games are the most popular among females. Describe how the developer can select a convenience sample.
Solution
The developer can conduct a marketing study by going to a local electronic gaming store and asking all female shoppers, as they enter the store, if they will participate in a 3-minute survey on video games.
TRY IT 6
A fitness studio plans to purchase new equipment and wants to conduct a survey of its membership. There are over 700 members and the studio wishes to survey 80 of its members. Describe how the studio can select a convenience sample of 80 members.
Show answer
The studio owner prepares a survey and distributes it to all members who visit the studio over a 3-day period.
This form of sampling may be appealing due to its convenience but the results can be misleading. This type of surveying may be good in some cases but it can also be highly biased (favor certain outcomes) in others.
EXAMPLE 7
A study is done to determine the average tuition that undergraduate students pay per semester. Each student in the following samples is asked how much tuition he or she paid for the Fall semester. What is the type of sampling in each case? (simple random, systematic, cluster, or convenience)
- A random number generator is used to select a student from the alphabetically numbered email listing of all undergraduate students in the Fall semester. Starting with that student, every 50th student is chosen until 75 students are included in the sample.
- A random number generator is used to select 75 student ID numbers.
- The freshman, sophomore, junior, and senior years are numbered one, two, three, and four, respectively. A random number generator is used to pick two of those years. All students in those two years are in the sample.
- An administrative assistant is asked to stand in front of the library one day and to ask the first 100 undergraduate students he encounters what they paid for tuition in the Fall semester.
Solution
- systematic
- simple random
- cluster
- convenience
TRY IT 7
Determine the type of sampling used (simple random, systematic, cluster, or convenience).
- A pollster interviews all human resource personnel in five different high tech companies.
- A medical researcher interviews every third cancer patient from a list of cancer patients at a local hospital.
- A high school counselor uses a computer to generate 50 random numbers and then picks students whose names correspond to the numbers.
- A student interviews classmates in his algebra class to determine how many pairs of jeans a student owns, on the average.
Show answer
- cluster
- systematic
- simple random
- convenience
Potential Survey Issues
Users of statistical studies should be aware of the sampling method before accepting the results of the studies. Common problems to be aware of include:
- Nonrepresentative samples: A sample must be representative of the population under study. A sample that is not representative of the population is biased, and a biased sample gives results that are inaccurate and not valid. An example of a biased sample would be a survey on violence in sports where only the female students in a coed high school are surveyed.
- Self-selected samples: Surveys where responses are voluntary, such as call-in surveys, are often unreliable.
- Sample size issues: Samples that are too small may be unreliable. Larger samples are better, if possible. In some situations, small samples are unavoidable and can still be used to draw conclusions. Examples would include crash testing of cars or medical testing for rare conditions.
- Undue influence: Collecting data or asking questions in a way that influences the response. An example would be conducting a taste test of two sodas where one is refrigerated and the other is served at room temperature.
- Non-response or refusal of a subject to participate: The collected responses may no longer be representative of the population. Often, people with strong positive or negative opinions may answer surveys, which can affect the results. As an example, reviewers on Internet travel sites may not be representative of the entire population.
- Misleading use of data: Be aware of improperly displayed graphs, incomplete data, or lack of context.
Key Concepts
When conducting a survey we can choose from several sampling methods:
- Simple random sampling is where every sample of a given size is equally likely to be chosen, so each member of the population is as likely to be chosen as any other member.
- Systematic sampling is where the first sample member from a larger population is selected according to a random starting point. Additional sample members are then selected based on a fixed interval.
- Cluster sampling is where the population is divided into clusters (groups) and then a specific number of clusters is randomly selected. Every member from each of the selected clusters will be in the cluster sample.
- Convenience sampling is where the selection is made from a part of the population that is easy to access.
Glossary
qualitative data
are the result of categorizing or describing attributes of a population using our senses such as sight or touch.
quantitative data
are the result of counting or measuring a specific attribute of a population.
7.3 Exercise Set
- Shoppers at a farmer’s market were surveyed to determine how environmentally and market friendly they were. The survey recorded A) the type of bag (cloth, plastic, none, wicker, other), B) the number of bags (0, 1, 2, 3, more than 3), C) the number of market visits per year, D) the average amount of money spent per visit at the market, and E) the preferred vendor(s). Which of A, B, C, D, E are qualitative and which are quantitative?
- A census yields a wide variety of data. State whether each of the following questions would provide qualitative or quantitative data.
- What province do you live in?
- How many years have you lived at your current address?
- What type of dwelling do you live in (house, apartment, condo, mobile home, other)?
- How many people live in your home?
- How many languages do you speak?
- What languages do you speak?
- What is your occupation?
- What is your annual salary?
- Consider a typical classroom in college or university. Name two types of qualitative data and two types of quantitative data that could be collected. e.g. qualitative – country of birth; quantitative – score from an entrance exam
- A study is done to determine the food outlet preferences for all students living on campus in the fall semester. Each student in the sample will be asked the same set of 10 questions. Four different sampling techniques are described below. What is the type of sampling in each case? (simple random, systematic, cluster, or convenience).
- There are 8 different student residences on campus. Two residences are randomly selected and every student living in those two residences is surveyed.
- The surnames of all students living on campus are arranged alphabetically and numbered from 1 to n (where n is the number of students living on campus). A random number generator is used to determine a number between 1 and 50. This number is matched to a student with the same number. Starting with that student, every 50th student is chosen until the required number of students is chosen for the sample.
- One of the food outlets is chosen by drawing one outlet name. Over a four hour period one day, four helpers stop all students entering that food outlet and if they live on campus they are administered the survey.
- A computer is used to generate random numbers that have the same format as the students’ ID numbers. Random numbers are generated until 100 random numbers are matched by student number to a student living on campus. These 100 students are contacted and arrangements are made for the interviewer to meet with the student.
- State one advantage and one disadvantage for each of systematic sampling, cluster sampling and convenience sampling.
- A marketing company wants to determine which is more popular – its lemonade or a competitor’s lemonade. The company sets up a booth at a local arena the evening of a Professional Boxing Match. Anyone who visits the booth is asked to choose their favourite lemonade from two unmarked glasses of lemonade. The marketing company’s lemonade is made onsite and served with ice and a fresh slice of lemon; the competitor’s lemonade is poured straight from a bottle. All taste testers receive a chance to win a television. Name at least three problems with the methodology used for this marketing company’s taste test.
Answers
- Qualitative is A & E; Quantitative is B, C, D
- Qualitative is a, c, f, g; Quantitative is b, d, e, h
- Answers will vary
-
- Cluster
- Systematic
- Convenience
- Simple Random
- Answers may vary. Systematic sampling avoids bias but it involves a commitment in time. Cluster sampling involves less time to determine the sample but it can be biased. Convenience sampling can involve less effort but it may not be representative of the population.
- Nonrepresentative sample – attendees at a boxing match may not be interested in lemonade; Possibly not a big enough sample; Undue influence – the two lemonades are served up very differently; Not random but instead involves self-selection by the participants (the taste testers must choose to go to the booth); Testers might participate only for the chance to win a TV and may not provide reliable feedback.
Attribution
This chapter has been adapted from “Data, Sampling, and Variation in Data and Sampling” in Introductory Statistics (OpenStax) by Barbara Illowsky and Susan Dean which is under a CC BY 4.0 Licence. Adapted by Kim Moshenko. See the Copyright page for more information.