Introduction to Statistics

By the end of this topic, you should be able to:

Introduction to statistics

  • Understand why people study statistics
  • Distinguish between descriptive and inferential statistics
  • Distinguish between qualitative and quantitative variables

Statistical terms

  • Understand some statistical terms

Methods of data collection

  • Distinguish among nominal, ordinal, interval and ratio levels of measurement
  • Distinguish between primary data and secondary data
  • Understand various data collection methods

Sampling Techniques

  • Distinguish some of the sampling techniques: Random and Non-Random Sampling

Definition of Statistics:

Statistics is the method of collecting, organizing, summarizing, presenting, analyzing and interpreting data (information) in a convenient and informative way to assist in making more effective decisions.

Statistics plays a vital role in today’s world, where data is being generated at an unprecedented rate. It helps us to understand and make sense of the vast amount of information available to us. Here are a few examples of how statistics is used in the current scenario:

In Healthcare: Statistics is used to analyze and interpret medical data, such as patient outcomes, drug efficacy, and disease prevalence. It is also used to predict the likelihood of diseases and help develop effective treatment plans. For example, during the ongoing COVID-19 pandemic, statisticians are playing a crucial role in analyzing the spread of the virus and predicting its impact on healthcare systems.
In Business: Statistics is used in business to analyze sales data, customer behavior, market trends, and financial performance. It helps businesses to make informed decisions about pricing, product development, and marketing strategies. For example, an e-commerce company can use statistics to identify customer preferences, optimize their website for maximum conversions, and target the right audience through advertising.
In Sports: Statistics is used in sports to analyze player performance, predict outcomes, and make strategic decisions. It is used in various sports, including baseball, basketball, football, and soccer. For example, in baseball, statistics such as batting average, runs batted in, and earned run average are used to evaluate player performance and determine player salaries.
In Politics: Statistics is used in politics to analyze voting patterns, conduct public opinion polls, and predict election outcomes. It helps politicians to understand their constituents’ needs and make informed decisions about policy-making. For example, in the United States, polling data is used to predict the outcome of presidential elections and guide campaign strategies.

In conclusion, statistics plays a critical role in many areas of our lives, including healthcare, business, sports, and politics. It helps us to make informed decisions based on data and enables us to better understand the world around us.

Statistics can be categorized as descriptive statistics and inferential/inductive statistics.

Descriptive statistics is a crucial part of statistical analysis that helps us to understand the nature of a set of data. Here are a few examples of how descriptive statistics is used in the current scenario:

  1. In Healthcare: Descriptive statistics is used to summarize patient data, such as age, gender, and medical history, and identify patterns in the data. It is also used to describe the distribution of medical data, such as blood pressure, cholesterol levels, and BMI, in a population. For example, a study may use descriptive statistics to determine the average age of patients with a particular disease and the range of symptoms they exhibit.

  2. In Business: Descriptive statistics is used to summarize sales data, such as revenue, profit margins, and customer behavior. It helps businesses to identify trends in customer preferences and make informed decisions about pricing, product development, and marketing strategies. For example, a business may use descriptive statistics to calculate the average purchase value of customers and identify the most popular products.

  3. In Education: Descriptive statistics is used in education to summarize student performance data, such as test scores, grades, and attendance. It helps educators to identify areas of improvement and track progress over time. For example, a school district may use descriptive statistics to determine the average test score of students in different subjects and identify areas where students are struggling.

  4. In Social Sciences: Descriptive statistics is used in social sciences to summarize data related to attitudes, beliefs, and behaviors. It helps researchers to understand the patterns and trends in the data and make informed decisions about policy-making. For example, a survey may use descriptive statistics to determine the percentage of the population who support a particular policy and identify the factors that influence their attitudes.

In conclusion, descriptive statistics is a critical tool for summarizing and organizing data into meaningful patterns. It is used in various fields, including healthcare, business, education, and social sciences, to make informed decisions and identify areas of improvement.
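
As a minimal sketch of what "summarizing data" looks like in practice, the snippet below computes a few common descriptive measures with Python's standard `statistics` module. The test scores are made-up illustrative values, not data from this topic.

```python
import statistics

# Hypothetical test scores for one class (illustrative data only).
scores = [55, 62, 70, 70, 78, 85, 91]

# Each entry describes the data set itself; nothing is inferred beyond it.
summary = {
    "count": len(scores),
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
    "range": max(scores) - min(scores),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Every number printed here describes only the seven scores in the list, which is what separates descriptive statistics from inferential statistics.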

Inferential statistics is an important branch of statistical analysis that is used to make predictions and draw conclusions about a population based on a sample of data. Here are a few examples of how inferential statistics is used in the current scenario:

  1. In Healthcare: Inferential statistics is used to make predictions about a population based on a sample of patient data. It is also used to test hypotheses about the effectiveness of medical treatments and identify the factors that influence patient outcomes. For example, a study may use inferential statistics to determine if a new drug treatment is more effective than an existing treatment based on a sample of patient data.

  2. In Business: Inferential statistics is used to make predictions about customer behavior and market trends based on a sample of sales data. It is also used to test hypotheses about the relationship between different variables, such as price and demand. For example, a business may use inferential statistics to determine if there is a significant relationship between the price of a product and the demand for that product based on a sample of sales data.

  3. In Education: Inferential statistics is used to make predictions about student performance and identify factors that influence academic achievement based on a sample of student data. It is also used to test hypotheses about the effectiveness of different teaching methods and educational interventions. For example, a study may use inferential statistics to determine if there is a significant difference in academic achievement between students who receive one-on-one tutoring and those who receive group instruction based on a sample of student data.

  4. In Social Sciences: Inferential statistics is used to make predictions about attitudes, beliefs, and behaviors based on a sample of survey data. It is also used to test hypotheses about the relationships between different variables, such as income and political affiliation. For example, a survey may use inferential statistics to determine if there is a significant relationship between income level and support for a particular political party based on a sample of survey data.

In conclusion, inferential statistics is a powerful tool for making predictions and drawing conclusions about a population based on a sample of data. It is used in various fields, including healthcare, business, education, and social sciences, to test hypotheses, make predictions, and identify relationships between different variables.

Statistical terms

  1. Research survey – A study done using statistical methods in order to understand a certain problem.
  2. Element – The respondent or object on which data is taken.
  3. Population – All elements under study, whether living or non-living objects.
  4. Sample – A subset or part of the population.
  5. Sampling frame – A complete list of all elements in a population.
  6. Pilot survey – A study done on a small scale before the actual survey.
  7. Census – A study done on the entire population.
  8. Parameter – A summary measure or characteristic obtained from the population.
  9. Statistic – A summary measure or characteristic obtained from a sample.
  10. Variable/Attribute – A characteristic of the population under study.

Two types of variable:

  1. Qualitative variable – measured according to specific categories or characteristics. Example: gender (male, female), marital status (single, married), race (Malay, Indian, Chinese), grade (A, B, C)

  2. Quantitative variable – the variable studied is expressed in terms of numbers (numerical values). Example: number of students, total income, distance traveled, test marks, etc.
    A quantitative variable can further be classified as:

    1. Discrete – can assume only certain exact (countable) values
      • Example: number of students, annual sales, total income, shoe size, etc.
    2. Continuous – can take any value within a range and is expressed to a certain degree of accuracy
      • Example: distance traveled, litres of petrol, weight and height of children, etc.

Example 1:
Consider a population of 120,000 students in Terengganu. It was found that the mean height of the students is 148 cm and the variance is 1.5 cm². It was also found that the mean height of 1,500 students in Dungun High School is 152 cm and the variance is 2 cm².

Population – 120,000 students in Terengganu
Sample – 1,500 students in Dungun High School
Element – student
Variable – height of students

Parameter vs Statistic

Parameter is a numerical measure used to describe a population.

Statistic is a numerical measure used to describe a sample.

Below are some examples of parameter and statistic based on the information in the previous example.

Parameters (population):
  i. Population size, N = 120,000
  ii. Mean, \(\mu\) = 148 cm
  iii. Variance, \(\sigma^2\) = 1.5 cm²

Statistics (sample):
  i. Sample size, n = 1,500
  ii. Mean, \(\bar{x}\) = 152 cm
  iii. Variance, \(s^2\) = 2 cm²
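
The distinction can be sketched in code. The example below is illustrative only: it builds a hypothetical population of 1,000 heights (not the Terengganu data), draws a sample of 50, and computes the parameters from the population and the statistics from the sample. Note the assumption that the sample variance uses the usual \(n-1\) divisor, which is what `statistics.variance` does, while `statistics.pvariance` divides by \(N\).

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 1,000 heights in cm (made-up data).
population = [rng.gauss(148, 1.2) for _ in range(1000)]
sample = rng.sample(population, 50)

# Parameters: computed from every element of the population.
N = len(population)
mu = statistics.fmean(population)
sigma2 = statistics.pvariance(population)  # divides by N

# Statistics: computed from the sample only.
n = len(sample)
x_bar = statistics.fmean(sample)
s2 = statistics.variance(sample)  # divides by n - 1

print(f"Parameters: N={N}, mu={mu:.2f} cm, sigma^2={sigma2:.2f} cm^2")
print(f"Statistics: n={n}, x_bar={x_bar:.2f} cm, s^2={s2:.2f} cm^2")
```

The sample values will usually be close to, but not equal to, the population values; that gap is what inferential statistics has to account for.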

Data and Measurement

Data is a collection of observations, measurements or information obtained from a study that is carried out.

Sources of data:

  1. Primary data – data that is gathered and published for the first time by the researcher.
    Advantages:
      i. Satisfies the research objectives.
      ii. More up to date.
    Disadvantages:
      i. Very costly.
      ii. Time consuming.
      iii. Sensitive data is difficult to collect directly from the respondent.
  2. Secondary data – data that is obtained from other sources (not the researcher) such as annual reports, journals, newspapers, the internet, etc.
    Advantages:
      i. Easy to obtain.
      ii. Less costly.
      iii. Can be obtained in large quantities.
    Disadvantages:
      i. Data might not satisfy the research objective.
      ii. There might be errors committed by the original researchers.

Measurement is simply the act of determining the value of a variable, or of assigning a number to a variable.

Level of measurement:

  1. Nominal – qualitative; values are categories with no inherent order
    • Example: gender (1 = Male, 2 = Female)
  2. Ordinal – categorical, and the categories are arranged in a certain order
    • Example: level of education (1 = Primary, 2 = Secondary, 3 = University, 4 = Post-graduate)
  3. Interval – has order, can describe ‘how much more or how much less’ of a characteristic, and has an ‘arbitrary zero point’
    • Example: level of satisfaction, temperature
  4. Ratio – has all the characteristics discussed above plus an ‘absolute (true) zero point’
    • Example: height of students
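
The practical difference between an arbitrary zero and a true zero can be sketched with a short calculation. The temperatures and heights below are made-up values chosen for illustration. On an interval scale (Celsius), differences are meaningful but ratios change when the zero point moves; on a ratio scale (height), ratios stay the same whatever the unit.

```python
def celsius_to_kelvin(c):
    """Shift the zero point: the same temperature on the Kelvin scale."""
    return c + 273.15

# Interval scale: 20 °C versus 10 °C.
diff_celsius = 20 - 10
diff_kelvin = celsius_to_kelvin(20) - celsius_to_kelvin(10)
ratio_celsius = 20 / 10
ratio_kelvin = celsius_to_kelvin(20) / celsius_to_kelvin(10)

# Ratio scale: a height of 160 cm versus 80 cm, then the same heights in metres.
ratio_cm = 160 / 80
ratio_m = 1.60 / 0.80

print(f"Difference: {diff_celsius} in Celsius, {diff_kelvin:.2f} in Kelvin")
print(f"'Twice as hot'? ratio is {ratio_celsius} in Celsius but {ratio_kelvin:.3f} in Kelvin")
print(f"Twice as tall: ratio is {ratio_cm} in cm and {ratio_m:.1f} in m")
```

The difference of 10 degrees survives the change of scale, but the ratio of 2 does not, so "20 °C is twice as hot as 10 °C" is not a valid statement, whereas "160 cm is twice as tall as 80 cm" is.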

Exercise 1:

  1. Determine whether you would use descriptive statistics or inferential statistics for each of the following.
    1. A trainer wanted to determine the minimum time taken by his swimmers to swim 100 m.
    2. An economist uses a bar chart to illustrate the loss made by an airline company from 1990 to 2000.
    3. A few botanists conduct research on the relation between durian production and the usage of cow manure as fertilizer.
    4. Psychologists study whether urban students are higher achievers as compared to suburban students.
    5. Dewan Bandaraya Kuala Lumpur formed a committee to investigate the relation between the occurrence of flash floods and the amount of rubbish in Sungai Gombak and Sungai Kelang.
  2. Determine whether each of the following is a constant or a variable. If it is a variable, determine whether it is quantitative or qualitative. If it is quantitative, determine whether it is discrete or continuous.
    1. Number of days in February.
    2. Marks to get grade B.
    3. Maximum marks to get grade B.
    4. Marital status of the workers in a firm.
    5. The length of 2000 screws in a production line.

Sampling and Census

  1. Census – to study the whole population.
    Advantages:
      i. Data are collected from all elements.
      ii. Data are more complete.
    Disadvantages:
      i. Very costly and time consuming.
      ii. Results may be out of date.
  2. Sampling – to study a sample. Sampling is the process of selecting a sample from a population.
    Advantages:
      i. Less costly and requires less time.
      ii. Results are more up to date.
    Disadvantages:
      i. Data are not collected from all elements.
      ii. Data are less complete.

Sampling Techniques

Random Sampling

Also known as Probability Sampling. Every element in the population has an equal chance of being selected for the sample.

a. Simple Random Sampling (SRS)

Sampling frame must be available.

Two methods can be used to randomly select n elements, where n is the sample size:

  1. Lucky draw method
  2. Random numbers

Example 2:
A group of researchers planned to survey the family backgrounds of all students studying in UiTM. Due to time constraints, they decided to survey only 300 students. By using simple random sampling, discuss how they would select the sample.

Make a list of all the students who are studying in UiTM. Assign each student a unique number, from 1 up to the number of the last student on the list.

Using lucky draw:
Write each number on a small slip of paper and deposit all the slips in a box. The first selection is made by drawing a slip out of the box without looking at it. This process is repeated until a sample of 300 is chosen.

Using random numbers:

  1. Refer to a table of random numbers. Starting at any point in the table, read across or down and note every number that falls between 1 and the number of the last student, until 300 different numbers have been noted. Use these numbers to pull the corresponding names from the list. These 300 students are your sample. OR

  2. Use random numbers generated by computer software to select the sample. The students corresponding to the numbers produced by the computer will be the sample.

Advantages:
  i. Every element has an equal chance of being selected.

Disadvantages:
  i. Not suitable for a heterogeneous population.

One example of conducting simple random sampling using the lottery method is as follows:
Suppose a researcher wants to survey a population of 100 university students about the quality of the food served in the cafeteria, using a sample of 10 students. The researcher could use the following steps:
1. Assign a unique number to each student. For example, the first student can be assigned number 1, the second student number 2, and so on until all 100 students are assigned a number.
2. Write all the numbers on separate pieces of paper and put them in a bowl or hat.
3. Mix the pieces of paper thoroughly.
4. Without looking, draw 10 pieces of paper from the bowl or hat.
5. The students corresponding to the 10 selected numbers are the sample for the survey.
With the lottery method, each student has an equal chance of being selected for the survey, which helps the sample represent the entire population.
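
The "random numbers generated by computer software" option can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function name `simple_random_sample` and the population size of 20,000 are assumptions made for the example (the actual number of UiTM students is not given here). It uses `random.sample`, which draws without replacement, just like slips drawn from a box.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw `sample_size` distinct ID numbers from 1..population_size."""
    rng = random.Random(seed)
    # Sampling without replacement: no student can be chosen twice.
    return rng.sample(range(1, population_size + 1), sample_size)

# Hypothetical sampling frame of 20,000 numbered students; sample of 300.
chosen_ids = simple_random_sample(population_size=20000, sample_size=300, seed=42)
print(sorted(chosen_ids)[:10])  # first few selected ID numbers
```

The selected ID numbers are then matched against the numbered list of students (the sampling frame) to find who is in the sample.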

b. Systematic Random Sampling (SYRS)

Sampling frame must be available. How to collect the sample?

Steps:

  1. Identify the population size (N) and the sample size (n).
  2. Obtain the sampling interval k by dividing the population size by the sample size.
    Sampling interval, \(k\ =\ \frac{N}{n}\)
  3. Randomly select one element from the first \(k\) elements in the list (using SRS). Suppose the \(r\)th element is selected.
  4. Lastly, starting with the \(r\)th element, sample every \(k\)th element in the population until a sample of size n is obtained, i.e., the elements numbered \(r,\ r+k,\ r+2k,\ \ldots,\ r+(n-1)k\).

Example 3:
There are 200 elements in the population and a sample of 10 is desired. Discuss how the sample can be selected by using Systematic Random Sampling.

  1. \(N=200,\ n=10.\)
  2. Sampling Interval, \(k\ =\ \frac{200}{10}\ =\ 20\)
  3. Randomly select a number between 1 and 20 by using SRS. Suppose number 2 is selected.
  4. Then the sample shall consist of the elements
    \[ \begin{aligned} &2,\ 2+20,\ 2+2(20),\ 2+3(20),\ \ldots,\ 2+9(20) \\ &2,\ 22,\ 42,\ 62,\ 82,\ 102,\ 122,\ 142,\ 162,\ 182. \\ \end{aligned} \]
Advantages:
  i. Every element has an equal chance of being selected.

Disadvantages:
  i. More difficult to use.
  ii. In order to get a good sample, the population must be properly arranged.
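
The four steps of systematic random sampling can be sketched as a short function. The name `systematic_sample` and its parameters are hypothetical, chosen for this illustration, and the sketch assumes that N is an exact multiple of n, as in Example 3.

```python
import random

def systematic_sample(N, n, start=None, seed=None):
    """Return the positions r, r+k, ..., r+(n-1)k for a frame numbered 1..N."""
    k = N // n  # sampling interval (assumes N is a multiple of n)
    if start is None:
        # Randomly pick the starting element r from the first k elements.
        start = random.Random(seed).randint(1, k)
    return [start + i * k for i in range(n)]

# Example 3: N = 200, n = 10, and the random start happens to be 2.
example_3 = systematic_sample(N=200, n=10, start=2)
print(example_3)
```

With `start=2` this reproduces the elements listed in Example 3; leaving `start` unset lets the function choose the starting element at random.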

c. Stratified Random Sampling

Applicable for a population that is divided into categories (strata), such as by sex, race, etc.

Characteristics of the population:

Example 4:
A group of researchers planned to survey all workers in an industrial area. The workers are divided by race as shown below. In order to save cost, the researchers decided to survey only 600 of the workers. Discuss how the sample can be selected by using stratified random sampling.

Race: subpopulation size, number in sample
  • Malay: 2800, \(n_1\ =\ \frac{2800}{4500}\ \times\ 600\ \approx\ 373\)
  • Chinese: 1250, \(n_2\ =\ \frac{1250}{4500}\ \times\ 600\ \approx\ 167\)
  • Indian: 450, \(n_3\ =\ \frac{450}{4500}\ \times\ 600\ =\ 60\)
  • Total: 4500, \(n\ =\ 600\)

To sample each of the strata, use either simple random sampling or systematic random sampling.

Advantages:
  i. Every element has an equal chance of being selected.
  ii. Suitable for a categorized population.

Disadvantages:
  i. More difficult to use.
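
The proportional allocation in Example 4 can be checked with a few lines of code. This is a sketch under the assumption that each stratum's share is rounded to the nearest whole worker; in other data sets the rounded shares may not add up exactly to the sample size and would need a manual adjustment. Workers are represented by hypothetical ID numbers within each stratum.

```python
import random

rng = random.Random(3)  # fixed seed so the sketch is repeatable

strata_sizes = {"Malay": 2800, "Chinese": 1250, "Indian": 450}
sample_size = 600
total = sum(strata_sizes.values())

# Proportional allocation: n_i = (N_i / N) * n, rounded to a whole number.
allocation = {
    race: round(size * sample_size / total)
    for race, size in strata_sizes.items()
}
print(allocation)

# Within each stratum, draw the allocated number by simple random sampling
# from hypothetical worker IDs 1..N_i.
stratified_sample = {
    race: rng.sample(range(1, strata_sizes[race] + 1), allocation[race])
    for race in strata_sizes
}
```

The allocation matches the figures in Example 4, and the second step corresponds to sampling each stratum with simple random sampling.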

d. Cluster Sampling

Applicable for a population that is divided into clusters that are similar (homogeneous) to one another, while the elements within each cluster are heterogeneous.

How to use cluster sampling?

Example 5:
A group of researchers planned to survey all families in Kuala Besut, living in 50 villages. In order to save cost, they decided to survey only 10 villages. Discuss by using cluster sampling.

Suppose the district of Kuala Besut is divided into 50 villages. Then, by using simple random sampling or systematic random sampling, select 10 villages from the 50 villages. Survey all of the elements (families) in the 10 selected villages.

Advantages:
  i. Suitable for a population that is quite large.
  ii. Suitable for a clustered population.

Disadvantages:
  i. Difficult to ensure that the clusters are similar/homogeneous.
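
Example 5 can be sketched in code as follows. The village names and the three families per village are hypothetical placeholders; only the numbers 50 and 10 come from the example. The key point the sketch shows is that randomness is applied to the clusters, and then every element of each chosen cluster is surveyed.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical frame: 50 villages, each with a made-up list of 3 families.
villages = {
    f"Village {v}": [f"Village {v} / Family {fam}" for fam in range(1, 4)]
    for v in range(1, 51)
}

# Step 1: simple random sample of 10 clusters (villages).
selected_villages = rng.sample(sorted(villages), 10)

# Step 2: survey every family in each selected village.
surveyed_families = [
    family for village in selected_villages for family in villages[village]
]
print(len(selected_villages), "villages,", len(surveyed_families), "families")
```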

e. Multi-Stage Sampling

Suitable for a large population. Selection done by stages.

Example 6:
A group of researchers planned to survey the background of all form 5 students in Terengganu. They decided to use sampling. Discuss.

Suppose they randomly select, in stages:

  1. 5 districts from 11 districts in Terengganu.
  2. 6 schools from each selected district.
  3. 25 students from each selected school.
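
The stage-by-stage selection in Example 6 can be sketched as nested random draws. The sampling frame below is entirely hypothetical (11 districts with 10 schools each and 100 students per school); only the stage sizes 5, 6 and 25 come from the example.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical frame: district -> school -> list of student labels.
frame = {
    f"District {d}": {
        f"D{d}-School {s}": [f"D{d}-S{s}-Student {t}" for t in range(1, 101)]
        for s in range(1, 11)
    }
    for d in range(1, 12)
}

sample = []
# Stage 1: 5 districts out of 11.
for district in rng.sample(sorted(frame), 5):
    schools = frame[district]
    # Stage 2: 6 schools from each selected district.
    for school in rng.sample(sorted(schools), 6):
        # Stage 3: 25 students from each selected school.
        sample.extend(rng.sample(schools[school], 25))

print(len(sample))  # 5 * 6 * 25 students
```

Under these stage sizes the final sample contains 5 × 6 × 25 = 750 students.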

Non-random Sampling

Also known as Non-Probability Sampling: not all elements in the population have an equal chance of being selected for the sample.

a. Quota Sampling

Suitable when a sampling frame is not available; commonly used in market research.

Example 7:
A group of researchers planned to survey 120 house-owners in Dungun who have been using Sharp washing machines for more than 2 years. Discuss.

The numbers allocated for each group of respondents is based on the population statistics. The researcher has the flexibility to choose whomever he wants as long as the specifications set are met. 

b. Convenience Sampling

The researcher has the flexibility to select anybody that they want or meet until the required sample is obtained.

Convenience sampling is a non-probability sampling technique where individuals or objects are selected based on their availability and accessibility. This type of sampling is often used when the researcher needs a quick and easy way to collect data.

A common example of convenience sampling is conducting a survey on a college campus. The researcher may stand in a busy area of the campus and approach individuals who happen to walk by, asking if they would like to participate in the survey. This type of sampling can be quick and easy, but it is not representative of the larger population, as individuals who are not on the college campus at that time will not be included in the sample.

Another example is conducting an online survey and sharing the link on social media platforms. The researcher can quickly collect responses from individuals who are easily accessible through social media. However, this type of sampling may not be representative of the larger population as it only includes individuals who are active on social media platforms and willing to take the survey.

Convenience sampling can be useful for exploratory studies, pilot studies, or when time and resources are limited. However, the results may not be generalizable to the larger population, and it is important to recognize the limitations of the sampling technique when interpreting the results.

c. Judgmental Sampling

The researcher selects respondents who, in the researcher's judgment, have the characteristics that he or she wants to study.

Judgmental sampling is a non-probability sampling technique in which the researcher selects participants based on the researcher's own judgment or expertise in a particular area. The researcher may choose to include individuals who are thought to be representative of the population or who have a unique perspective on the topic of interest. This method is often used when a particular group of individuals is difficult to reach or when a researcher wants to focus on a specific subpopulation.

For example, a researcher studying the attitudes of professional athletes towards doping may use judgmental sampling to select former athletes who have publicly spoken out against doping. The researcher may believe that these individuals will provide valuable insights and perspectives on the topic due to their experience in the field.

Another example could be a market research study on a new product where the researcher selects a few experts or industry leaders who have extensive knowledge of the target market to provide their opinion on the product.

It is important to note that judgmental sampling is a subjective method and can be prone to bias as the researcher’s personal judgment may influence the selection process. Therefore, this method may not provide a representative sample of the population, and the findings may not be generalizable to the larger population.

d. Snowball Sampling

An initial group of respondents is selected, usually at random. After being interviewed, these respondents are asked to identify others who belong to the target population of interest.

Snowball sampling is a type of non-probability sampling technique where participants are selected based on referrals from other participants in the study. This technique is often used when the population of interest is difficult to identify or access.

The process of snowball sampling begins with the researcher identifying a few initial participants who meet the study’s inclusion criteria. These participants are then asked to refer other individuals they know who also meet the criteria. This process continues until the desired sample size is reached.

For example, imagine a researcher is interested in studying the experiences of individuals who have recovered from drug addiction. The researcher may start by recruiting a few individuals who have gone through a drug addiction recovery program. These initial participants may be asked to refer others they know who have also gone through recovery programs. The process continues, and the researcher interviews each new participant, asking them to refer others they know who have gone through recovery programs.

One advantage of snowball sampling is that it can help researchers identify individuals who are difficult to locate or may be hesitant to participate in the study. However, a disadvantage is that the sample may be biased towards individuals who are more socially connected or have more extensive networks, potentially limiting the generalizability of the study’s findings.
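
The referral process can be sketched as a walk over a referral network. Everything in this example is hypothetical: the participants "A" to "H", who refers whom, and the function name `snowball_sample`. The sketch interviews people in the order they were recruited and stops once the desired sample size is reached.

```python
from collections import deque

# Hypothetical referral network: each participant names others who qualify.
referrals = {
    "A": ["C", "D"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": ["F", "G"],
    "E": [],
    "F": ["H"],
    "G": [],
    "H": [],
}

def snowball_sample(seeds, referrals, target_size):
    """Recruit seeds first, then the people they refer, wave by wave."""
    recruited = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(recruited) < target_size:
        person = queue.popleft()
        recruited.append(person)  # interview this participant
        for contact in referrals.get(person, []):
            if contact not in seen:  # skip people already referred
                seen.add(contact)
                queue.append(contact)
    return recruited

participants = snowball_sample(["A", "B"], referrals, target_size=6)
print(participants)
```

Anyone who is never named by an earlier participant can never enter the sample, which is the source of the bias towards well-connected individuals described above.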

Data Collection Method

Generally there are 6 methods of data collection that can be used in order to collect the primary data. They are:

i. Personal interview

Researcher talks to the respondent face to face.

Advantages:
  i. Produces the highest response rate.
  ii. The interviewer can explain any unclear questions.

Disadvantages:
  i. Very costly and time consuming.
  ii. Interviewers must be properly trained.

ii. Telephone interview

Interviewer asks questions from a prepared questionnaire

Advantages:
  i. Less costly and requires less time.
  ii. Can contact respondents several times.

Disadvantages:
  i. Appropriate only for a population with telephones.
  ii. Respondents might refuse to cooperate.

iii. Mailing

A questionnaire is sent to each respondent with a stamped addressed envelope attached.

Advantages:
  i. Less costly.
  ii. Can be used for any population size.

Disadvantages:
  i. Response rate is very low.
  ii. It is uncertain when the questionnaires will be returned.

iv. Direct observation

Respondents will be observed without their knowledge

Advantages:
  i. Data obtained are very accurate.
  ii. The information obtained is not affected by the respondents.

Disadvantages:
  i. Very costly and time consuming.
  ii. The observer needs to be highly skilled and unbiased.

v. Direct Questionnaire

The researcher gives the questionnaire directly to the respondent and waits for them to complete it.

vi. Other methods

E-mail, internet surveys and short messaging services (SMS).

Designing A Questionnaire

Before you begin drafting your questionnaire, it is important to consider:

Some guidelines in designing a questionnaire

  1. Design questions to meet the objective of the research.
  2. Questions must be short and clear.
  3. Limit the number of questions.
  4. Use language understood by any layman.
  5. Double-barreled questions should be avoided. E.g. “Do you think there is a good market for the product and that it will sell well?” could bring a “yes” response to the first part and a “no” response to the latter part.
  6. Ambiguous questions should be avoided. E.g. “To what extent would you say you are happy?” Respondents might be unsure whether the question refers to their state of feelings at the workplace, or at home, or in general.
  7. Avoid questions that require respondents to recall experiences from the past; they may be unable to give correct answers, and their responses may be way off.
  8. Leading questions should be avoided. E.g. “Don’t you think that in these days of escalating costs of living, employees should be given good pay raises?” By asking such a question, we are signalling and pressuring respondents to say “yes”. A less biased way of asking would be: “To what extent do you agree that employees should be given higher pay raises?”

Exercise 2:

  1. A researcher wishes to study the career aspirations of students from the Faculty of Accountancy, which consists of 50 classes. The researcher intends to choose only 10 classes and all the students from these 10 classes will be chosen for the study.

    1. State the population for the above study
    2. State the variable for this study. What type of variable is it?
    3. State the sampling technique that is used for this study.
  2. A group of researchers from Yayasan ABC conducted a survey on their sponsored students who are currently pursuing their studies at local universities. The purpose of the study is to determine the average monthly amount spent on academic books by these students. A list of 350 students’ names arranged alphabetically and addresses was obtained. A random sample of 70 students was selected from the list.

    1. State the population of the study.
    2. State the variable mentioned in the study.
    3. Suggest an appropriate sampling technique to be used.
    4. Explain how the sampling technique chosen in (3) is carried out.
    5. What is the most suitable data collection method to be used for the above study? Give one advantage and one disadvantage of this method.