Basic Concepts of Statistics

Statistics is the branch of mathematics that deals with the collection, organization, analysis, and interpretation of numerical data. Since research yields such quantitative data, statistics is a basic tool of measurement, evaluation, and research.

The word statistics is sometimes used to refer to any measure computed from data obtained on a characteristic of a population under study. Suppose one wants to know the average number of senior citizens per family in a certain district. If this average is taken from only 392 of the 20,000 families of that community, then the average is a statistic. However, if all 20,000 families are used in the calculation, the resulting average is referred to as a parameter. The whole group of 20,000 families about which we make estimates is the population or universe, and the smaller group of 392 families, which we selected as part of the population, is the sample.
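The distinction can be illustrated with a minimal Python sketch. The per-family counts are simulated and purely hypothetical, since the text gives no actual data; only the population size of 20,000 and the sample size of 392 come from the example.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: number of senior citizens in each of 20,000 families.
population = [random.randint(0, 3) for _ in range(20_000)]

# A simple random sample of 392 families, drawn without replacement.
sample = random.sample(population, 392)

parameter = mean(population)  # computed from every family: a parameter
statistic = mean(sample)      # computed from the sample only: a statistic

print(f"parameter (population mean): {parameter:.3f}")
print(f"statistic (sample mean):     {statistic:.3f}")
```

The two averages will usually be close but not identical, which is why a statistic is treated as an estimate of the parameter.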

VARIABLES

An important characteristic of many research questions is that they imply a relationship of some sort to be investigated. At this point, it is necessary to introduce the concept of variables, for a relationship is a statement about variables. A variable is any observable characteristic of a person or object that may take on several values or may be expressed in several different categories; the individual members in the class of objects must differ or vary to qualify the class as a variable. If all members of the class are identical, we do not have a variable. Such a characteristic is called a constant, since the individual members of the class do not vary.

QUANTITATIVE AND CATEGORICAL VARIABLES

Variables can be classified as either quantitative or categorical. Quantitative variables exist in some degree (rather than all or none) along a continuum from less to more, and we can assign numbers to different individuals or objects to indicate how much of the variable they possess. Variables that can take only specific or isolated values along a scale are called discrete variables. For example, when we count the number of OFWs and the like, we have a discrete variable.

Quantitative variables that are not discrete are called continuous variables. These variables are measured rather than counted, such as the exact ages of students in a certain class, the heights and weights of children, and temperature.

By way of contrast, categorical variables do not vary in degree, amount or quantity but are categorically different. Variables such as gender, eye color, religion, occupation, position and level of performance on a job are categorical.

Variables may also be classified as dependent, independent and intervening variables. The independent variable is presumed to affect or influence other variables. The dependent or outcome variable is presumed to be affected by one or more independent variables. An intervening variable is an independent variable that may have unintended effects on a dependent variable in a particular study.

Figure 1 illustrates independent, intervening, and dependent variables.

Figure 1. Independent, intervening, and dependent variables

Independent variable: Educational attainment

Intervening variables: Age, gender, civil status, length of service, socio-economic status

Dependent variable: Performance

INSTRUMENTATION

The collection of data is an extremely important part of any type of research, for the conclusions of a study are based on what the data show. Thus, the kind of data to be gathered, the method to be used in the gathering of data, and the treatment of the data need to be considered carefully. The success and usefulness of the result of the study will depend much on the accuracy and reliability of the data. Bear in mind that no statistical treatment can make unreliable data correct.

Data refers to the kinds of information researchers obtain on the subjects of their research. Data can be classified according to source and according to form.

TYPES OF DATA ACCORDING TO SOURCE

1. Primary data. These are data that are gathered directly from the respondents of the study through observation, interview, questionnaire, experiment, or measurement.

2. Secondary data. These are data that have been previously gathered and compiled, and are made available to the researcher for analysis. These include books, journals, records, reports and other publications.

TYPES OF DATA ACCORDING TO FORM

1. Quantitative data. These are data that are measured on a scale.

2. Qualitative or Categorical data. These are observations that can be classified into a single category or a set of categories.

An important decision every researcher makes during the planning stage of an investigation is the selection of the kind of data to collect. The device (such as a pencil-and-paper test, a questionnaire, or a rating scale) the researcher uses to gather data is called an instrument. The whole process of gathering data is known as instrumentation.

DATA COLLECTION METHODS

Observation. Observation is one of the earliest methods for acquiring knowledge. In this method, the researcher watches closely the overt behaviors of the subjects under investigation in various natural settings. Observation may be done by actual participation, which allows the researcher to gain a detailed and comprehensive picture of the respondents. This is known as participant observation. However, the researcher is cautioned not to get emotionally involved with the group, for the study may lose its objectivity.

It would be advantageous for the study that the respondents are not aware that they are being observed so that they will behave naturally. This kind of observation is known as non-participant observation.

Observation may also be classified as structured or unstructured. In structured observation, the researcher makes use of an observation guide that limits the focus of the observations to aspects of behavior, activities or events relevant to the research problem. Unstructured observation is open and flexible because the researcher does not restrict the activity to an observation guide. This gives the researcher an opportunity to modify the objectives of the study as more data about the research problem are gathered.

Interview. The interview is a method of personal communication between the researcher and the respondents. This method provides consistent and precise information to the researcher because the respondent may clarify the information. It is probably the most effective way there is to enlist the cooperation of the respondents.

TYPES OF INTERVIEW

1. Structured interview. This type of interview uses a research instrument called interview schedule. An interview schedule is made up of carefully prepared and logically ordered questions.

2. Unstructured interview. This type of interview is open and flexible. The contents, sequence and wordings of the questions are up to the researcher who makes use of an interview guide which is the listing of topics that will be taken up during the interview process.

Questionnaire. In this method the subject responds to the questions by writing or, more commonly, marking an answer sheet. The advantage of questionnaires is that they can be mailed or given to large numbers of people at the same time. The disadvantages are that unclear or seemingly ambiguous questions cannot be clarified, and the respondents have no chance to expand on, or react verbally to, a question of particular interest or importance.

MEASUREMENT SCALES

Different variables call for different types of analysis and measurement, which requires the use of different measurement scales. Measurement scales are ways of assigning numerals to variables. There are four types of measurement scales: nominal, ordinal, interval and ratio.

Nominal scale. This scale is the simplest and most limited form of measurement researchers can use. It is merely used to differentiate categories in order to show differences. For example, the respondents under study may be grouped according to their gender, and the researcher may then assign the number 1 to females and the number 2 to males. Since these numbers are simply used for identification purposes, there is no implication that the males (assigned number 2) are more of anything than the females (assigned number 1).

Ordinal scale. An ordinal scale is one in which data are not only classified but also ordered in some way, such as high to low or least to most. For instance, a researcher might rank-order teachers' performance ratings from high to low. Notice, however, that the difference in ratings or in actual performance between the first- and second-ranked teachers and between the third- and fourth-ranked teachers would not necessarily be the same. Ordinal scales indicate relative standing among individuals.

Interval scale. The interval scale has the attributes of the ordinal scale plus another feature: the distances between the points on the scale are equal. Examples of interval measurements are achievement test scores, mental ability scores and temperature scales. Thus, if two students got scores of 75 and 80, respectively, in an achievement test, the distance between these scores is said to be the same as the distance between the scores of two pupils who got 90 and 95. The zero point on an interval scale does not reflect a total absence of what is being measured. Thus, 0 on the Celsius scale does not indicate that the object has no temperature.

Ratio Scale. Ratio scale is similar to the interval scale only it has an actual, or true zero point which indicates a total absence of the property being measured. For example, a scale designed to measure weight would be a ratio scale, because the zero on the scale represents zero, or no weight at all.
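A small Python sketch may help show why the true zero point matters. It assumes the standard Celsius-to-Fahrenheit conversion and an approximate kilogram-to-pound factor; the temperatures and weights are arbitrary illustrative values. On an interval scale the ratio of two readings changes with the unit, so a statement such as "twice as hot" is not meaningful, while on a ratio scale the ratio is the same in any unit.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion between two interval scales for temperature."""
    return c * 9 / 5 + 32


def kg_to_lb(kg):
    """Approximate conversion between two ratio scales for weight."""
    return kg * 2.20462


low_c, high_c = 10.0, 20.0
low_f = celsius_to_fahrenheit(low_c)    # 50.0
high_f = celsius_to_fahrenheit(high_c)  # 68.0

# Interval scale: the ratio depends on the unit chosen.
ratio_celsius = high_c / low_c          # 2.0
ratio_fahrenheit = high_f / low_f       # 68 / 50 = 1.36

# Ratio scale: the ratio is the same in kilograms and in pounds.
ratio_kg = 40.0 / 20.0
ratio_lb = kg_to_lb(40.0) / kg_to_lb(20.0)

print(ratio_celsius, ratio_fahrenheit)
print(ratio_kg, ratio_lb)
```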

According to John Best, the researcher who uses statistics goes beyond the manipulation of data. He is aware that the proper application of statistical method involves answering the following questions:

1. What facts need to be gathered to provide the information necessary to answer the question or to test the hypothesis?

2. How are these observations to be selected, gathered, organized and analyzed?

3. What assumptions underlie the statistical methodology to be employed?

4. What conclusions can be validly drawn from the analysis of the data?

Research consists of careful, systematic, patient study and investigation in some field of knowledge undertaken for the purpose of discovering relationships between variables. The ultimate purpose is to obtain evidence to support or refute proposed facts or principles that may be used to explain phenomena and predict future occurrences. To conduct research, principles must be established so that the observation and description have a commonly understood meaning. Measurement is the most reliable and universally accepted process of description, assigning quantitative values to the properties of objects and events.

DESCRIPTIVE AND INFERENTIAL ANALYSIS

After instruments have been administered and data have been collected and organized, the first step in data analysis is to describe the data in summary fashion using one or more descriptive analyses.

Descriptive analysis. This type of statistical analysis limits generalization to the particular group of individuals observed. No conclusions can be made beyond this group, and any similarity to those outside the group cannot be assumed. The data describe one group and that group only.

... lead to committing a Type II error, that is, accepting a false null hypothesis instead of rejecting it.

A correlated t-test is more powerful than is an independent t-test when the subjects are truly dependent on each other. The independent t-test is more powerful when the subjects are independently selected and assigned.

Experimental research. Experimental research is one of the most powerful research methodologies researchers can use. Of the many types of research, it is the best way to establish cause-and-effect relationships between variables.

In an experimental study, researchers look at the effects of at least one independent variable on one or more dependent variables. The independent variable in experimental research is frequently referred to as the experimental or treatment variable. The dependent variable, also known as the criterion or outcome variable refers to the results or outcomes of the study.

The major characteristic of experimental research, which distinguishes it from all other types of research, is that researchers manipulate the independent variable. They decide the nature of the treatment, to whom it is to be applied, and to what extent. Independent variables frequently manipulated in educational research include methods of instruction, types of assignment, learning materials, rewards given to students, and types of questions asked by teachers.

SAMPLING

A population refers to the entire group or set of individuals or items to whom the researchers would like to generalize the results of the study. A population is further distinguished by its role in the study. Consider the following examples.

Research problem: The Effects of Multimedia Instruction on the Mathematical Achievement of first year High School Students in the Division of City Schools of Manila.

Target Population: All first year high school students in the division of City Schools of Manila.

Accessible Population: All first year high school students in the pilot high schools, Division of City Schools, Manila.

Sample: Ten percent of the first year high school students in the pilot high schools, Division of City Schools, Manila.

A sample is the group of individuals in a research study from which information is obtained and from which generalizations about the population are drawn. It is essential that researchers describe the population and the sample in sufficient detail so that others can determine the applicability of the findings to their own situations.

DETERMINING SUFFICIENT SAMPLE SIZE FROM THE POPULATION

A common practice among researchers who do not know a systematic way of determining an adequate sample size is to take an arbitrary percentage of the population. They base their sample size on a subjective decision.

To avoid this problem, we shall discuss two formulas for determining a sufficient sample size. It is up to you to choose the one you prefer when you conduct your research study.

The Slovin Formula. The Slovin formula is the formula researchers most commonly use to determine a sufficient sample size. Most researchers use it because they find it convenient to compute. The formula is indicated below:

n = N / (1 + Ne^2)

Where: n = sample size; N = population size; e = the desired margin of error, which is usually 0.05
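A minimal Python sketch of the computation, assuming the usual statement of Slovin's formula, n = N / (1 + N e^2). The function name is illustrative, and rounding to the nearest whole number is an assumption chosen because it matches the sample sizes quoted elsewhere in this text (392, 359 and 267); rounding up instead would be a more conservative choice.

```python
def slovin_sample_size(population_size, margin_of_error=0.05):
    """Sample size n = N / (1 + N * e**2), rounded to the nearest whole number."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be between 0 and 1")
    n = population_size / (1 + population_size * margin_of_error ** 2)
    return round(n)


for N in (800, 3500, 20_000):
    print(N, slovin_sample_size(N))
```

With e = 0.05, the three population sizes used in this text give 267, 359 and 392 respectively.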

Fraenkel suggests a few guidelines with regard to the minimum number of subjects needed. For descriptive studies, a sample with a minimum number of 100 is essential. For correlational studies, a sample of at least 50 is deemed necessary to establish the existence of a relationship. For experimental and causal-comparative studies, a minimum of 30 individuals per group is recommended, although sometimes experimental studies with only 15 individuals in each group can be defended if they are very tightly controlled; studies using 15 subjects per group should be replicated, however, before too much is made of any findings that occur.

SAMPLING DESIGNS

The term sampling, as used in research, refers to the process of selecting the subjects who will participate in a research study. Sampling can be classified into two basic types: non-probability and probability. In non-probability sampling, there is no way of estimating the probability that each individual or element will be included in the sample. In probability sampling, each member of the population has a known chance (in the simplest design, an equal chance) of being chosen as part of a representative sample.

PROBABILITY SAMPLING

1. Simple random sampling. Simple random sampling is the most basic and most popular sampling design. A simple random sample is one in which each and every member of the population has an equal and independent chance of being chosen. As such, it is considered the best sampling design.

In this design, samples are picked either by the use of a table of random numbers or by lottery techniques. The table of random numbers is an extremely large list of numbers that has no order or pattern. Such lists can be found in the appendices of most statistics books. It uses columns or rows of numerical digits that were mechanically generated. In choosing sample units, the number of digits read from the table should correspond to the number of digits in the population size. For example, to obtain a sample of 200 from a population of 2000 individuals, select a column of numbers, start anywhere in the column, and begin reading four-digit numbers. This is the case since the final number, 2000, consists of four digits. Use the same number of digits for each individual. Individual 1 would be known as 0001, individual 2 as 0002, and so forth. The researcher would then proceed to write down the first 200 numbers in the column that have a value of 2000 or less.

In the lottery technique, each population unit is assigned a number that is written on a slip of paper. The slips, which are physically identical, are put into a bowl or a box and mixed thoroughly. Then the samples are drawn one at a time until the desired sample size is reached.
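Both techniques amount to drawing distinct ID numbers at random. The following is a minimal Python sketch of the 200-from-2000 example, in which a pseudo-random number generator stands in for the printed table or the bowl of slips; the function name and the seed are illustrative assumptions.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct ID numbers from 1..population_size."""
    rng = random.Random(seed)
    # sample() draws without replacement, so no individual is picked twice.
    return rng.sample(range(1, population_size + 1), sample_size)


chosen = simple_random_sample(2000, 200, seed=42)

# Show the first few IDs zero-padded to four digits, as in the table method.
print([f"{i:04d}" for i in sorted(chosen)[:5]])
```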

2. Systematic sampling. This is a modified form of simple random sampling. It involves selecting every kth element in the population until the desired number of samples is obtained. The value of k (the sampling interval) is determined by dividing the population size (N) by the sample size (n). The quotient is then rounded up to the next integer. For example, in a population list of 5000 names, to select a sample of 370, a researcher would compute k = 5000/370 = 13.51, round it up to 14, and select every 14th name on the list until a total of 370 names were chosen.

There is a danger in systematic sampling that is sometimes overlooked. If the population has been ordered systematically, that is, if the individuals are arranged in some sort of pattern that accidentally coincides with the sampling interval, a markedly biased sample can result. When planning to select a sample from a list of some sort, researchers should carefully examine the list to make sure there is no cyclical pattern present.
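The interval calculation and the selection can be sketched in Python as follows. The random starting position and the wrap-around to the top of the list are assumptions added for this sketch: a single pass through 5000 names at an interval of 14 yields only about 357 names, so the list is treated as circular in order to reach 370.

```python
import math
import random


def systematic_sample(population_size, sample_size, seed=None):
    """Return the interval k and the selected 1-based list positions."""
    k = math.ceil(population_size / sample_size)  # round up to the next integer
    rng = random.Random(seed)
    start = rng.randrange(population_size)        # random starting position
    # Take every kth position, wrapping around the end of the list if needed.
    positions = [(start + i * k) % population_size + 1
                 for i in range(sample_size)]
    return k, positions


k, positions = systematic_sample(5000, 370, seed=1)
print(k, len(positions), len(set(positions)))
```

For these figures the wrap-around never revisits a name, but for other combinations of N and k it is worth checking that the selected positions are all distinct.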

3. Stratified Random Sampling. This is the process of subdividing the population into subgroups or strata and drawing members at random from each subgroup or stratum in the same proportion as they exist in the population. Suppose a researcher wants to find out the effectiveness of the new Mathematics 1 worktext a certain school district is considering adopting. She intends to compare the achievement of students using the new book with that of students using the more traditional text the district has been using for years. Since she has reason to believe that gender is a variable that may affect the outcomes of her study, she decides to ensure that the proportion of males and females in the research is the same as in the population. The steps in the sampling process would be as follows:

A) She identifies the target population. The population consists of all 3500 First year public high school students in the district enrolled during the school year 1999-2000.

B) She determines the desired sample size. Using Slovin's formula, she gets a sample of 359 students from the population of 3500.

C) She identifies the strata into which the population has been subdivided.

The population has been subdivided by gender. The researcher finds out that there are 2,100 females (60 percent) and 1,400 males (40 percent) in the population.

D) She determines the number of respondents to be selected from each stratum: 60 percent of 359 = 215 females; 40 percent of 359 = 144 males.

E) Using a table of random numbers, she then randomly selects 215 females and 144 males from the population.

The advantage of stratified random sampling is that it increases the likelihood of representativeness, especially if one's sample is not very large. It ensures that the key characteristics of the respondents in the population are present in the same proportions in the sample.
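The allocation in step D and the selection in step E can be sketched in Python. The function names and the student ID lists are hypothetical stand-ins for real enrolment records. Rounding each stratum separately happens to sum to 359 here, but in general the rounded counts may need a small adjustment to match the target sample size.

```python
import random


def proportional_allocation(strata_sizes, sample_size):
    """Allocate the sample to strata in proportion to their population sizes."""
    total = sum(strata_sizes.values())
    return {name: round(sample_size * size / total)
            for name, size in strata_sizes.items()}


def stratified_sample(strata_members, sample_size, seed=None):
    """Randomly select the allocated number of members from each stratum."""
    rng = random.Random(seed)
    sizes = {name: len(members) for name, members in strata_members.items()}
    allocation = proportional_allocation(sizes, sample_size)
    return {name: rng.sample(members, allocation[name])
            for name, members in strata_members.items()}


allocation = proportional_allocation({"female": 2100, "male": 1400}, 359)
print(allocation)  # {'female': 215, 'male': 144}

# Hypothetical student ID lists: 2,100 females and 1,400 males.
strata = {"female": list(range(1, 2101)), "male": list(range(2101, 3501))}
sample = stratified_sample(strata, 359, seed=7)
print(len(sample["female"]), len(sample["male"]))
```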

4. Cluster random sampling. The selection of groups, or clusters, of subjects rather than individuals is known as cluster sampling. This sampling design is used when the population is very large and spread out over a wide geographical area. Just as simple random sampling is more effective with a larger number of subjects, so is cluster random sampling more effective with a larger number of clusters.

The following steps show how to select clusters of mathematics teachers in a particular Division of City Schools:

A) Identify the population. The population, for example, consists of all the 800 mathematics teachers in 36 schools in a given division.

B) Determine the sample size. A sample of 267 teachers will be selected from the population.

C) Identify the logical cluster for the given population. The logical cluster for the population would be by school.

D) Obtain a list of the clusters comprising the population. A list of all the schools in the given city is needed.

E) Estimate the average number of members per cluster in the population. Let us say that the average number of teachers per school is 14.

F) Determine the number of clusters to be selected from the population by dividing the required sample size by the average number of members per cluster in the population. In the example, the desired sample size is 267 and there is an average of 14 teachers per school. Hence the number of clusters needed is 267/14 = 19.07, or 19.

G) Randomly select the needed number of clusters. Using simple random sampling, 19 schools shall be selected from the population list of 36 schools.

H) Include all the members in the selected clusters. All the teachers in each of the 19 schools selected will comprise the desired sample.

The advantages of cluster sampling are that it can be used when it is difficult or impossible to select a random sample of individuals, it is often far easier to implement, and it is frequently less time-consuming. Its disadvantage is that there is a far greater chance of selecting a sample that is not representative of the population.
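Steps F and G can be sketched in Python as follows. The school ID numbers are hypothetical placeholders for the actual list of 36 schools, and rounding the quotient to the nearest whole number is assumed because it reproduces the 19 clusters of the example.

```python
import random


def select_clusters(cluster_ids, sample_size, avg_cluster_size, seed=None):
    """Pick round(sample_size / avg_cluster_size) clusters at random."""
    n_clusters = round(sample_size / avg_cluster_size)
    rng = random.Random(seed)
    return rng.sample(cluster_ids, n_clusters)


schools = list(range(1, 37))  # hypothetical IDs for the 36 schools
selected_schools = select_clusters(schools, 267, 14, seed=3)
print(len(selected_schools))  # 19
```

Because whole schools are taken, the final number of teachers depends on the actual sizes of the selected schools and will only approximate 267.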

5. Multi-Stage Sampling. This design is an extended version of cluster sampling. It involves several stages in drawing the representative sample from the population. The population units are grouped into a hierarchy of elements, and sampling is done successively. For example, in a nationwide research, regions are selected at the first stage, provinces at the second stage, cities and municipalities at the third stage, barangays of the sample cities and municipalities at the fourth stage, and finally individual respondents within the selected barangays at the fifth stage. At each stage, simple random, systematic, or stratified designs may be utilized.

NON-PROBABILITY SAMPLING

1. Convenience sampling. This design is resorted to when it is extremely difficult to select a random sample. Thus a researcher simply takes the closest persons (a convenience sample) who are available for study. For example, if a researcher is interested in finding out how jeepney drivers feel about the oil deregulation law, she just goes to the nearest jeepney station and interviews the first 50 jeepney drivers who are there. Convenience samples cannot be considered representative of any population and should be avoided. However, if they are the only choice a researcher has, the study should be replicated with a number of similar samples to decrease the likelihood that the results obtained were simply a one-time occurrence.

2. Purposive Sampling. This design is also known as judgmental sampling. A purposive sample is a sample selected because the individuals have special qualifications of some sort. Usually, this set of qualifications meets the purposes of the researcher's study. For example, a researcher is interested in the changes in the sexual behavior of people in middle life. She defines a middle-aged person as one whose age is in the range of 40 to 55 years. Whoever qualifies and is available is taken until the desired sample size is attained.

3. Snowball sampling. This design requires identification of a few persons whose qualifications meet the purposes of the study. These persons serve as informants, leading the researchers to other individuals who qualify for inclusion in the sample and who, in turn, lead to more persons who can be interviewed. This process goes on until the desired number of respondents is obtained.

THE FREQUENCY DISTRIBUTION

Data may be arranged alphabetically, chronologically, in ranked form or by using arrays. The choice of arrangement depends on the purpose/s of the researcher.

The aforementioned ways of organizing data, however, would neither yield much information nor give the reader a clear picture of the whole situation. They would just make the task of treating the data statistically more convenient.

UNGROUPED DATA DISTRIBUTIONS

The list of test scores in a teacher's class record provides an example of ungrouped data. Since the usual method of listing is alphabetical, the scores are difficult to interpret without some other type of organization.

Example:

Alvarez, Oliver 68

Ballesteros, Antonio 79

Dacanay, Alfredo 98

Salinas, Romel 70

Wenceslao, Paulino 88

The Array. Arranging the same set of scores in descending or ascending order of magnitude produces what is known as an array.

The array provides a more convenient arrangement. In the above example, the highest score (98), the lowest score (68) and the middle score (79) are easily identified. The range (the difference between the highest score and the lowest score) can easily be determined.
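As a minimal Python sketch, the five scores from the class record can be arranged into an array, after which the highest, lowest and middle scores and the range follow directly. The variable names are illustrative.

```python
from statistics import median

# Scores from the alphabetical class record in the example.
scores = {
    "Alvarez, Oliver": 68,
    "Ballesteros, Antonio": 79,
    "Dacanay, Alfredo": 98,
    "Salinas, Romel": 70,
    "Wenceslao, Paulino": 88,
}

array = sorted(scores.values(), reverse=True)  # descending order of magnitude
highest, lowest = array[0], array[-1]
middle = median(array)
score_range = highest - lowest

print(array)                                 # [98, 88, 79, 70, 68]
print(highest, lowest, middle, score_range)  # 98 68 79 30
```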

GROUPED DATA DISTRIBUTIONS

An array may help make the overall pattern of data apparent. However, if the number of scores is large, construction of the array may have to be done on a computer, and even then, the array may turn out to be so large that it is difficult to understand.

Data are more clearly presented when scores are grouped or presented in a frequency distribution. A frequency distribution is a tabular summary of a set of data that shows the frequency or number of data items that fall in each of several distinct classes. To construct a frequency distribution, study the test scores of students presented below and then follow the steps specified below the table.

THE SCORES OF 50 STUDENTS IN THE ENGLISH 1 MIDTERM EXAM

29 25 23 20 18 17 15 13 10 9

28 24 21 20 18 16 15 12 10 9

27 24 20 19 18 16 15 12 9 8

26 23 20 19 17 16 14 10 9 8

26 23 20 19 17 16 14 10 9 6

1. Find the range. Range = Highest score – lowest score; Range = 29 – 6 = 23

2. Determine the tentative or approximate number of classes (k) using k = 1 + 3.322 log N, where N is the number of scores and log is the common (base-10) logarithm. Here, k = 1 + 3.322 log 50 = 6.64, so k = 7.

Note: always round the result up to the next integer.

3. Determine the approximate size of the class interval (c). c = range/k = 23/7 = 3.29, so c = 4.

Note: always round the quotient up to the next integer.

4. Write the class intervals starting with the lowest score. Stop when the class already includes the highest score.

5. Determine the class frequency for each class interval by referring to the tally column and present the results in a table.
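The five steps above can be sketched in Python using the 50 midterm scores. This is only one way of carrying out the procedure; the variable names are illustrative, and simple counting replaces the tally column.

```python
import math

scores = [
    29, 25, 23, 20, 18, 17, 15, 13, 10, 9,
    28, 24, 21, 20, 18, 16, 15, 12, 10, 9,
    27, 24, 20, 19, 18, 16, 15, 12, 9, 8,
    26, 23, 20, 19, 17, 16, 14, 10, 9, 8,
    26, 23, 20, 19, 17, 16, 14, 10, 9, 6,
]

score_range = max(scores) - min(scores)             # step 1: 29 - 6 = 23
k = math.ceil(1 + 3.322 * math.log10(len(scores)))  # step 2: 6.64 -> 7
c = math.ceil(score_range / k)                      # step 3: 3.29 -> 4

# Step 4: build class intervals from the lowest score upward,
# stopping once a class includes the highest score.
classes = []
lower = min(scores)
while lower <= max(scores):
    classes.append((lower, lower + c - 1))
    lower += c

# Step 5: count the scores that fall in each class.
frequencies = [sum(lo <= s <= hi for s in scores) for lo, hi in classes]

for (lo, hi), f in zip(classes, frequencies):
    print(f"{lo:2d} - {hi:2d}   {f}")
```

With these scores the procedure ends with six classes (6 - 9 through 26 - 29) rather than seven, which is consistent with k being only a tentative number of classes.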

Weights (in kg) of 40 Grade 1 Pupils in a Public School

Class Limits (kg)   Class Boundaries   Class Mark   Class Frequency   Relative Frequency   Percentage of Observations
89 – 93             88.5 – 93.5        91           3                 0.075                7.5
84 – 88             83.5 – 88.5        86           5                 0.125                12.5
79 – 83             78.5 – 83.5        81           11                0.275                27.5
74 – 78             73.5 – 78.5        76           12                0.300                30.0
69 – 73             68.5 – 73.5        71           5                 0.125                12.5
64 – 68             63.5 – 68.5        66           3                 0.075                7.5
59 – 63             58.5 – 63.5        61           1                 0.025                2.5

The lowest and highest values that can fit in a class are called the lower class limit and upper class limit, respectively. In the class 59 – 63 of the table, 59 is the lower class limit and 63 is the upper class limit. The difference between the lower class limit of one class and the lower class limit of the next class is the class width. Each class should have the same class width or interval (c), although it is not uncommon to see the first or the last class with a width a little longer or shorter than the others. The center of the class is called the class mark or midpoint. The midpoint is computed by getting one-half of the sum of the lower and the upper class limits of one class. For example: Class mark (midpoint) = (59 + 63)/2 = 61.

The class boundaries are the true limits of a class, defined by a lower boundary and an upper boundary. Each class boundary equals the number midway between the upper limit of a class and the lower limit of the next higher class. A relative frequency distribution indicates the proportion of the total number of observations that occurs in each interval. That is, relative frequency = f/n, where f is the class frequency and n is the total number of observations.
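The class mark, class boundaries and relative frequency columns of the weights table can be recomputed with a short Python sketch. The class limits and frequencies are taken from the table; the half-unit adjustment for the boundaries assumes whole-number class limits, so that adjacent classes are one unit apart.

```python
# (lower limit, upper limit, frequency) for each class in the weights table.
table = [
    (89, 93, 3), (84, 88, 5), (79, 83, 11), (74, 78, 12),
    (69, 73, 5), (64, 68, 3), (59, 63, 1),
]

n = sum(f for _, _, f in table)  # total number of observations

rows = []
for lower, upper, f in table:
    class_mark = (lower + upper) / 2
    # Each boundary lies midway between adjacent class limits.
    boundaries = (lower - 0.5, upper + 0.5)
    relative = f / n
    rows.append((lower, upper, boundaries, class_mark, f, relative,
                 100 * relative))

for lower, upper, (lb, ub), mark, f, rel, pct in rows:
    print(f"{lower}-{upper}  {lb}-{ub}  {mark:g}  {f}  {rel:.3f}  {pct:.1f}")
```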

OC GRADUATE SCHOOL

Sunday, June 23, 2013
