- CHAPTER A: BASICS OF STATISTICS
- CHAPTER B: GRAPHICAL DESCRIPTIVE TECHNIQUES
- CHAPTER C: GRAPHICAL PRESENTATIONS
- CHAPTER D: NUMERICAL DESCRIPTIVE TECHNIQUES
- CHAPTER E: PROBABILITY
- CHAPTER F: DISCRETE PROBABILITY DISTRIBUTIONS
- CHAPTER G: CONTINUOUS PROBABILITY DISTRIBUTIONS
- CHAPTER H: SAMPLING DISTRIBUTIONS
- CHAPTER I: ESTIMATION
- CHAPTER J: HYPOTHESIS TESTING
- CHAPTER K: HOW TO MAKE INFERENCES ABOUT A POPULATION
- CHAPTER L: HOW TO MAKE INFERENCES ABOUT COMPARING TWO POPULATIONS
- CHAPTER M: STATISTICAL TECHNIQUES INVOLVING NOMINAL DATA
- CHAPTER N: REGRESSION AND CORRELATION
- CHAPTER O: NONPARAMETRIC STATISTICS
CHAPTER A: BASICS OF STATISTICS
Statistics is a way to get information from data. There are two main branches of statistics:
Descriptive statistics, which are concerned with methods of organizing, summarizing, and presenting data in a convenient and informative way. Descriptive statistics make use of graphical and numerical techniques to summarize and present data in a clear way. The actual technique used depends on what specific information needs to be extracted.
Inferential statistics, which is a body of methods used to draw conclusions or inferences about characteristics of a population based on sample data (because a sample is only a small fraction of the population, such inferences can be correct only a certain percentage of the time).
A.1 Key statistical concepts
Statistical inference problems involve three key concepts:
A population is the group of all items of interest to a researcher (note: population does not necessarily refer to a group of people). It is frequently very large and may, in fact, be infinitely large. A descriptive measure of a population is called a parameter. In most applications of inferential statistics the parameter represents the information which is needed.
A sample is a set of data drawn from the population. A descriptive measure of a sample is called a statistic. Statistics are used to make inferences about parameters.
Statistical inference is the process of making an estimate, prediction, or decision about a population based on sample data. In statistical inference there are two measures of reliability:
the confidence level, which is the proportion of times that an estimating procedure will be correct; and
the significance level, which measures how frequently the conclusion will be wrong in the long run.
CHAPTER B: GRAPHICAL DESCRIPTIVE TECHNIQUES
B.1 Types of data
Some basic terms related to the concept of data:
A variable is some characteristic of a population or sample. The name of the variable is usually represented using upper case letters such as X, Y, and Z.
The values of the variable are the possible observations of the variable.
Data are the observed values of a variable.
There are three types of data:
Interval data are real numbers (for instance, incomes and distances). This type of data is also referred to as quantitative or numerical.
The values of nominal data are categories. For instance, answers to questions about marital status produce nominal data. The values are not numbers but instead are words describing the categories. Nominal data are also called qualitative or categorical.
Ordinal data appear to be nominal, but their values are in order. Because the only constraint that is imposed on the choice of codes is that the order must be maintained, any set of codes that are in order can be used.
The critical difference between these three types of data is that the intervals or differences between values of interval data are consistent and meaningful. For instance, the difference between grades of 10 and 8 is the same two-grade difference that exists between 8 and 6. Thus, a researcher can calculate the difference and interpret the results. Because the codes representing ordinal data are arbitrarily assigned except for the order, a researcher cannot calculate and interpret differences.
All calculations are permitted on interval data. A set of interval data is often described by calculating the average. No calculations can be performed on the codes of nominal data, because these codes are completely arbitrary. Thus, calculations based on the codes used to store nominal data are meaningless. All that a researcher can do with nominal data is count the occurrences of each category. The only permissible calculations on ordinal data are ones involving a ranking process.
The data types can be placed in order of the permissible calculations. At the top of the list is the interval data type (because virtually all computations are allowed). At the bottom of the list is the nominal data type (because no calculations other than determining frequencies are permitted). In between interval and nominal data lies the ordinal data type. Note: higher-level data types may be treated as lower-level ones. For instance, in universities the grades in a course (interval data) can be converted to letter grades (ordinal data). Lower-level data types cannot be treated as higher-level types.
The variables whose observations constitute the data are given the same name as the type of data. Thus, for instance, nominal data are the observations of a nominal variable.
B.2 Graphical techniques to describe nominal data
The only allowable calculation on nominal data is to count the frequency of each value of the variable. The data can be summarized in a table that presents the categories and their counts called a frequency distribution. A relative frequency distribution lists the categories and the proportion with which each occurs. There are two graphical methods which can be used to present a picture of the data:
A bar chart, which is often used to display frequencies; and
a pie chart, which graphically shows relative frequencies.
A bar chart is created by drawing a rectangle representing each category. The height of the rectangle represents the frequency and its base is arbitrary. A pie chart is simply a circle subdivided into slices that represent the categories. It is drawn so that the size of each slice is proportional to the percentage corresponding to that category.
B.3 Graphical techniques to describe interval data
There are several graphical methods which are used when the data are interval. The most important of these graphical methods is the histogram – it can be used to summarize interval data or to help explain an important aspect of probability.
A frequency distribution for interval data is created by counting the number of observations that fall into each of a series of intervals (classes) that cover the complete range of observations.
Although the frequency distribution provides information about how the numbers are distributed, the information is more easily understood and imparted by drawing a picture or graph. The graph is called a histogram. A histogram is created by drawing rectangles whose bases are the intervals and whose heights are the frequencies.
The number of class intervals selected depends entirely on the number of observations in the data set. The more observations available, the larger the number of class intervals needed to draw a useful histogram. For instance, for fewer than 50 observations, a researcher would normally create between 5 and 7 classes; for more than 50,000 observations, a researcher would normally use between 17 and 20 classes. (More detailed guidelines are presented in Table 2.6 on page 35.)
Alternatively, a researcher can use Sturges' formula, which recommends that the number of class intervals be determined by the following:
Number of class intervals = 1 + 3.3 log(n)
For instance, if n = 100, number of class intervals = 1 + 3.3 log(100) = 1 + 3.3(2) = 7.6 (which is rounded to 8).
The approximate width of the classes is determined by subtracting the smallest observation from the largest and dividing the difference by the number of classes. Thus,
Class width = (Largest observation – Smallest observation) / Number of classes
The result is often rounded to some convenient value. Consequently, the class limits are defined by selecting a lower limit for the first class from which all other limits are determined. The only condition to apply is that the first class interval must contain the smallest observation.
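To make the arithmetic concrete, here is a minimal Python sketch of both calculations on a small hypothetical data set (the observations are invented for illustration):

    import math

    observations = [22.3, 41.1, 18.7, 35.0, 27.9, 30.2, 44.6, 25.4]  # hypothetical data

    n = len(observations)
    num_classes = round(1 + 3.3 * math.log10(n))   # Sturges' formula
    class_width = (max(observations) - min(observations)) / num_classes
    print(num_classes, class_width)                # 4 classes of width about 6.5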
The shape of histograms is described on the basis of the following characteristics:
Symmetry - A histogram is said to be symmetric if, when a vertical line down the center of the histogram is drawn, the two sides are identical in shape and size.
Skewness - A skewed histogram is one with a long tail extending to either the right or the left. The one which extends to the right is called positively skewed, and the one which extends to the left is called negatively skewed.
Number of modal classes - A mode is the observation that occurs with the greatest frequency. A modal class is the class with the largest number of observations. A unimodal histogram is one with a single peak. A bimodal histogram is one with two peaks, not necessarily equal in height. Bimodal histograms often indicate that two different distributions are present.
Bell shape - A special type of symmetric unimodal histogram is one that is bell shaped (such as the one presented in Figure 2.10 on page 37).
One of the drawbacks of the histogram is that potentially useful information can be lost by classifying the observations. A stem-and-leaf display is a method which partially overcomes this loss. The first step in developing a stem-and-leaf display is to split each observation into two parts, a stem and a leaf. There are several different ways of doing this. For instance, the number 15.6 can be split so that the stem is 15 and the leaf is 6. In this definition the stem consists of the digits to the left of the decimal and the leaf is the digit to the right of the decimal. Another method can define the stem as 1 and the leaf as 5. In this definition the stem is the number of tens and the leaf is the number of ones. The stem-and-leaf display is similar to a histogram turned on its side. The length of each line represents the frequency in the class interval defined by the stems. The advantage of the stem-and-leaf display over the histogram is that the actual observations can be seen.
The frequency distribution lists the number of observations that fall into each class interval. A relative frequency distribution can also be created by dividing the frequencies by the number of observations.
The relative frequency distribution highlights the proportion of the observations that fall into each class. In some situations a researcher may want to determine the proportion of observations that fall below each of the class limits. In such cases he or she needs to create a cumulative relative frequency distribution. Another way of presenting this information is the ogive, which is a graphical representation of the cumulative relative frequencies.
B.4 Describe time-series data
Data can be also classified in the following way:
cross-sectional data, which are observations measured at the same point in time;
time-series data, which represent measurements at successive points in time.
Time-series data are often graphically depicted on a line chart, which is a plot of the variable over time. It is created by plotting the value of the variable on the vertical axis and the time periods on the horizontal axis.
B.5 Describe the relationship between two nominal variables and compare two or more nominal data sets
Techniques applied to single sets of data are called univariate. When a researcher wants to depict the relationship between variables, bivariate methods are required.
A cross-classification table (also called a cross-tabulation table) is used to describe the relationship between two nominal variables. There are several ways to store the data to be used to produce a table and/or a bar or pie chart.
The data are in two columns where the first column represents the categories of the first nominal variable and the second column stores the categories for the second variable. Each row represents one observation of the two variables. The number of observations in each column must be the same.
The data are stored in two or more columns where each column represents the same variable in a different sample or population. To produce a cross-classification table, the number of observations of each category in each column has to be counted.
B.6 Describe relationship between two interval variables
Researchers often need to know how two interval variables are related. The technique used to describe the relationship between such variables is called a scatter diagram. To draw a scatter diagram a researcher needs data for two variables. In applications where one variable depends to some degree on the other variable, the dependent variable is labeled Y and the other, called the independent variable, is labeled X. In other cases, where there is no dependency evident, the variables are labeled arbitrarily.
To determine the strength of the linear relationship a researcher needs to draw a straight line through the points in such a way that the line represents the relationship. If most of the points fall close to the line it can be said that there is a linear relationship.
If most of the points appear to be scattered randomly with only an impression of a straight line, there is no, or at best, a weak linear relationship (however, there may be some other type of relationship, e.g. a quadratic or exponential one).
Usually, when one variable increases and the other variable also increases, it can be said that there is a positive linear relationship. When the two variables tend to move in opposite directions, the nature of their association is described as a negative linear relationship.
However, if two variables are linearly related, it does not mean that one is causing the other. In fact, it can never be concluded that one variable causes another variable. Thus, correlation is not causation.
CHAPTER C: GRAPHICAL PRESENTATIONS
C.1 Graphical excellence
Graphical excellence is a term which applies to techniques that are informative and concise and that communicate information clearly to their viewers.
Graphical excellence is achieved when the following characteristics apply:
The graph presents large data sets concisely and coherently.
The ideas and concepts the researcher wants to deliver are clearly understood by the viewer.
The graph encourages the viewer to compare two or more variables.
The display induces the viewer to address the substance of the data and not the form of the graph.
There is no distortion of what the data reveal.
C.2 Graphical deception
Researchers should be aware of possible methods of graphical deception. Firstly, a researcher should be careful about graphs without a scale on one axis. Secondly, he or she should also avoid being influenced by a graph's caption. Additionally, perspective is often distorted if only absolute changes in value, rather than percentage changes, are reported.

For instance, 15% growth in revenues can be made to appear more dramatic by stretching the vertical axis – a technique that involves changing the scale on the vertical axis so that a given euro amount is represented by a greater height than before. As a result, the rise in revenues appears to be greater, because the slope of the graph is visually (but not numerically) steeper. The expanded scale is usually accommodated by employing a break in the vertical axis or by shortening the vertical axis so that the vertical scale starts at a point greater than zero. The effect of making slopes appear steeper can also be created by shrinking the horizontal axis, in which case points on the horizontal axis are moved closer together. Just the opposite effect is obtained by stretching the horizontal axis – spreading out the points on the horizontal axis to increase the distance between them so that slopes and trends appear less steep.

Similar illusions can be created with bar charts by stretching or shrinking the vertical or horizontal axis. Another method used to create distorted impressions with bar charts is to construct the bars so that their widths are proportional to their heights.
Lastly, a researcher should also be careful about size distortions, particularly in pictograms, which replace the bars with pictures of objects to enhance the visual appeal.
In general, preparation of a statistical report should contain the following:
Statement of objectives;
Description of the experiment;
Description of the results; and
Discussion of limitations of the statistical techniques.
CHAPTER D: NUMERICAL DESCRIPTIVE TECHNIQUES
D.1 Central location
Measures of central location are the following:
mean,
median, and
mode.
The arithmetic mean is also referred to as the mean or average. The mean is computed by summing the observations and dividing by the number of observations. The observations in a sample are labeled x1, x2, …, xn, where x1 is the first observation, x2 is the second, and so on until xn, where n is the sample size. The sample mean is denoted x̄. In a population, the number of observations is labeled N and the population mean is denoted by μ.
Population mean: μ = Σ xi / N
Sample mean: x̄ = Σ xi / n
The median is calculated by placing all the observations in order (ascending or descending). The observation that falls in the middle is the median. However, when there is an even number of observations, the median is determined by averaging the two observations in the middle. The sample and population medians are computed in the same way. When there is a relatively small number of extreme observations (either very small or very large), the median usually produces a better measure of the center location than the mean.
The mode is the observation that occurs with the greatest frequency. Both the statistic (sample mode) and parameter (population mode) are computed in the same way. However, for populations and large samples, it is preferable to report the modal class. There are two main disadvantages of using the mode as a measure of central location: in a small sample it may not be a very good measure and it may not be unique.
When the data are interval, any of the three measures of central location can be used. However, for ordinal and nominal data the calculation of the mean is not valid. Because the calculation of the median begins by placing the data in order, this statistic is appropriate for ordinal data. The mode, which is determined by counting the frequency of each observation, is appropriate for nominal data.
When to compute the Mean:
Interval Data
Descriptive measurement of central location
When to compute the Median:
Ordinal data or interval data (with extreme observations)
Descriptive measurement of central location
When to compute the Mode:
Nominal, ordinal, or interval data.
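As a quick illustration, Python's standard statistics module computes all three measures; the sample below is hypothetical, with one extreme observation included to show why the median can be preferable to the mean:

    import statistics

    sample = [4, 6, 6, 7, 8, 9, 25]       # hypothetical data; 25 is an extreme observation

    print(statistics.mean(sample))        # 9.29..., pulled upward by the extreme value
    print(statistics.median(sample))      # 7, less sensitive to the extreme observation
    print(statistics.mode(sample))        # 6, the most frequently occurring value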
D.2 Variability
Measures of variability are the following:
Range,
Variance,
Standard deviation, and
Coefficient of variation.
Range = Largest observation – Smallest observation
The advantage of the range is simplicity; however, because the range is calculated from only two observations, it provides no information about the other observations.
Population variance: σ² = Σ (xi – μ)² / N
Sample variance: s² = Σ (xi – x̄)² / (n – 1)
To compute the sample variance s², the sample mean x̄ has to be calculated first. Next the difference (also called the deviation) between each observation and the mean has to be calculated. Then, the deviations are squared and summed. Finally, the sum of squared deviations is divided by n – 1.
Note: some of the deviations are positive and some are negative. When added together, the sum is 0. This will always be the case because the sum of the positive deviations will always equal the sum of the negative deviations. Consequently, the deviations are squared to avoid the “cancelling effect.”
Shortcut method for calculating sample variance:
s² = [Σ xi² – (Σ xi)² / n] / (n – 1)
The variance provides only a rough idea about the amount of variation in the data. However, this statistic is useful when comparing two or more sets of data of the same type of variable.
If the variance of one data set is larger than that of a second data set, it can be said the observations in the first set display more variation than the observations in the second set.
The standard deviation is simply the positive square root of the variance (note: the unit associated with the standard deviation is the unit of the original data set).
Population standard deviation: σ = √σ²
Sample standard deviation: s = √s²
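A short Python sketch of the sample variance and standard deviation, computing both the definitional form and the shortcut form on hypothetical data to confirm they agree:

    import math

    sample = [4, 7, 8, 11, 15]            # hypothetical data
    n = len(sample)
    mean = sum(sample) / n

    # Definitional form: sum of squared deviations divided by n - 1
    s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)

    # Shortcut form: [sum of x^2 - (sum of x)^2 / n] / (n - 1)
    s2_shortcut = (sum(x * x for x in sample) - sum(sample) ** 2 / n) / (n - 1)

    s = math.sqrt(s2)                     # sample standard deviation
    print(s2, s2_shortcut, s)             # 17.5 17.5 4.18...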
Knowing the mean and standard deviation allows a researcher to extract useful bits of information. The information depends on the shape of the histogram. If the histogram is bell shaped, the Empirical Rule can be used.
EMPIRICAL RULE
Approximately 68% of all observations fall within one standard deviation of the mean.
Approximately 95% of all observations fall within two standard deviations of the mean.
Approximately 99.7% of all observations fall within three standard deviations of the mean.
A more general interpretation of the standard deviation is derived from Chebysheff’s Theorem, which applies to all shapes of histograms.
CHEBYSHEFF’S THEOREM
The proportion of observations in any sample or population that lie within k standard deviations of the mean is at least
1 – (1/k²) for k > 1
For instance, when k = 2, Chebysheff’s Theorem states that at least three-quarters (75%) of all observations lie within two standard deviations of the mean. With k = 3, Chebysheff’s Theorem states that at least eight-ninths of all observations lie within three standard deviations of the mean.
The Empirical Rule provides approximate proportions, whereas Chebysheff’s Theorem provides lower bounds on the proportions contained in the intervals.
The coefficient of variation of a set of observations is the standard deviation of the observations divided by their mean.
Population coefficient of variation: CV= σ/μ
Sample coefficient of variation: cv = s / x̄
The measures of variability described above can be used only for interval data. None of the above can be used to describe the variability of ordinal data. There are no measures of variability for nominal data.
D.3 Relative standing and box plots
Measures of relative standing are designed to provide information about the position of particular values relative to the entire data set. The median, which is a measure of central location, is also a measure of relative standing.
The Pth percentile is the value for which P percent of the observations are less than that value and (100 – P)% are greater than that value. There are special names for the 25th, 50th, and 75th percentiles. Because these three statistics divide the set of data into quarters, these measures of relative standing are also called quartiles. The first or lower quartile is labeled Q1 and is equal to the 25th percentile. The second quartile, labeled Q2, is equal to the 50th percentile, which is also the median. The third or upper quartile, labeled Q3, is equal to the 75th percentile. Besides quartiles, percentiles can also be converted into quintiles and deciles. Quintiles divide the data into fifths, and deciles divide the data into tenths.
The following formula approximates the location of any percentile (LP is the location of the Pth percentile):
LP = (n + 1)(P / 100)
Usually, the shape of the histogram can be predicted from the quartiles. For instance, if the first and second quartiles are closer to each other than are the second and third quartiles, the histogram is positively skewed. If the first and second quartiles are farther apart than the second and third quartiles, the histogram is negatively skewed. If the distances are approximately equal, the histogram is approximately symmetric. The box plot is particularly useful in this regard.
The quartiles can be used to create another measure of variability, the interquartile range, which is expressed as follows:
Interquartile range = Q3 – Q1
The interquartile range measures the spread of the middle 50% of the observations. Large values of this statistic mean that the first and third quartiles are far apart, indicating a high level of variability.
The box plot, another graphical technique, graphs five statistics, the minimum and maximum observations, and the first, second, and third quartiles. It also depicts other features of a set of data. The box plot is particularly useful when comparing two or more data sets. An example of the box plot is presented in the Figure 4.1 on page 119. The three vertical lines of the box are the first, second, and third quartiles. The lines extending to the left and right are called whiskers. Any points that lie outside the whiskers are called outliers. The whiskers extend outward to the smaller of 1.5 times the interquartile range or to the most extreme point that is not an outlier.
Outliers are unusually large or small observations. Because an outlier is considerably removed from the main body of the data set, its validity is suspect.
Consequently, outliers should be checked to determine that they are not the result of an error in recording their values. Outliers can also represent unusual observations that should be investigated.
Because the measures of relative standing are computed by ordering the data, these statistics are appropriate for ordinal as well as for interval data. Furthermore, because the interquartile range is calculated by taking the difference between the upper and lower quartiles, it too can be employed to measure the variability of ordinal data.
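The sketch below ties these ideas together in Python: it approximates percentile locations with LP = (n + 1)(P/100), interpolates between adjacent observations when LP is not a whole number (an assumption of this sketch), and reports the quartiles and interquartile range for a hypothetical ordered data set:

    data = sorted([5, 8, 9, 12, 13, 15, 17, 20, 22])   # hypothetical data, n = 9

    def percentile(data, p):
        loc = (len(data) + 1) * p / 100    # L_P = (n + 1)(P / 100)
        whole = int(loc)                   # position of the observation just below
        frac = loc - whole                 # fractional part used for interpolation
        if whole >= len(data):
            return data[-1]
        return data[whole - 1] + frac * (data[whole] - data[whole - 1])

    q1, q2, q3 = percentile(data, 25), percentile(data, 50), percentile(data, 75)
    print(q1, q2, q3)                      # 8.5 13 18.5
    print(q3 - q1)                         # interquartile range: 10.0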
D.4 Linear relationship
The measures of linear relationship are the following:
covariance,
coefficient of correlation, and
coefficient of determination.
Population covariance: σxy = Σ (xi – μx)(yi – μy) / N
Sample covariance: sxy = Σ (xi – x̄)(yi – ȳ) / (n – 1)
A shortcut calculation for sample covariance is:
sxy = [Σ xiyi – (Σ xi)(Σ yi) / n] / (n – 1)
Let's assume that as x increases so does y. When x is larger than its mean, y is at least as large as its mean. Thus (xi – x̄) and (yi – ȳ) have the same sign or are 0. Their product is also positive or 0. Consequently, the covariance is a positive number. Usually, when two variables move in the same direction (both either increase or decrease), the covariance will be a large positive number.
Let's assume that as x increases, y decreases. When x is larger than its mean, y is less than or equal to its mean. As a result, when (xi – x̄) is positive, (yi – ȳ) is negative or 0. Their products are either negative or 0. It follows that the covariance is a negative number. Generally, when two variables move in opposite directions, the covariance is a large negative number.
Let's assume that as x increases, y does not exhibit any particular direction. Some of the products (xi – x̄)(yi – ȳ) are positive, some are negative, and some are 0, so they tend to offset one another. The resulting covariance is a small number. In general, when there is no particular pattern, the covariance is a small number.
The coefficient of correlation is defined as the covariance divided by the standard deviation of the variables.
Population coefficient of correlation: ρ = σxy / (σx σy)
Sample coefficient of correlation: r = sxy / (sx sy)
The coefficient of correlation has a set lower and upper limit. The limits are -1 and +1, respectively. That is,
–1 ≤ r ≤ +1 and –1 ≤ ρ ≤ +1
When the coefficient of correlation equals –1, there is a perfect negative linear relationship and the scatter diagram exhibits a straight line.
When the coefficient of correlation equals +1, there is a perfect positive linear relationship.
When the coefficient of correlation equals 0, there is no linear relationship.
All other values of correlation are judged in relation to these three values.
The drawback to the coefficient of correlation is that except for the three values -1, 0, and +1, researchers cannot interpret the correlation.
The least squares method produces a straight line drawn through the points so that the sum of squared deviations between the points and the line is minimized. The line is represented by the equation:
ŷ = b0 + b1x
where b0 is the y-intercept (where the line intercepts the y-axis), b1 is the slope (defined as rise/run), and ŷ is the value of y determined by the line. The coefficients b0 and b1 are derived by minimizing the sum of squared deviations:
Minimize Σ (yi – ŷi)²
Least squares line coefficients:
b1 = sxy / sx²
b0 = ȳ – b1x̄
The coefficient of determination measures the amount of variation in the dependent variable that is explained by the variation in the independent variable. The coefficient of determination is calculated by squaring the coefficient of correlation. Thus, it is denoted as R2. For instance, if the coefficient of correlation is -1 or +1, a scatter diagram would display all the points lining up in a straight line. The coefficient of determination is 1, which is interpreted to mean that 100% of the variation in the dependent variable Y is explained by the variation in the independent variable X. If the coefficient of correlation is 0, then there is no linear relationship between the two variables, R2= 0, and none of the variation in Y is explained by the variation in X.
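A compact Python sketch of the measures of linear relationship on hypothetical data: sample covariance, coefficient of correlation, least squares coefficients, and the coefficient of determination:

    x = [1, 2, 3, 4, 5]                   # hypothetical independent variable
    y = [2, 4, 5, 7, 9]                   # hypothetical dependent variable
    n = len(x)

    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    s2_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    s2_y = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)

    r = s_xy / (s2_x ** 0.5 * s2_y ** 0.5)   # sample coefficient of correlation
    b1 = s_xy / s2_x                         # least squares slope
    b0 = mean_y - b1 * mean_x                # least squares y-intercept
    print(r, r ** 2)                         # r = 0.995..., R^2 = 0.99...
    print(b0, b1)                            # the line: y-hat = 0.3 + 1.7x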
CHAPTER E: PROBABILITY
E.1 Assigning probability to events
A random experiment is an action or process that leads to one of several possible outcomes. An example of an experiment is flipping a coin. The outcomes of such an experiment would be heads and tails.
The first step in assigning probabilities is to produce a list of the outcomes. The listed outcomes must be:
exhaustive, which means that all possible outcomes must be included; and
mutually exclusive, which means that no two outcomes can occur at the same time.
A list of exhaustive and mutually exclusive outcomes is called a sample space and is denoted by S. The outcomes are denoted by O1, O2, …, Ok. Using set notation, the sample space and its outcomes are represented as
S = {O1, O2, …, Ok}
Given a sample space the probabilities assigned to the outcomes must satisfy two requirements:
The probability of any outcome must lie between 0 and 1. That is,
0 ≤ P(Oi) ≤ 1 for each i
P(Oi) is the notation used to represent the probability of outcome Oi. The sum of the probabilities of all the outcomes in a sample space must be 1. That is,
Σ P(Oi) = 1, where the sum is taken over i = 1, 2, …, k
The three approaches to assigning probabilities are as follows:
The classical approach is used by mathematicians to help determine probabilities associated with games of chance. For instance, because the sum of the probabilities must be 1 and the two outcomes of a coin flip are equally likely, the probability of heads and the probability of tails are both 50% (.5).
The relative frequency approach defines probability as the long-run relative frequency with which an outcome occurs. For instance, let’s assume that a researcher knows that of the last 2000 students who took the mathematics course, 200 received a grade of 8. The relative frequency of 8’s is then 200/2000 or 10%. This figure represents an estimate of the probability of obtaining a grade of 8 in the course. It is only an estimate because the relative frequency approach defines probability as the “long-run” relative frequency. Thus, the larger the number of students whose grades the researcher has observed, the better the estimate becomes.
In the subjective approach probability is defined as the degree of belief that a person holds in the occurrence of an event. This approach is used when it is not reasonable to use the classical approach and there is no history of the outcomes.
An individual outcome of a sample space is called a simple event. All other events are composed of the simple events in a sample space. An event is a collection or set of one or more simple events in a sample space. The probability of an event is the sum of probabilities of the simple events that constitute the event. No matter what method was used to assign probability, it should be interpreted by using the relative frequency approach for an infinite number of experiments.
E.2 Joint, marginal and conditional
The intersection of events A and B is the event that occurs when both A and B occur. It is denoted as
A and B
The probability of the intersection is called the joint probability.
Joint probabilities allow researchers to compute various probabilities. Marginal probabilities, computed by adding across rows or down columns, are so named because they are calculated in the margins of the table.
Researchers often need to know how two events are related. In particular, they would like to know the probability of one event given the occurrence of another related event. Such probability is called a conditional probability and it is represented by
P(B | A)
where the “|” represents the word given.
The probability of an event A given event B is
P(A | B) = P(A and B) / P(B)
The probability of event B given event A is
P(B | A) = P(A and B) / P(A)
One of the objectives of calculating conditional probability is to determine whether two events are related. Two events A and B are said to be independent if
P(A | B) = P(A)
or
P(B | A) = P(B)
In other words, two events are independent if the probability of one event is not affected by the occurrence of the other event.
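A tiny numerical check of these definitions, using hypothetical joint and marginal probabilities chosen so that A and B are independent:

    # Hypothetical joint and marginal probabilities for two events A and B
    p_a_and_b = 0.125
    p_a = 0.25
    p_b = 0.5

    p_a_given_b = p_a_and_b / p_b    # P(A | B) = P(A and B) / P(B) = 0.25
    p_b_given_a = p_a_and_b / p_a    # P(B | A) = P(A and B) / P(A) = 0.5

    # Independence: P(A | B) = P(A), so B's occurrence does not affect A
    print(p_a_given_b == p_a, p_b_given_a == p_b)   # True True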
The union of events A and B is the event that occurs when either A or B or both occur. It is denoted as
A or B
E.3 Probability rules and trees
There are three rules that enable researchers to calculate the probability of more complex events from the probability of simpler events:
The complement of event A is the event that occurs when event A does not occur. The complement of event A is denoted by Aᶜ. The complement rule derives from the fact that the probability of an event and the probability of the event's complement must sum to 1. Thus, P(Aᶜ) = 1 – P(A) for any event A.
The multiplication rule is used to calculate the joint probability of two events. It is based on the formula for conditional probability. The joint probability of any two events A and B is P(A and B) = P(B)·P(A | B) or, altering the notation, P(A and B) = P(A)·P(B | A).
If A and B are independent events, P(A | B) = P(A) and P(B | A) = P(B). It follows that the joint probability of two independent events is simply the product of the probabilities of the two events. This can be expressed in a special form of the multiplication rule: the joint probability of any two independent events A and B is P(A and B) = P(A)·P(B).
The addition rule enables researchers to calculate the probability of the union of two events.
The probability that event A, or event B, or both occur is
P(A or B) = P(A) + P(B) – P(A and B)
The probability of the union of two mutually exclusive events A and B is
P(A or B) = P(A) + P(B)
An effective and simpler method of applying the probability rules is the probability tree, wherein the events in an experiment are represented by lines. Once the tree is drawn and the probabilities of the branches inserted, the only allowable calculation is the multiplication of the probabilities of linked branches. An easy check on those calculations is available.
The joint probabilities at the ends of the branches must sum to 1, because all possible events are listed. An example of the probability tree is presented in Figure 6.1 on page 188.
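A minimal sketch of the branch arithmetic for a hypothetical two-stage experiment (the probabilities are invented for illustration): each joint probability is the product of the linked branch probabilities, and the joint probabilities at the branch ends sum to 1:

    # Hypothetical two-stage experiment: a first attempt succeeds with
    # probability .72; if it fails, a second attempt succeeds with probability .88
    p1 = 0.72
    branch_probabilities = {
        "pass on first attempt": p1,
        "fail, then pass": (1 - p1) * 0.88,   # multiply along linked branches
        "fail, then fail": (1 - p1) * 0.12,
    }
    print(branch_probabilities)
    print(sum(branch_probabilities.values()))  # the joint probabilities sum to 1.0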
CHAPTER F: DISCRETE PROBABILITY DISTRIBUTIONS
F.1 Random variables
A random variable is a function or rule that assigns a number to each outcome of an experiment. In some experiments the outcomes are numbers. For instance, when the amount of time needed to complete some activity is measured, the experiment produces events that are numbers. Thus, the value of a random variable is a numerical event.
There are two types of random variables:
A discrete random variable is one that can take on a countable number of values.
A continuous random variable is one whose values are uncountable. One example of a continuous random variable is the amount of time needed to complete a task. In an attempt to count the number of values that X can take on, a researcher needs to identify the next value. However, it is not possible to identify the second, or third, or any other values of X because, for instance, there is always a value larger than 45 min and smaller than 45.001 min. Thus, it is not possible to count the number of values, and X is continuous.
A probability distribution is a table, formula, or graph that describes the values of a random variable and the probability associated with these values. An uppercase letter, usually X, represents the name of the random variable. Its lowercase counterpart represents the value of the random variable. Thus, the probability that the random variable X will equal x is represented as P(X = x), or more simply P(x).
The probabilities of the values of a discrete random variable may be derived by means of probability tools such as tree diagrams or by applying one of the definitions of probability. However, two fundamental requirements apply.
Requirements for a distribution of a discrete random variable are as follows:
0 ≤ P(x) ≤ 1 for all x, and
Σ P(x) = 1
where the random variable can assume the values x and P(x) is the probability that the random variable equals x.
Probability distributions often represent populations. Rather than record each of the many observations in a population, the values and their associated probabilities are listed. These can be used to compute the mean and variance of the population.
The population mean is the weighted average of all its values. The weights are the probabilities. This parameter is also called the expected value of X and is represented by E(X).
Population mean: E(X) = μ = Σ x·P(x)
The population variance is calculated similarly. It is the weighted average of the squared deviations from the mean.
Population variance: V(X) = σ² = Σ (x – μ)²·P(x)
Shortcut calculation for population variance:
σ² = Σ x²·P(x) – μ²
The standard deviation is as defined before:
Population standard deviation: σ = √σ²
Researchers often create new variables that are functions of other random variables. The formulas given below allow determining the expected value and variance of these new variables. In the notation used here, X is the random variable and c is a constant.
Laws of expected value
E (c) = c
E (X + c) = E (X) + c
E (cX) = cE (X)
Laws of Variance
V (c ) = 0
V (X + c ) = V (X)
V(cX) = c²V(X)
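The following sketch applies these definitions to a hypothetical discrete distribution, computing E(X) and V(X) by both the definitional and the shortcut formulas:

    # Hypothetical probability distribution of a discrete random variable X
    dist = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}   # pairs of x and P(x); probabilities sum to 1

    mean = sum(x * p for x, p in dist.items())                      # E(X) = sum of x P(x)
    variance = sum((x - mean) ** 2 * p for x, p in dist.items())    # sum of (x - mu)^2 P(x)
    shortcut = sum(x * x * p for x, p in dist.items()) - mean ** 2  # sum of x^2 P(x) - mu^2
    print(mean, variance, shortcut)    # approximately 1.7, 0.81, 0.81 (both forms agree)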
F.2 Binomial distribution
The binomial distribution is the result of a binomial experiment, which has the following properties:
The binomial experiment consists of a fixed number of trials. The number of trials is represented by n.
On each trial there are two possible outcomes. One outcome is labeled a success, and the other one a failure.
The probability of success is p. The probability of failure is 1- p.
The trials are independent, which means that the outcome of one trial does not affect the outcome of any other trials.
If properties 2, 3, and 4 are satisfied, each trial is called a Bernoulli process. Adding property 1 yields the binomial experiment. The random variable of a binomial experiment is defined as the number of successes in the n trials. It is called the binomial random variable.
The binomial random variable is the number of successes in the experiment's n trials. It can take on the values 0, 1, 2, …, n; thus, the random variable is discrete. To proceed, researchers must be capable of calculating the probability associated with each value. Using a probability tree, they can draw a series of branches (an example is presented in Figure 7.2 on page 236). The stages represent the outcomes for each of the trials. At each stage there are two branches representing success and failure. To calculate the probability that there are x successes in n trials, it should be noted that for each success in the sequence the probability is multiplied by p. And, if there are x successes, there must be n – x failures. For each failure in the sequence, the probability is multiplied by 1 – p.
Thus, each sequence of branches that represents x successes and n – x failures has probability
p^x (1 – p)^(n–x)
There are a number of branches that yield x successes and n – x failures. For instance, there are two ways to produce exactly one success and one failure in two trials – SF and FS. To count the number of branch sequences that produce x successes and n – x failures, the combinatorial formula has to be used:
Cnx = n! / (x!(n – x)!)
where n! = n(n – 1)(n – 2)…(2)(1). For instance, 3! = 3(2)(1) = 6. Note that 0! = 1.
The two components of the probability distribution, put together, result in the following:
Binomial Probability Distribution
The probability of x successes in a binomial experiment with n trials and probability of success = p is
P(x) = [n! / (x!(n – x)!)] p^x (1 – p)^(n–x) for x = 0, 1, 2, …, n
The formula of the binomial distribution allows researchers to determine the probability that X equals individual values. There are many circumstances where they wish to find the probability that a random variable is less than or equal to a value.
That is, they want to determine P(X ≤ x), where x is that value. Such a probability is called a cumulative probability.
Table 1 in Appendix B provides cumulative binomial probabilities for selected values of n and p. Let's say that there are 10 trials, the probability of success is 1/5 (.2), and a researcher needs to find P(X ≤ 4). In Table 1, he can find n = 10 and then the column p = .20. The values in that column are P(X ≤ x) for
x = 0, 1, 2, 3, …, 10
The table can also be used to determine probabilities of the type P(X ≥ x) and P(X = x).
Using Table 1 to find the binomial probability P(X ≥ x):
P(X ≥ x) = 1 – P(X ≤ [x – 1])
Using Table 1 to find the binomial probability P(X = x):
P(x) = P(X ≤ x) – P(X ≤ [x – 1])
General formulas for the mean, variance, and standard deviation of a binomial random variable are:
µ = np
σ² = np(1 – p)
σ = √(np(1 – p))
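A short Python sketch of the binomial calculations above for hypothetical values n = 10 and p = .2, reproducing the kinds of numbers that would otherwise be read from Table 1:

    from math import comb, sqrt

    n, p = 10, 0.2                            # hypothetical: 10 trials, P(success) = .2

    def binom_pmf(x):
        # P(x) = [n! / (x!(n - x)!)] p^x (1 - p)^(n - x)
        return comb(n, x) * p ** x * (1 - p) ** (n - x)

    print(sum(binom_pmf(x) for x in range(5)))        # P(X <= 4), a cumulative probability
    print(1 - sum(binom_pmf(x) for x in range(4)))    # P(X >= 4) = 1 - P(X <= 3)
    print(n * p, n * p * (1 - p), sqrt(n * p * (1 - p)))  # mean, variance, std deviation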
CHAPTER G: CONTINUOUS PROBABILITY DISTRIBUTIONS
G.1 Density functions
A continuous random variable is one that can assume an uncountable number of values. Because this type of random variable is so different from a discrete variable, it is treated completely differently. First, it is not possible to list the possible values because there is an infinite number of them. Second, because there is an infinite number of values, the probability of each individual value is virtually 0. Consequently, the probability of only a range of values can be determined. To illustrate how this is done, a histogram should be considered. If the histogram is drawn with a large number of small intervals, the edges of the rectangles can be smoothed to produce a curve. In many cases it is possible to determine a function ƒ(x) that approximates the curve. The function is called a probability density function (an example of density function is presented in Figure 8.3 on page 255).
The following requirements apply to a probability density function ƒ(x) whose range is a ≤ x ≤ b:
ƒ(x) ≥ 0 for all x between a and b.
The total area under the curve between a and b is 1.0.
To illustrate how to find the area under the curve that describes a probability density function, the uniform probability distribution, also called the rectangular probability distribution should be considered.
The uniform distribution is described by the function
ƒ(x) = 1 / (b – a), a < x < b
An example of such function is presented in figure 8.4 on page 256.
To calculate the probability of any interval, a researcher has to find the area under the curve. For instance, to find the probability that X falls between x1 and x2, he or she has to determine the area of the rectangle whose base is x2 – x1 and whose height is 1/(b – a):
P(x1 < X < x2) = base × height = (x2 – x1) × 1/(b – a)
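A minimal sketch of this uniform probability calculation for a hypothetical interval from a = 20 to b = 60:

    a, b = 20.0, 60.0                  # hypothetical interval for the uniform distribution
    height = 1 / (b - a)               # f(x) = 1/(b - a) for a < x < b

    def uniform_prob(x1, x2):
        # P(x1 < X < x2) = base * height = (x2 - x1) * 1/(b - a)
        return (x2 - x1) * height

    print(uniform_prob(25, 35))        # 0.25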
A continuous distribution is frequently used to approximate a discrete one when the number of values the variable can assume is countable but large.
G.2 Normal distribution
The normal distribution is the most important of all probability distributions because of its crucial role in statistical inference.
Normal Density Function
The probability density function of a normal random variable is
ƒ(x) = [1 / (σ√(2π))] e^(–(1/2)((x – μ)/σ)²), –∞ < x < ∞
where e = 2.71828… and π = 3.14159…
The curve in a normal distribution is symmetric about its mean, and the random variable ranges between –∞ and +∞. The normal distribution is described by two parameters: the mean μ and the standard deviation σ.
To calculate the probability that a normal random variable falls into any interval, the area in the interval under the curve must be computed. A probability table will be used, similar to Tables 1 and 2 in Appendix B, which are used to calculate binomial and Poisson probabilities, respectively. To determine binomial probabilities from Table 1, probabilities for values of n and a separate column for selected values of p were needed. Similarly, to find Poisson probabilities, a separate column for each chosen value of µ had to be included in Table 2. To reduce the number of tables needed in calculating normal probabilities, researchers standardize the random variable. A random variable is standardized by subtracting its mean and dividing by its standard deviation. When the variable is normal, the transformed variable is called a standard normal random variable and is denoted by Z. That is,
Z = (X – µ) / σ
The probability statement about X is transformed by this formula into a statement about Z. Example 8.2 on page 261 illustrates how to process this formula.
The values of Z specify the location of the corresponding value of X. A value of z = 1 corresponds to a value of x that is 1 standard deviation above the mean. Note: the mean of Z, which is 0, corresponds to the mean of X.
If the mean and standard deviation of a normally distributed random variable are known, researchers can transform the probability statement about X into a probability statement about Z. Consequently, only one table is needed – Table 3 in Appendix B, the standard normal probability table. The table lists cumulative probabilities P(Z < z) for values of z ranging from –3.09 to +3.09. To use the table, the value of z has to be found and the probability can simply be read off. For instance,
P(Z < 3.00) is found by locating 3.0 in the left margin and reading under the heading .00.
The probability P(Z < 2.13) is found in the row 2.1 under the heading .03.
It is also possible to determine the probability that the standard normal variable is greater than some value of z. For instance, a researcher can find the probability that Z is greater than 2.19 by determining the probability that Z is less than 2.19 and subtracting that value from 1.
Note: any areas beyond 3.10 are approximated as 0. Thus,
P(Z > 3.10) = P(Z < –3.10) ≈ 0
Some statistical problems require researchers to determine the value of z given a probability. The notation ZA is used to represent the value of z such that the area to its right under the standard normal curve is A.
Thus ZA is a value of a standard normal random variable such that
P(Z > ZA) = A
To find ZA for any value of A, researchers must use the standard normal table backward: they specify a probability and then determine the z-value associated with it. For instance, to find Z.02 a researcher has to start by determining the area less than Z.02, which is 1 – .02 = .9800. Consequently, he or she has to search through the probability part of the table looking for .9800. When it is located, the z-value associated with it can be read from the table. Note: if the probability cannot be found in the table, the closest value should be read. If two values are equally close, their average should be taken.
Percentiles are measures of relative standing. The values of ZA are the 100(1 – A)th percentiles of a standard normal random variable. For instance, Z.01 = 2.33, which means that 2.33 is the 99th percentile; 99% of all values of Z are below it and 1% are above it.
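These table lookups can be reproduced with Python's standard library NormalDist class (available from Python 3.8), which provides the cumulative probability and its inverse:

    from statistics import NormalDist

    z = NormalDist(mu=0, sigma=1)      # the standard normal random variable Z

    print(z.cdf(3.00))                 # P(Z < 3.00)
    print(z.cdf(2.13))                 # P(Z < 2.13)
    print(1 - z.cdf(2.19))             # P(Z > 2.19) = 1 - P(Z < 2.19)
    print(z.inv_cdf(0.9800))           # Z.02, the 98th percentile
    print(z.inv_cdf(0.9900))           # Z.01 = 2.33, the 99th percentile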
G.3 Other continuous distributions
Student t Density Function
The density function of the Student t distribution is as follows:
ƒ(t) = [Γ((ν + 1)/2) / (√(νπ) Γ(ν/2))] [1 + t²/ν]^(–(ν + 1)/2)
where ν is the parameter of the Student t distribution, called the degrees of freedom, π = 3.14159…, and Γ is the gamma function.
The mean and variance of a Student t random variable are
E(t) = 0
And
V(t) = ν / (ν – 2) for ν > 2
This distribution is similar to the standard normal distribution. Both are symmetrical about 0. (Both random variables have a mean of 0.) However, the Student t distribution is described as mound shaped, whereas the normal distribution is bell shaped.
For each value of ν (the number of degrees of freedom), there is a different Student t distribution. If researchers wanted to calculate probabilities of the Student t random variable manually, they would need a different table for each ν. Thus, these probabilities are usually computed with the aid of a computer.
The Student t distribution is used extensively in statistical inference. Researchers often need to find values of the random variable for inferential methods. Table 4 in Appendix B lists values of tA,ν, which are the values of a Student t random variable with ν degrees of freedom such that
P(t > tA,ν) = A
Note: tA,ν is provided for degrees of freedom ranging from 1 to 200. To read this table, the degrees of freedom have to be identified and then that value, or the closest number to it, can be found. Then, the column representing the desired tA,ν value can be located.
Chi-squared Density Function
The chi-squared density function is
ƒ(χ²) = [1 / (Γ(ν/2) 2^(ν/2))] (χ²)^((ν/2) – 1) e^(–χ²/2), χ² > 0
The parameter ν is the number of degrees of freedom, which, like the degrees of freedom of the Student t distribution, affects the shape.
A chi-squared distribution is positively skewed, ranging between 0 and ∞. Like the Student t distribution, its shape depends on its number of degrees of freedom.
The mean and variance of a chi-squared random variable are:
E(χ²) = ν
and
V(χ²) = 2ν
The value of a chi-squared random variable with ν degrees of freedom such that the area to its right under the chi-squared curve is equal to A is denoted χ²A,ν. The notation –χ²A,ν cannot be used to represent the point such that the area to its left is A, because χ² is always greater than 0. To represent left-tail critical values, note that if the area to the left of a point is A, the area to its right must be 1 – A, because the entire area under the chi-squared curve (as under all continuous distributions) must equal 1. Thus χ²1–A,ν denotes the point such that the area to its left is A.
Table 5 in Appendix B lists critical values of the chi-squared distribution for degrees of freedom equal to 1 to 30, 40, 50, 60, 70, 80, 90, and 100. For instance, to find the point in a chi-squared distribution with 10 degrees of freedom such that the area to its right is .05, a researcher has to locate 10 degrees of freedom in the left column and χ².050 across the top. The intersection of the row and column contains the number sought.
To find the point in the same distribution such that the area to its left is .05, the researcher has to find the point such that the area to its right is .95. He has to locate χ².950 across the top row and 10 degrees of freedom down the left column.
For values of degrees of freedom greater than 100, the chi-squared distribution can be approximated by a normal distribution with µ = ν and σ = √2ν.
F Density Function
The density function of the F distribution is:
ƒ(F) = [Γ((ν1 + ν2)/2) / (Γ(ν1/2) Γ(ν2/2))] (ν1/ν2)^(ν1/2) F^((ν1/2) – 1) [1 + (ν1/ν2)F]^(–(ν1 + ν2)/2), F > 0
where F ranges from 0 to ∞ and ν1 and ν2 are the parameters of the distribution called degrees of freedom. ν1 is called the numerator degrees of freedom and ν2 is called the denominator degrees of freedom.
The mean and variance of an F random variable are
E(F) = ν2 / (ν2 – 2) for ν2 > 2
and
V(F) = [2ν2²(ν1 + ν2 – 2)] / [ν1(ν2 – 2)²(ν2 – 4)] for ν2 > 4
Note: the mean depends only on the denominator degrees of freedom, and for large ν2 the mean of the F distribution is approximately 1. The F distribution is positively skewed. Its actual shape depends on the two numbers of degrees of freedom. FA,ν1,ν2 is defined as the value of F with ν1 and ν2 degrees of freedom such that the area to its right under the curve is A. Thus,
P(F > FA,ν1,ν2) = A
Because the F random variable, like the chi-squared, can equal only positive values, F1–A,ν1,ν2 is defined as the value such that the area to its left is A. Table 6 in Appendix B provides values of FA,ν1,ν2 for A = .05, .025, .01, and .005. Values of F1–A,ν1,ν2 are unavailable. However, they are not needed, because they can be determined from FA,ν2,ν1 (note the reversed degrees of freedom). Thus,
F1–A,ν1,ν2 = 1 / FA,ν2,ν1
To determine any critical value, researchers have to find the numerator degrees of freedom ν1 across the top of Table 6 and the denominator degrees of freedom ν2 down the left column. The intersection of the row and the column contains the number sought. Note: the order in which the degrees of freedom appear is important.
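Assuming SciPy is available, its stats module can reproduce the critical values that would otherwise be read from Tables 4, 5, and 6; ppf is the inverse of the cumulative distribution function, so a right-tail area A corresponds to ppf(1 – A):

    from scipy import stats

    # Student t: value with 10 degrees of freedom and right-tail area .05
    print(stats.t.ppf(1 - 0.05, df=10))

    # Chi-squared with 10 degrees of freedom: right- and left-tail areas of .05
    print(stats.chi2.ppf(1 - 0.05, df=10))
    print(stats.chi2.ppf(0.05, df=10))

    # F critical values; note the swapped degrees of freedom in the reciprocal
    f_upper = stats.f.ppf(1 - 0.05, dfn=5, dfd=20)      # F with right-tail area .05
    f_lower = 1 / stats.f.ppf(1 - 0.05, dfn=20, dfd=5)  # F with left-tail area .05
    print(f_upper, f_lower)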
CHAPTER H: SAMPLING DISTRIBUTIONS
H.1 Sampling distribution of the mean
There are two ways to create a sampling distribution:
Samples can be actually drawn of the same size from a population, the statistic of interest can be calculated, and then descriptive techniques can be used to learn more about the sampling distribution.
The second method relies on the rules of probability and the laws of expected value and variance to derive the sampling distribution.
Let's assume that a population is created by throwing a fair die infinitely many times, with the random variable X indicating the number of spots showing on any one throw. The population is infinitely large, because the die can be thrown infinitely many times. From the definitions of expected value and variance, researchers can calculate the population mean, variance, and standard deviation.
Population mean: μ = Σ x·P(x)
Population variance: σ² = Σ (x – μ)²·P(x)
Population standard deviation: σ = √σ²
The sampling distribution is created by drawing samples of size 2 from the population. Thus, two dice are tossed. In this process the mean is computed for each sample. Because the value of the sample mean varies randomly from sample to sample, x̄ can be regarded as a new random variable created by sampling. Mean of the sampling distribution of x̄:
μx̄ = Σ x̄·P(x̄)
Note: the mean of the sampling distribution of x̄ is equal to the mean of the population of the toss of a die computed previously.
Variance of the sampling distribution of x̄:
σ²x̄ = Σ (x̄ – μx̄)²·P(x̄)
The distribution of x̄ is different from the distribution of X. However, the two random variables are related: their means are the same (μx̄ = μ) and their variances are related (σ²x̄ = σ²/2).
μ and σ² are the parameters of the population of X. To create the sampling distribution of x̄, a researcher repeatedly drew samples of size n = 2 from the population and calculated x̄ for each sample.
Thus, x̄ is treated as a brand-new random variable, with its own distribution, mean, and variance. The mean is denoted μx̄ and the variance is denoted σ²x̄.
For each value of n, the mean of the sampling distribution of x̄ is the mean of the population from which the sample is taken. Thus, μx̄ = μ.
The variance of the sampling distribution of the sample mean is the variance of the population divided by the sample size. That is,
σ²x̄ = σ² / n
The standard deviation of the sampling distribution is called the standard error of the mean. That is,
σx̄ = σ / √n
The variance of the sampling distribution of x̄ is less than the variance of the population being sampled, for all sample sizes. Thus, a randomly selected value of x̄ (the mean of the number of spots observed in, say, 8 throws of the die) is likely to be closer to the mean value than is a randomly selected value of X (the number of spots observed in one throw). The sampling distribution of x̄ becomes narrower (or more concentrated about the mean) as n increases. Also, as n gets larger, the sampling distribution of x̄ becomes increasingly bell shaped. This phenomenon is summarized in the central limit theorem:
The sampling distribution of the mean of a random sample drawn from any population is approximately normal for a sufficiently large sample size. The larger the sample size, the more closely the sampling distribution of will resemble a normal distribution.
The accuracy of the approximation alluded to in the central limit theorem depends on the probability distribution of the population and on the sample size. If the population is nonnormal, then x̄ is approximately normal only for larger values of n. However, if the population is extremely nonnormal, the sampling distribution will also be nonnormal, even for moderately large values of n.
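A small simulation illustrating the central limit theorem for the die-throwing example: drawing many samples of size n = 8 and examining the mean and variance of the resulting sample means (the seed and number of samples are arbitrary choices):

    import random
    import statistics

    random.seed(1)                      # arbitrary seed for reproducibility

    def sample_mean(n):
        # mean number of spots in n throws of a fair die
        return statistics.mean(random.randint(1, 6) for _ in range(n))

    means = [sample_mean(8) for _ in range(10_000)]   # many samples of size n = 8
    print(statistics.mean(means))       # close to the population mean, 3.5
    print(statistics.variance(means))   # close to sigma^2 / n = (35/12) / 8 = 0.365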
The mean of the sampling distribution is always equal to the mean of the population and the standard error is equal to σ/√n for infinitely large populations. However, if the population is finite the standard error is:
σx̄ = (σ / √n) √((N – n) / (N – 1))
where N is the population size and √((N – n) / (N – 1)) is called the finite population correction factor.
If the population size is large relative to the sample size, the finite population correction factor is close to 1 and can be ignored. As a rule of thumb, any population that is at least 20 times larger than the sample size is treated as large.
H.2 Sampling distribution of a sample proportion
Sampling Distribution of the Sample Mean
figuur 29
σ (2/ x) = σ 2 / n and figuur 30
If X is non normal, is approximately normal for sufficiently large sample sizes. The definition of “sufficiently large” depends on the extent of non normality of X.
The sampling distribution can be used to make inferences about population parameters. Previously, in order to compute binomial probabilities, it was assumed that p is known. However, in the real world p is unknown, requiring the researcher to estimate its value from a sample. The estimator of a population proportion of successes is the sample proportion. Thus, researchers count the number of successes in a sample and compute
p̂ = x/n
where x is the number of successes and n is the sample size. When a sample of size n is taken, a researcher is actually conducting a binomial experiment, and as a result X is binomially distributed. Thus, the probability of any value of p̂ can be calculated from the corresponding value of X.
Using the laws of expected value and variance, the mean, variance, and standard deviation of p̂ can be determined.
Sampling distribution of a Sample Proportion
p̂ is approximately normally distributed provided that np and n(1 − p) are greater than or equal to 5.
The expected value: E(p̂) = p
The variance: σ²p̂ = p(1 − p)/n
The standard deviation: σp̂ = √(p(1 − p)/n)
(The standard deviation is called the standard error of the proportion.)
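To make the formulas concrete, here is a small sketch (the values of p and n are hypothetical, chosen only for illustration):
import math
p, n = 0.4, 100                          # assumed population proportion and sample size
assert n * p >= 5 and n * (1 - p) >= 5   # the normal-approximation condition above
se = math.sqrt(p * (1 - p) / n)          # standard error of the proportion
print(se)                                # ≈ 0.049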
H.3 Sampling distribution of the difference between two sample means
In the sampling process, independent random samples should be drawn from each of two normal populations. The samples are said to be independent if the selection of the members of one sample is independent of the selection of the members of the second sample.
The central limit theorem states that in repeated sampling from a normal population whose mean is μ and whose standard deviation is σ, the sampling distribution of the sample mean is normal with mean μ and standard deviation σ/√n. The difference between two independent normal random variables is also normally distributed. Thus, the difference between two sample means, x̄1 − x̄2, is normally distributed if both populations are normal. Through the use of the laws of expected value and variance, the expected value and variance of the sampling distribution of x̄1 − x̄2 can be derived:
E(x̄1 − x̄2) = μ1 − μ2
and
σ²(x̄1 − x̄2) = σ1²/n1 + σ2²/n2
Thus, in repeated independent sampling from two populations with means μ1 and μ2 and standard deviations σ1 and σ2, respectively, the sampling distribution of x̄1 − x̄2 is normal with mean
μ1 − μ2
and standard deviation (which is the standard error of the difference between two means)
√(σ1²/n1 + σ2²/n2)
If the populations are nonnormal, then the sampling distribution is only approximately normal for large sample sizes. The required sample size depends on the extent of nonnormality. However, for most populations, sample sizes of 30 or more are sufficient.
H.4 From here to inference
The primary function of the sampling distribution is statistical inference. In applying probability and sampling distributions, the values of the relevant parameters must be known; in the real world, however, parameters are almost always unknown. Statistical inference addresses this problem. From this point on, it is assumed that most population parameters are unknown.
CHAPTER I: ESTIMATION
I.1 Concepts of estimation
The objective of estimation is to determine the approximate value of a population parameter on the basis of a sample statistic. For instance, the sample mean is used to estimate the population mean. The sample mean is referred to as the estimator of the population mean. Once the sample mean has been computed, its value is called the estimate.
Sample data can be used to estimate a population parameter in two ways:
By using a point estimator, or
By using an interval estimator.
A point estimator draws inferences about a population by estimating the value of an unknown parameter using a single value or point. There are three drawbacks to using this estimator:
it is virtually certain that the estimate will be wrong. (The probability that a continuous random variable will equal a specific value is 0.)
generally, researchers need to know how close the estimator is to the parameter.
point estimators do not have the capacity to reflect the effects of larger sample sizes.
An interval estimator draws inferences about a population by estimating the value of an unknown parameter using an interval. The interval estimator is affected by the sample size.
An unbiased estimator of a population parameter is an estimator whose expected value is equal to that parameter. In other words, if a researcher was to take an infinite number of samples and calculate the value of the estimator in each sample, the average value of the estimator would equal the parameter. Thus, on average, the sample statistic is equal to the parameter.
The sample mean x̄ is an unbiased estimator of the population mean μ, because E(x̄) = μ. The sample proportion p̂ is an unbiased estimator of the population proportion p, because E(p̂) = p. The difference between two sample means is an unbiased estimator of the difference between two population means, because E(x̄1 − x̄2) = μ1 − μ2.
Previously, the sample variance was defined as:
s² = Σ(xi − x̄)² / (n − 1)
The reason for choosing to divide by n − 1 is to make E(s²) = σ², so that this definition makes the sample variance an unbiased estimator of the population variance.
If the sample variance was defined using n in the denominator, the resulting statistic would be a biased estimator of the population variance, one whose expected value is less than the parameter. If an estimator is unbiased researchers can be sure that its expected value equals the parameter; however, it does not say how close the estimator is to the parameter. Another desirable quality is consistency - an unbiased estimator is said to be consistent if the difference between the estimator and the parameter grows smaller as the sample size grows larger.
The measure which is used to estimate closeness is the variance (or the standard deviation). Thus, x̄ is a consistent estimator of μ, because the variance of x̄ is σ²/n. This implies that as n grows larger, the variance of x̄ grows smaller. As a consequence, an increasing proportion of sample means falls close to μ.
Similarly, p̂ is a consistent estimator of p because it is unbiased and the variance of p̂ is p(1 − p)/n, which grows smaller as n grows larger.
The last desirable quality is relative efficiency, which compares two unbiased estimators of a parameter: If there are two unbiased estimators of a parameter, the one whose variance is smaller is said to be relatively more efficient.
I.2 Estimate population mean
Confidence Interval Estimator of µ
x̄ ± zα/2 σ/√n
The probability 1 − α is called the confidence level.
x̄ − zα/2 σ/√n is called the lower confidence limit (LCL).
x̄ + zα/2 σ/√n is called the upper confidence limit (UCL).
The confidence interval estimator is often represented as:
x̄ ± zα/2 σ/√n
where the minus sign defines the lower confidence limit and the plus sign defines the upper confidence limit.
To apply this formula, a researcher has to specify the confidence level 1 − α, from which α, α/2, and zα/2 can be determined (from Table 3 in Appendix B). Because the confidence level is the probability that the interval includes the actual value of μ, researchers usually set 1 − α close to 1 (i.e., between .90 and .99).
The four most commonly used confidence levels and their associated value of zα/2 are listed:
1 – α zα/2
.90 z.05 = 1.645
.95 z.025 = 1.96
.98 z.01 = 2.33
.99 z.005 = 2.575
Let’s assume that the confidence level is 1 – α = .98. Consequently, α = .02, α/2 = .01, and zα/2 = z.01 = 2.33. The resulting confidence interval estimator is called the 98% confidence interval estimator of µ.
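The table values can be reproduced with any normal-quantile routine. A sketch using scipy (an assumed dependency, not mentioned in the original) follows; note that the table rounds 2.326 to 2.33 and 2.576 to 2.575:
from scipy.stats import norm
for conf in (0.90, 0.95, 0.98, 0.99):
    alpha = 1 - conf
    z = norm.ppf(1 - alpha / 2)   # z such that P(Z > z) = alpha/2
    print(conf, round(z, 3))      # 1.645, 1.96, 2.326, 2.576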
The confidence interval estimate of μ cannot be treated as a probability statement about μ. However, the confidence interval estimator is a probability statement about the sample mean. It states that there is a 1 − α probability that the sample mean will equal a value such that the interval x̄ − zα/2 σ/√n to x̄ + zα/2 σ/√n will include the population mean. Once the sample mean is computed, the interval acts as the lower and upper limits of the interval estimate of the population mean.
The width of the confidence interval estimate is a function of the population standard deviation, the confidence level, and the sample size. Thus:
Doubling the population standard deviation has the effect of doubling the width of the confidence interval estimate.
Decreasing the confidence level narrows the interval; increasing it widens the interval. Note: a large confidence level is generally desirable, since it means a larger proportion of confidence interval estimates will be correct in the long run. As a general rule, 95% confidence is considered “standard.”
Increasing the sample size fourfold decreases the width of the interval by half. A larger sample size provides more potential information. The increased amount of information is reflected in a narrower interval.
I.3 Selecting sample size
Sampling error is the difference between the sample and the population that exists only because of the observations that happened to be selected for the sample. Sampling error can also be defined as the difference between an estimator and a parameter, also called the error of estimation. This can be expressed as the difference between x̄ and μ.
With probability 1 − α, the error of estimation is less than zα/2 σ/√n. This means that zα/2 σ/√n is the maximum error of estimation that a researcher is willing to tolerate. This value is labeled B, which stands for the bound on the error of estimation. That is,
B = zα/2 σ/√n
The equation for n can be solved if the population standard deviation σ, the confidence level 1 – α, and the bound on the error of estimation B are known.
Thus,
Sample size to estimate a mean: n = (zα/2 σ / B)²
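A worked sketch of the formula (σ, B, and the confidence level are assumed values for illustration):
import math
sigma, B, z = 10, 2, 1.96            # assumed: σ = 10, bound B = 2, 95% confidence
n = math.ceil((z * sigma / B) ** 2)  # round up, since n must be a whole number
print(n)                             # 97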
CHAPTER J: HYPOTHESIS TESTING
J.1 Concepts of hypothesis testing
Hypothesis testing is the second general procedure used for making inferences about a population.
The most important concepts in hypothesis testing are the following:
There are two hypotheses. One is called the null hypothesis and is represented by H0 and the other is called the alternative or research hypothesis and is represented by H1. The null hypothesis will always state that the parameter equals the value specified in the alternative hypothesis
The testing procedure begins with the assumption that the null hypothesis is true.
The goal of the process is to determine whether there is enough evidence to infer that the alternative hypothesis is true.
There are two possible decisions:
Conclude that there is enough evidence to support the alternative hypothesis.
Conclude that there is not enough evidence to support the alternative hypothesis.
There are two possible errors. A Type I error occurs when a true null hypothesis is rejected. A Type II error is defined as not rejecting a false null hypothesis. The probability of a Type I error is denoted by α, which is also called the significance level. The probability of a Type II error is denoted by β. The error probabilities α and β are inversely related, meaning that any attempt to reduce one will increase the other.
P(Type I error) = α
P(Type II error) = β
The hypotheses are often set up to reflect a manager’s decision problem, with the null hypothesis representing the status quo. The next step in the process is to randomly sample the population and calculate the sample mean. This is called the test statistic. The test statistic is the criterion upon which researchers base their decision about the hypotheses. The test statistic is based on the best estimator of the parameter (e.g., the best estimator of a population mean is the sample mean).
If the test statistic’s value is inconsistent with the null hypothesis, researchers reject the null hypothesis and infer that the alternative hypothesis is true. In the absence of sufficient evidence, the null hypothesis cannot be rejected in favour of the alternative.
It seems reasonable to reject the null hypothesis in favour of the alternative if the value of the sample mean is large relative to the hypothesized population mean. If, for instance, the sample mean were calculated to be 300 while the population mean was expected to be 100, it would be quite apparent that the null hypothesis is false and has to be rejected.
On the other hand, values of x̄ close to 100, for instance 97, do not allow researchers to reject the null hypothesis, because it is entirely possible to observe a sample mean of 97 from a population whose mean is 100. To make a decision about such a sample mean, they have to set up the rejection region.
J.2 Testing population mean
The rejection region is a range of values such that if the test statistic falls into that range, the researcher rejects the null hypothesis in favour of the alternative hypothesis.
To calculate the rejection region, let x̄L denote the value of the sample mean that is just large enough to reject the null hypothesis. The rejection region is:
x̄ > x̄L
Since a Type I error is defined as rejecting a true null hypothesis, and the probability of committing a Type I error is α, it follows that
α = P(x̄ > x̄L, given that the null hypothesis is true)
The sampling distribution of x̄ is normal or approximately normal, with mean μ and standard deviation σ/√n. As a result, x̄ can be standardized to obtain the following probability:
P((x̄ − μ)/(σ/√n) > (x̄L − μ)/(σ/√n)) = α
zα is the value of a standard normal random variable such that
P(Z > zα) = α
Since both probability statements involve the same distribution (standard normal) and the same probability (α), it follows that the limits are identical. Thus,
(x̄L − μ)/(σ/√n) = zα
To calculate the rejection region x̄ > x̄L, a researcher needs the sample size (n), the standard deviation (σ), the population mean (μ) specified by the null hypothesis, and the significance level (α).
Let’s assume that the sample mean was computed to be 110 and the rejection region was calculated to be 105.47. Because the test statistic (sample mean) is in the rejection region (it is greater than 105.47), the null hypothesis has to be rejected.
The preceding test used x̄ as the test statistic; as a result, the rejection region had to be set up in terms of x̄. An easier method specifies that the test statistic be the standardized value of x̄. That is, researchers can use the standardized test statistic:
z = (x̄ − μ)/(σ/√n)
and the rejection region consists of all values of z that are greater than zα. Algebraically, the rejection region is
z > zα
The standardized test statistic can be used, thus
z > zα = z.05 = 1.645 (when the significance level is 5%)
Consequently, the value of the test statistic has to be calculated (using the formula z = (x̄ − μ)/(σ/√n)) and the result compared to the rejection region (1.645 in this case). If the result is greater than 1.645, the researcher has to reject the null hypothesis.
The conclusions drawn from using the test statistic x̄ and the standardized test statistic z are identical.
When a null hypothesis is rejected, the test is said to be statistically significant at whatever significance level the test was conducted. For instance, it can be said that the test was significant at the 5% significance level.
The p-value of a test is the probability of observing a test statistic at least as extreme as the one computed, given that the null hypothesis is true. An example of calculating a p-value:
p-value = P(x̄ ≥ 178, given that μ = 170) = .0069
In this case the probability of observing a sample mean at least as large as 178 from a population whose mean is 170 is .0069, which is very small. In other words, it is an unlikely event. If the null hypothesis is H0: μ = 170, such a p-value gives a reason to reject the null hypothesis and support the alternative.
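The arithmetic behind that p-value can be sketched as follows; σ = 65 and n = 400 are assumptions chosen so that the standard error matches the quoted result, not values taken from the original text:
from scipy.stats import norm
x_bar, mu = 178, 170
sigma, n = 65, 400                     # assumed population standard deviation and sample size
z = (x_bar - mu) / (sigma / n ** 0.5)  # standardized test statistic
p_value = 1 - norm.cdf(z)              # P(Z > 2.46)
print(round(z, 2), round(p_value, 4))  # 2.46 0.0069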
The p-value of a test provides valuable information because it is a measure of the amount of statistical evidence that supports the alternative hypothesis.
How small the p-value must be to infer that the alternative hypothesis is true depends on a number of factors, including the costs of making Type I and Type II errors. If the cost of a Type I error is high, researchers attempt to minimize its probability. In the rejection region method, they do so by setting the significance level quite low (e.g., 1%). Using the p-value method, researchers require the p-value to be quite small before treating it as sufficient evidence to reject the null hypothesis.
p-values can be translated using the following descriptive terms:
If the p-value is less than .01, researchers say that there is overwhelming evidence to infer that the alternative hypothesis is true. It can also be said that the test is highly significant.
If the p-value lies between .01 and .05, there is strong evidence to infer that the alternative hypothesis is true. The result is deemed to be significant.
If the p-value is between .05 and .10, researchers say that there is weak evidence to indicate that the alternative hypothesis is true. When the p-value is greater than 5%, it is said that the result is not statistically significant.
When the p-value exceeds .10, it is said that there is no evidence to infer that the alternative hypothesis is true.
The p-value can be used to make the same type of decisions made in the rejection region method. The rejection region method requires the decision maker to select a significance level from which the rejection region is constructed. He then decides to reject or not reject the null hypothesis. Another way of making that type of decision is to compare the p-value with the selected value of the significance level. If the p-value is less than α, the p-value is judged to be small enough to reject the null hypothesis. If the p-value is greater than α, the null hypothesis cannot be rejected.
Conclusion of a Test of Hypothesis
If a researcher rejects the null hypothesis, he or she concludes that there is enough statistical evidence to infer that the alternative hypothesis is true.
If a researcher does not reject the null hypothesis, he or she concludes that there is not enough statistical evidence to infer that the alternative hypothesis is true.
In one-tail tests, the rejection region is located in only one tail of the sampling distribution. The p-value is also computed by finding the area in one tail of the sampling distribution.
A two-tail test is conducted whenever the alternative hypothesis specifies that the mean is not equal to the value stated in the null hypothesis, that is, when the hypotheses assume the following form:
H0: μ = μ0
H1: μ ≠ μ0
There are two one-tail tests. Researchers conduct a one-tail test that focuses on the right tail of the sampling distribution whenever they want to know whether there is enough evidence to infer that the mean is greater than the quantity specified by the null hypothesis, that is, when the hypotheses are:
H0: μ = μ0
H1: μ > μ0
The second one-tail test involves the left tail of the sampling distribution. It is used when the researcher wants to determine whether there is enough evidence to infer that the mean is less than the value of the mean stated in the null hypothesis. The resulting hypotheses appear in this form:
H0: μ = μ0
H1: μ < μ0
The test statistic and the confidence interval estimator are both derived from the sampling distribution. The confidence interval estimator x̄ ± zα/2 σ/√n can be used to test hypotheses.
This process is equivalent to the rejection region approach. However, instead of finding the critical values of the rejection region and determining whether the test statistic falls into the rejection region, researchers compute the interval estimate and determine whether the hypothesized value of the mean falls into the interval.
The test of hypothesis is based on the sampling distribution of the sample statistic. The result of a test of hypothesis is a probability statement about the sample statistic. It is assumed that the population mean is specified by the null hypothesis. Researchers then compute the test statistic and determine how likely it is to observe this large (or small) a value when the null hypothesis is true. If the probability is small, it can be concluded that the assumption that the null hypothesis is true is unfounded and it is rejected.
When researchers calculate the value of the test statistic, they are also measuring the difference between the sample statistic x̄ and the hypothesized value of the parameter μ in terms of the standard error σ/√n. For instance, if the value of the test statistic is z = 1.19, the sample mean is 1.19 standard errors above the hypothesized value of μ. The standard normal probability table shows that this value is not considered unlikely. As a result, the null hypothesis should not be rejected.
J.3 Probability of type II error
A Type II error occurs when a false null hypothesis is not rejected. If x̄ is less than the critical value of the rejection region, the null hypothesis is not rejected. The probability of a Type II error is defined as:
β = P(x̄ < x̄L [the critical value of the rejection region], given that the null hypothesis is false)
The condition that the null hypothesis is false means only that the mean is not equal to the hypothesized value. If a researcher wants to compute β, he needs to specify a value for μ. To calculate the probability of a Type II error, researchers have to express the rejection region in terms of the unstandardized test statistic x̄, and they have to specify a value for μ other than the one shown in the null hypothesis.
Effect of changing α on β
By decreasing the significance level from 5% to 1%, a researcher shifts the critical value of the rejection region to the right and thus enlarges the region where the null hypothesis is not rejected. The probability of a Type II error increases. This illustrates the inverse relationship between the probabilities of Type I and Type II errors: if a researcher wants to decrease the probability of a Type I error (by specifying a small value of α), he increases the probability of a Type II error. In applications where the cost of a Type I error is considerably larger than the cost of a Type II error, this is appropriate.
In fact, a significance level of 1% or less is probably justified. However, when the cost of a Type II error is relatively large, a significance level of 5% or more may be appropriate.
A statistical test of hypothesis is effectively defined by the significance level and the sample size (both selected by the researcher). A researcher can judge how well the test functions by calculating the probability of a Type II error at some value of the parameter.
If he believes that the cost of a Type II error is high and thus that the probability is too large, he has two ways to reduce the probability. He can either increase the value of α (however, this would result in an increase in the chance of making a Type I error) or increase the sample size. By increasing the sample size, a researcher reduces the probability of a Type II error and therefore makes this type of error less frequently. For this reason, larger sample sizes allow researchers to make better decisions in the long run.
Another way of expressing how well a test performs is to report its power: the probability of its leading a researcher to reject the null hypothesis when it is false. Thus, the power of a test is 1 − β.
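A sketch of how β and power are computed for a right-tail test of the mean (every number below is a hypothetical choice, not from the original text):
from scipy.stats import norm
mu0, mu1 = 100, 103        # hypothesized mean and an assumed true mean
sigma, n, alpha = 10, 50, 0.05
se = sigma / n ** 0.5
x_crit = mu0 + norm.ppf(1 - alpha) * se    # critical value of the rejection region
beta = norm.cdf((x_crit - mu1) / se)       # P(not rejecting H0 | μ = mu1)
print(round(beta, 3), round(1 - beta, 3))  # β and power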
When more than one test can be performed in a given situation, it is preferred to use the test that is correct more frequently. If (given the same alternative hypothesis, sample size, and significance level) one test has a higher power than a second test, the first test is said to be more powerful.
J.4 The road ahead
In the chapters that follow, the different statistical techniques employed by statistics practitioners are presented. The real challenge of the subject lies in being able to define the problem and identify which statistical method is the most appropriate one to use. Every statistical method has a specific objective; five are addressed in this book: describing a population, comparing two populations, comparing two or more populations, analyzing the relationship between two variables, and analyzing the relationship among two or more variables.
CHAPTER K: HOW TO MAKE INFERENCES ABOUT A POPULATION
K.1 Inference about a population mean
If the population mean is unknown, so is the population standard deviation. Consequently, the previous sampling distribution cannot be used. Instead, researchers substitute the sample standard deviation s in place of the unknown population standard deviation σ. The result is called a t-statistic. It has been shown that the t-statistic, defined as
t = (x̄ − μ)/(s/√n)
is Student t distributed when the sampled population is normal.
Test statistic for μ when σ is unknown
When the population standard deviation is unknown and the population is normal, the test statistic for testing hypotheses about μ is
t = (x̄ − μ)/(s/√n)
which is Student t distributed with ν = n − 1 degrees of freedom.
Confidence interval estimator of μ when σ is unknown
x̄ ± tα/2 s/√n, with ν = n − 1
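A sketch of the interval with made-up data (the sample values below are hypothetical; scipy is assumed to be available for the t quantile):
from statistics import mean, stdev
from scipy.stats import t
data = [12.1, 9.8, 11.4, 10.3, 12.6, 9.5, 11.0, 10.8]  # hypothetical sample
n = len(data)
x_bar, s = mean(data), stdev(data)   # stdev uses the n − 1 denominator
t_crit = t.ppf(0.975, df=n - 1)      # tα/2 for 95% confidence, ν = n − 1
half = t_crit * s / n ** 0.5
print(round(x_bar - half, 2), round(x_bar + half, 2))  # LCL, UCL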
The t-statistic is Student t distributed if the population from which a researcher had sampled is normal. However, statisticians have shown that if the population is nonnormal, the results of the t-test and confidence interval estimate are still valid provided that the population is not extremely nonnormal. To check this requirement, a researcher has to draw the histogram and determine whether it is far from bell shaped.
When the population is small, the test statistic and interval estimator have to be adjusted using the finite population correction factor. However, in populations that are large relative to the sample size, the correction factor can be ignored. Large populations are defined as populations that are at least 20 times the sample size.
Finite populations allow researchers to use the confidence interval estimator of a mean to produce a confidence interval estimator of the population total. To estimate the total, it is necessary to multiply the lower and upper confidence limits of the estimate of the mean by the population size. Thus, the confidence interval estimator of the total is:
N(x̄ ± tα/2 s/√n)
The Student t distribution is based on using the sample variance to estimate the unknown population variance. The sample variance is defined as
s² = Σ(xi − x̄)² / (n − 1)
To compute s², a researcher must first determine x̄. Sampling distributions are derived by repeated sampling from the same population.
To repeatedly take samples to compute s², he or she can choose any numbers for the first n − 1 observations in the sample. However, he or she has no choice for the nth value, because the sample mean must be calculated first. Let’s assume that n = 3 and that x̄ = 10. X1 and X2 can assume any values without restriction. However, X3 must be such that x̄ = 10. For instance, if X1 = 5 and X2 = 11, then X3 must equal 14. Therefore, there are only 2 degrees of freedom in this selection of the sample. It is said that 1 degree of freedom was lost because x̄ had to be calculated. Note: the denominator in the calculation of s² is equal to the number of degrees of freedom.
The t-statistic, like the z-statistic, measures the difference between the sample mean x̄ and the hypothesized value of μ in terms of the number of standard errors. However, when the population standard deviation σ is unknown, the standard error is estimated by s/√n.
The t-statistic has two variables: the sample mean x̄ and the sample standard deviation s, both of which vary from sample to sample. Because of the greater uncertainty, the t-statistic displays greater variability.
K.2 Inference about a population variance
The estimator of σ² is the sample variance; that is, s² is an unbiased, consistent estimator of σ². It has been shown that the sum of squared deviations from the mean, Σ(xi − x̄)² [which is equal to (n − 1)s²], divided by the population variance is chi-squared distributed with ν = n − 1 degrees of freedom, provided that the sampled population is normal. The statistic
χ² = (n − 1)s²/σ²
is called the chi-squared statistic (χ²-statistic).
The formula that describes the sampling distribution is the formula of the test statistic. The test statistic used to test hypotheses about σ 2 is
χ² = (n − 1)s²/σ²
which is chi-squared distributed with ν = n − 1 degrees of freedom when the population random variable is normally distributed with variance equal to σ².
Confidence interval estimator of σ 2
Lower confidence limit (LCL) = (n − 1)s²/χ²α/2
Upper confidence limit (UCL) = (n − 1)s²/χ²1−α/2
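A sketch of these two limits (n, s², and the confidence level are assumed values):
from scipy.stats import chi2
n, s2, alpha = 25, 4.0, 0.05
lcl = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)  # divide by χ²α/2
ucl = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)      # divide by χ²1−α/2
print(round(lcl, 2), round(ucl, 2))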
Like the t-test and estimator of μ, the chi-squared test and estimator of σ² theoretically require that the sampled population be normal.
In practice, however, the technique is valid as long as the population is not extremely nonnormal. The extent of nonnormality can be determined by drawing the histogram.
K.3 Inference about a population proportion
The statistic used to estimate and test the population proportion is the sample proportion defined as
p̂ = x/n
where x is the number of successes in the sample and n is the sample size. The sampling distribution of p̂ is approximately normal with mean p and standard deviation √(p(1 − p)/n), provided that np and n(1 − p) are greater than 5. This sampling distribution is expressed as
z = (p̂ − p) / √(p(1 − p)/n)
The same formula also represents the test statistic.
Test statistic for p:
z = (p̂ − p) / √(p(1 − p)/n)
which is approximately normal for np and n(1 − p) greater than 5.
Confidence interval estimator of p:
p̂ ± zα/2 √(p̂(1 − p̂)/n)
which is valid provided that np̂ and n(1 − p̂) are greater than 5.
To produce the confidence interval estimator of the total, a researcher has to multiply the lower and upper confidence limits of the interval estimator of the proportion of successes by the population size. The confidence interval estimator of the total number of successes in a large finite population is
N(p̂ ± zα/2 √(p̂(1 − p̂)/n))
Sample size to estimate a proportion:
n = (zα/2 √(p̂(1 − p̂)) / B)²
To solve for n, a researcher has to know p̂. Unfortunately, this value is unknown, because the sample has not yet been taken. Two methods can be used to solve for n:
If a researcher has no knowledge of even the approximate value of p̂, he should let p̂ = .5, which is chosen because the product p̂(1 − p̂) reaches its maximum value at p̂ = .5.
This, in turn, results in a conservative value of n; as a result, the confidence interval will be no wider than planned. If, when the sample is drawn, p̂ does not equal .5, the confidence interval estimate will be better (that is, narrower) than planned. If it turns out that p̂ = .5, the interval estimate is as planned; if not, the interval estimate will be narrower.
If a researcher has some idea about the value of p̂, he can use that quantity to determine n. For instance, if he believes that p̂ will turn out to be approximately .3, he can use that value to solve for n. Note: this produces a smaller value of n (thus reducing sampling costs) than does the previous method. If p̂ actually lies between .3 and .7, however, the estimate will not be as good as wanted, because the interval will be wider than desired.
Wilson Estimators
When the confidence interval estimator of a proportion is applied in situations where success is a relatively rare event, it is possible to find no successes, especially if the sample size is small. To illustrate, let’s assume a sample of 100 produced x = 0, which means that p̂ = 0. The 95% confidence interval estimator of the proportion of successes in the population becomes
p̂ ± zα/2 √(p̂(1 − p̂)/n) = 0 ± 1.96 √(0(1 − 0)/100) = 0 ± 0
This implies that if a researcher does not find any successes in the sample, then there is no chance of finding a success in the population. Drawing such a conclusion from virtually any sample size is unacceptable. The Wilson estimate, denoted p̃ (pronounced p-tilde), is computed by adding 2 to the number of successes in the sample and 4 to the sample size.
Confidence interval estimator of p using the Wilson estimate:
p̃ ± zα/2 √(p̃(1 − p̃)/(n + 4)), where p̃ = (x + 2)/(n + 4)
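A sketch for the x = 0 example above shows how the Wilson estimate avoids the degenerate 0 ± 0 interval:
import math
x, n, z = 0, 100, 1.96
p_tilde = (x + 2) / (n + 4)                        # Wilson estimate ≈ .0192
se = math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
print(round(p_tilde - z * se, 4), round(p_tilde + z * se, 4))  # no longer 0 ± 0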
CHAPTER L: HOW TO MAKE INFERENCES ABOUT COMPARING TWO POPULATIONS
L.1 Inference about the difference between two means: independent samples
In order to test and estimate the difference between two population means, researchers draw random samples from each of two populations. Independent samples are defined as samples completely unrelated to one another. A researcher draws a sample of size n1 from population 1 and a sample of size n2 from population 2. For each sample, he or she computes the sample mean and sample variance.
The best estimator of the difference between two population means, μ1 − μ2, is the difference between two sample means, x̄1 − x̄2.
Sampling Distribution of x̄1 − x̄2
x̄1 − x̄2 is normally distributed if the populations are normal, and approximately normally distributed if the populations are nonnormal and the sample sizes are large.
The expected value of x̄1 − x̄2 is
μ1 − μ2
The variance of x̄1 − x̄2 is
σ1²/n1 + σ2²/n2
The standard error of x̄1 − x̄2 is
√(σ1²/n1 + σ2²/n2)
Thus,
z = ((x̄1 − x̄2) − (μ1 − μ2)) / √(σ1²/n1 + σ2²/n2)
is a standard normal (or approximately normal) random variable. Consequently, the test statistic is
z = ((x̄1 − x̄2) − (μ1 − μ2)) / √(σ1²/n1 + σ2²/n2)
And the interval estimator is
(x̄1 − x̄2) ± zα/2 √(σ1²/n1 + σ2²/n2)
However, these formulas are rarely used because the population variances σ1² and σ2² are virtually always unknown. Consequently, it is necessary to estimate the standard error of the sampling distribution.
The way to do this depends on whether the two unknown population variances are equal. When they are equal, the test statistic is defined in the following way.
Test statistic for μ1 − μ2 when σ1² = σ2²:
t = ((x̄1 − x̄2) − (μ1 − μ2)) / √(sp²(1/n1 + 1/n2)), with ν = n1 + n2 − 2
where
sp² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)
The quantity sp² is called the pooled variance estimator. It is the weighted average of the two sample variances, with the number of degrees of freedom in each sample used as weights. The requirement that the population variances be equal makes this calculation feasible, because only one estimate of the common value of σ1² and σ2² is needed.
The test statistic is Student t distributed with n1 + n2 – 2 degrees of freedom, provided that the two populations are normal.
Confidence interval estimator of μ1 − μ2 when σ1² = σ2²:
(x̄1 − x̄2) ± tα/2 √(sp²(1/n1 + 1/n2)), with ν = n1 + n2 − 2
These formulas are referred to as the equal-variances test statistic and confidence interval estimator, respectively.
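A sketch computing the pooled-variance statistic from summary data (all six summary numbers are hypothetical):
import math
n1, x1, s1 = 25, 63.7, 5.1   # sample 1: size, mean, standard deviation
n2, x2, s2 = 25, 60.2, 4.8   # sample 2: size, mean, standard deviation
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
t = (x1 - x2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))           # H0: μ1 − μ2 = 0
print(round(t, 2), n1 + n2 - 2)                              # statistic and ν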
When the population variances are unequal, researchers cannot use the pooled variance estimate. Instead, they estimate each population variance with its sample variance.
Unfortunately, the sampling distribution of the resulting statistic
t = ((x̄1 − x̄2) − (μ1 − μ2)) / √(s1²/n1 + s2²/n2)
is neither normally nor Student t distributed. However, it can be approximated by a Student t distribution with degrees of freedom equal to
ν = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1)]
The test statistic and confidence interval estimator are easily derived from the sampling distribution.
Test statistic for μ1 − μ2 when σ1² ≠ σ2²:
t = ((x̄1 − x̄2) − (μ1 − μ2)) / √(s1²/n1 + s2²/n2), with ν as given above
Confidence interval estimator of μ1 − μ2 when σ1² ≠ σ2²:
(x̄1 − x̄2) ± tα/2 √(s1²/n1 + s2²/n2), with ν as given above
These formulas are referred to as the unequal-variances test statistic and confidence interval estimator, respectively.
Since σ1² and σ2² are unknown, a researcher cannot know for certain whether they are equal. However, he or she can perform a statistical test to determine whether there is evidence to infer that the population variances differ: the F-test of the ratio of two variances.
Testing the population variances
The hypotheses to be tested are
H0: σ1²/σ2² = 1
H1: σ1²/σ2² ≠ 1
The test statistic is the ratio of the sample variances, s1²/s2², which is F-distributed with degrees of freedom ν1 = n1 − 1 and ν2 = n2 − 1.
The required condition for the F-test is the same as that for the t-test of μ1 − μ2, which is that both populations are normally distributed. This is a two-tail test, so the rejection region is
F > Fα/2,ν1,ν2 or F < F1−α/2,ν1,ν2
Thus, a researcher will reject the null hypothesis that states that the population variances are equal when the ratio of the sample variances is large or if it is small. Table 6 in Appendix B lists the critical values of the F distribution and defines “large” and “small.”
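A sketch of the two-tail F-test from sample variances (the inputs are hypothetical; scipy supplies the F distribution):
from scipy.stats import f
n1, n2 = 30, 30
s1_sq, s2_sq = 6.5, 3.9            # assumed sample variances
F = s1_sq / s2_sq
df1, df2 = n1 - 1, n2 - 1
p = 2 * min(f.cdf(F, df1, df2), 1 - f.cdf(F, df1, df2))  # two-tail p-value
print(round(F, 2), round(p, 3))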
L.2 Observational and experimental data
A researcher can never have enough statistical evidence to conclude that the null hypothesis is true. This means that he or she can only determine whether there is enough evidence to infer that the population variances differ. Accordingly, a researcher adopts the following rule: he will use the equal-variances test statistic and confidence interval estimator unless there is evidence (based on the F-test of the population variances) to indicate that the population variances are unequal, in which case he will apply the unequal-variances test statistic and confidence interval estimator.
Both the equal-variances and unequal-variances techniques require that the populations are normally distributed. As before, a researcher can check to see whether the requirement is satisfied by drawing the histograms of the data.
When the normality requirement is unsatisfied, a researcher can use a nonparametric technique – the Wilcoxon rank sum test – to replace the equal-variances test of μ1 − μ2. There is no alternative to the unequal-variances test of μ1 − μ2 when the populations are very nonnormal.
The value of the test statistic is the difference between the statistic x̄1 − x̄2 and the hypothesized value of the parameter μ1 − μ2, measured in terms of the standard error.
As was the case with the interval estimator of p, the standard error must be estimated from the data for all inferential procedures introduced. The method used to compute the standard error of x̄1 − x̄2 depends on whether the population variances are equal. When they are equal, researchers calculate and use the pooled variance estimator sp². Thus, where possible, it is advantageous to pool sample data to estimate the standard error, because sp² is a better estimator of the common variance than either s1² or s2² separately. When the two population variances are unequal, a researcher cannot pool the data and produce a common estimator – he must compute s1² and s2² and use them to estimate σ1² and σ2², respectively.
L.3 Inference about the difference between two means: matched pairs
An experiment may be designed in such a way that each observation in one sample is matched with an observation in the other sample. The matching is conducted by selecting, for instance, economics and marketing majors with similar GPAs and comparing the salary offers within each pair. This type of experiment is called matched pairs. In such an experimental design, the parameter of interest is the mean of the population of differences, which is labeled μD.
Note: μD does in fact equal μ1 − μ2, but researchers test μD because of the way the experiment is designed. Therefore, the hypotheses to be tested are:
H0: μD = 0
H1: μD > 0
Test statistic for μD:
t = (x̄D − μD) / (sD/√nD)
which is Student t distributed with ν = nD − 1 degrees of freedom, provided that the differences are normally distributed.
The confidence interval estimator of μD is derived using the usual form for the confidence interval.
Confidence interval estimator of μD:
x̄D ± tα/2 sD/√nD
The validity of the results of the t-test and estimator of μD depends on the normality of the differences (or large enough sample sizes). For instance, the histogram of the differences can be positively skewed, but not so much that the normality requirement is violated.
If the differences are very nonnormal, the t-test of μD cannot be used. Researchers can, however, use a nonparametric technique – the Wilcoxon signed rank sum test for matched pairs.
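A sketch of the matched pairs computation on hypothetical paired data, following the test statistic above:
from statistics import mean, stdev
before = [8.2, 7.9, 9.1, 8.8, 7.5, 8.4]   # hypothetical paired observations
after = [7.8, 7.7, 8.6, 8.9, 7.1, 8.0]
d = [b - a for b, a in zip(before, after)]  # differences
n_d = len(d)
t = mean(d) / (stdev(d) / n_d ** 0.5)       # H0: μD = 0
print(round(t, 2), n_d - 1)                 # statistic and ν = nD − 1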
Two of the most important principles in statistics are:
The concept of analyzing sources of variation. For instance, by reducing the variation between salary offers in each sample, a researcher is able to detect a real difference between the two majors. This is an application of the more general procedure of analyzing data and attributing some fraction of the variation to several sources. A technique called the analysis of variance analyzes sources of variation in an attempt to detect real differences. In most applications of this procedure, researchers are interested in each source of variation and not simply in reducing one source. The process is referred to as explaining the variation.
Researchers can design data-gathering procedures in such a way that they can analyze sources of variation. The experiment can be organized so that the effects of those differences are mostly eliminated. It is also possible to design experiments that allow for easy detection of real differences and minimize the costs of data gathering.
Researchers make inferences about the ratio of two population variances because the sampling distribution is based on ratios rather than differences.
Two population variances are compared by determining their ratio; thus, the parameter is σ1²/σ2². The sample variance is an unbiased and consistent estimator of the population variance. The estimator of the parameter σ1²/σ2² is the ratio of the two sample variances, s1²/s2², drawn from their respective populations. The sampling distribution of s1²/s2² is said to be F distributed provided that researchers have independently sampled from two normal populations. It has been shown that the ratio of two independent chi-squared variables divided by their degrees of freedom is F distributed.
The degrees of freedom of the F distribution are identical to the degrees of freedom of the two chi-squared distributions; (n − 1)s²/σ² is chi-squared distributed, provided that the sampled population is normal.
CHAPTER M: STATISTICAL TECHNIQUES INVOLVING NOMINAL DATA
M.1 Chi-squared goodness-of-fit test
A multinomial experiment is an extension of the binomial experiment, in which there are two or more possible outcomes per trial. A multinomial experiment is one possessing the following characteristics:
The experiment consists of a fixed number n of trials.
The outcome of each trial can be classified into one of k categories, called cells.
The probability pi that the outcome will fall into cell i remains constant for each trial. Moreover, p1 + p2 + ... + pk = 1.
Each trial of the experiment is independent of the other trials.
When k = 2, the multinomial experiment is identical to the binomial experiment. In a binomial experiment, researchers count the number of successes (labeled x) and failures. In a multinomial experiment, researchers count the number of outcomes falling into each of the k cells. Thus, they obtain a set of observed frequencies f1, f2, ..., fk, where fi is the observed frequency of outcomes falling into cell i, for i = 1, 2, ..., k. Because the experiment consists of n trials and every outcome must fall into some cell,
f1 + f2 + ... + fk = n
Just as the number of successes x was used to draw inferences about p (by calculating the sample proportion p̂ = x/n), the observed frequencies are used to draw inferences about the cell probabilities.
If the data are nominal and a researcher is interested in the proportions of all categories, the experiment is recognized as a multinomial experiment, and the technique is identified as the chi-squared goodness-of-fit test. Because a researcher wants to know whether the values of each category have changed, he needs to specify the initial values in the null hypothesis, e.g., H0: p1 = .45, p2 = .40, p3 = .15.
The alternative hypothesis states: H1: At least one pi is not equal to its specified value.
In general, the expected frequency for each cell is given by
ei = npi
This expression is derived from the formula for the expected value of a binomial random variable.
If the expected frequencies ei and the observed frequencies fi are quite different, it can be concluded that the null hypothesis is false and should be rejected. However, if the expected and observed frequencies are similar, the null hypothesis should not be rejected. The following test statistic measures the similarity of the expected and observed frequencies.
Chi-squared Goodness-of-fit Test:
χ² = Σ (fi − ei)² / ei
The sampling distribution of the test statistic is approximately chi-squared distributed with v = k – 1 degrees of freedom, provided that the sample size is large.
When the null hypothesis is true, the observed and expected frequencies should be similar, in which case the test statistic will be small. Thus, a small test statistic supports the null hypothesis. If the null hypothesis is untrue, some of the observed and expected frequencies will differ and the test statistic will be large. Consequently, the null hypothesis should be rejected when χ² is greater than χ²α,k−1. That is, the rejection region is
χ² > χ²α,k−1
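A sketch of the goodness-of-fit computation for the hypothesized proportions above (the observed frequencies are invented for illustration; scipy.stats.chisquare does the arithmetic):
from scipy.stats import chisquare
observed = [102, 82, 16]                          # hypothetical frequencies, n = 200
expected = [200 * p for p in (0.45, 0.40, 0.15)]  # e_i = n p_i
stat, p_value = chisquare(observed, f_exp=expected)
print(round(stat, 2), round(p_value, 4))          # compare χ² with χ²α,k−1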
The actual sampling distribution of the test statistic defined previously is discrete, but it can be approximated by the chi-squared distribution provided that the sample size is large. This requirement is similar to the one imposed when the normal approximation to the binomial was used in the sampling distribution of a proportion. In that approximation, np and n(1 − p) had to be 5 or more.
A similar rule is imposed for the chi-squared test statistic. It is called the rule of five, which states that the sample size must be large enough so that the expected value for each cell is 5 or more. Where necessary, cells should be combined to satisfy this condition.
M.2 Chi-squared tests of a contingency table
The chi-squared test of a contingency table is used to determine whether there is enough evidence to infer that two nominal variables are related and to infer that difference exists between two or more populations of nominal variables.
The test statistic is the same as the one used to test proportions in the goodness-of-fit-test. That is, the test statistic is
χ² = Σ (fi − ei)² / ei
where k is the number of cells in the cross-classification table. Note: in the goodness-of-fit test, the null hypothesis lists values for the probabilities pi. The null hypothesis for the chi-squared test of a contingency table states only that the two variables are independent. However, the probabilities are needed to compute the expected values ei, which in turn are needed to calculate the value of the test statistic. The probabilities must come from the data after it is assumed that the null hypothesis is true.
The expected frequency of the cell in row i and column j is
eij = (row i total × column j total) / sample size
In order to determine the rejection region, a researcher must know the number of degrees of freedom associated with the chi-squared statistic. The number of degrees of freedom for a contingency table with r rows and c columns is ν = (r − 1)(c − 1).
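A sketch of the contingency-table test on invented counts; scipy.stats.chi2_contingency computes the expected frequencies and degrees of freedom exactly as described above:
from scipy.stats import chi2_contingency
table = [[40, 30, 30],   # hypothetical cross-classification counts
         [20, 35, 45]]
stat, p_value, df, expected = chi2_contingency(table, correction=False)
print(round(stat, 2), df, round(p_value, 4))   # df = (r − 1)(c − 1) = 2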
CHAPTER N: REGRESSION AND CORRELATION
Regression analysis is used to predict the value of one variable on the basis of other variables. The technique involves developing a mathematical equation or model that describes the relationship between the variable to be forecast, which is called the dependent variable, and variables that the researcher believes are related to it. The dependent variable is labeled Y, while the related variables are called independent variables and are labeled X1, X2, ..., Xk (where k is the number of independent variables).
N.1 Model
Some of the mathematical models related to the statistical concepts are:
Deterministic models, which allow researchers to determine the value of the dependent variable (on the left side of the equation) from the values of the independent variables.
What must be included in most practical models is a method to represent the randomness that is part of a real-life process. Such a model is called a probabilistic model.
The first-order linear model (or the simple linear regression model) is a straight-line model with one independent variable.
First-order linear model
y = β0 + β1x + ε
where
y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope of the line (defined as rise/run)
ε = error variable
The problem objective addressed by the model is to analyze the relationship between two variables, X and Y, both of which must be interval. To define the relationship between X and Y, researchers need to know the values of the coefficients β0 and β1. However, these coefficients are population parameters, which are almost always unknown.
The parameters β0 and β1 are estimated in a way similar to the methods used to estimate the other parameters discussed so far. A researcher has to draw a random sample from the population of interest and calculate the sample statistics needed. However, because β0 and β1 represent the coefficients of a straight line, their estimators are based on drawing a straight line through the sample data. The straight line which is used to estimate β0 and β1 is the “best” straight line, best in the sense that it comes closest to the sample data points.
N.2 Estimating the coefficients
This best straight line, called the least squares line, is derived from calculus and is represented by the following equation:
ŷ = b0 + b1x
where b0 is the y-intercept, b1 is the slope, and ŷ is the predicted or fitted value of y.
The least squares method produces a straight line that minimizes the sum of the squared differences between the points and the line. The coefficients b0 and b1 are calculated so that the sum of squared deviations
Σ(yi − ŷi)²
is minimized. That is, the values of ŷ come closest to the observed values of y.
Least squares line coefficients
b1 = sxy / sx²
b0 = ȳ − b1x̄
where
sxy = Σ(xi − x̄)(yi − ȳ) / (n − 1)
sx² = Σ(xi − x̄)² / (n − 1)
x̄ = Σxi / n
ȳ = Σyi / n
A shortcut method to manually calculate the slope coefficient b1 = sxy / sx² uses:
sxy = [Σxiyi − (Σxi)(Σyi)/n] / (n − 1)
sx² = [Σxi² − (Σxi)²/n] / (n − 1)
It has been shown that b0 and b1 are unbiased estimators of β0 and β1, respectively.
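A sketch of the coefficient formulas on a small invented data set:
from statistics import mean
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 5.8]   # hypothetical observations
n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
b1 = s_xy / s_x2            # slope
b0 = y_bar - b1 * x_bar     # y-intercept
print(round(b1, 3), round(b0, 3))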
CHAPTER O: NONPARAMETRIC STATISTICS
O.1 Wilcoxon rank sum test
The Wilcoxon rank sum test is used, for example, to determine whether observations from two populations allow the researcher to conclude that the location of population 1 is to the left of the location of population 2. The Wilcoxon rank sum test has the following characteristics:
The problem objective is to compare two populations.
The data are either ordinal, or interval where the normality requirement necessary to perform the equal-variances t-test of μ1 − μ2 is unsatisfied.
The samples are independent.
Suppose that we have 2 samples and we want to test the following hypotheses:
H0: The two population locations are the same
H1: The location of population 1 is to the left of the location of population 2
The first step is to rank all the observations (in the case of two samples with three observations each: rank 1 goes to the smallest observation and rank 6 to the largest). The second step is to calculate the sum of the ranks of each sample. The rank sum of sample 1 is denoted T1, and the rank sum of sample 2 is denoted T2. T1 is arbitrarily selected as the test statistic and labelled T (T = T1). A small value of T indicates that most of the smaller observations are in sample 1 and that most of the larger observations are in sample 2. This would imply that the location of population 1 is to the left of the location of population 2.
If the null hypothesis is true and the two populations are identical, then it follows that each possible ranking is equally likely. We are trying to determine whether the value of the test statistic is small enough to reject the null hypothesis at the 5% significance level. Statisticians have generated the sampling distribution of T for various combinations of sample sizes. The critical values are provided in Table 9 in Appendix B.
When the sample sizes are larger than 10, the test statistic is approximately normally distributed with mean E(T) and standard deviation σT, where
E(T) = n1(n1 + n2 + 1) / 2
and
σT = √(n1 n2 (n1 + n2 + 1) / 12)
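A sketch of the rank sum and its normal approximation on invented samples of size 11 (large enough for the approximation above); scipy’s rankdata handles ties by averaging ranks:
import math
from scipy.stats import rankdata
sample1 = [22, 23, 20, 18, 25, 19, 27, 24, 28, 21, 30]  # hypothetical data
sample2 = [26, 29, 31, 33, 25, 34, 32, 35, 36, 28, 37]
n1, n2 = len(sample1), len(sample2)
ranks = rankdata(sample1 + sample2)  # ranks of the pooled observations
T = ranks[:n1].sum()                 # rank sum of sample 1
E_T = n1 * (n1 + n2 + 1) / 2
sigma_T = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (T - E_T) / sigma_T
print(T, round(z, 2))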
Source
- This summary is based on the 2013-2014 academic year.