This summary is based on the 2013-2014 academic year.
CHAPTER A: BASICS OF STATISTICS
Statistics is a way to get information from data. There are two main branches of statistics:
Descriptive statistics, which are concerned with methods of organizing, summarizing, and presenting data in a convenient and informative way. Descriptive statistics make use of graphical and numerical techniques to summarize and present data in a clear way. The actual technique used depends on what specific information needs to be extracted.
Inferential statistics, which is a body of methods used to draw conclusions or inferences about characteristics of a population based on sample data (although a sample that is only a small fraction of the size of the population can lead to correct inferences only a certain percentage of the time).
Statistical inference problems involve three key concepts:
A population is the group of all items of interest to a researcher (note: population does not necessarily refer to a group of people). It is frequently very large and may, in fact, be infinitely large. A descriptive measure of a population is called a parameter. In most applications of inferential statistics the parameter represents the information which is needed.
A sample is a set of data drawn from the population. A descriptive measure of a sample is called a statistic. Statistics are used to make inferences about parameters.
Statistical inference is the process of making an estimate, prediction, or decision about a population based on sample data. In statistical inference there are two measures of reliability:
the confidence level, which is the proportion of times that an estimating procedure will be correct; and
the significance level, which measures how frequently the conclusion will be wrong in the long run.
Some basic terms related to the concept of data:
A variable is some characteristic of a population or sample. The name of the variable is usually represented using upper case letters such as X, Y, and Z.
The values of the variable are the possible observations of the variable.
Data are the observed values of a variable.
There are three types of data:
Interval data are real numbers (for instance, incomes and distances). This type of data is also referred to as quantitative or numerical.
The values of nominal data are categories. For instance, answers to questions about marital status produce nominal data. The values are not numbers but instead are words describing the categories. Nominal data are also called qualitative or categorical.
Ordinal data appear to be nominal, but their values are in order. Because the only constraint that is imposed on the choice of codes is that the order must be maintained, any set of codes that are in order can be used.
The critical difference between those three types of data is that the intervals or differences between values of interval data are consistent and meaningful.
For instance, the difference between grades of 10 and 8 is the same two-grade difference that exists between 8 and 6. Thus, a researcher can calculate the difference and interpret the results. Because the codes representing ordinal data are arbitrarily assigned except for the order, a researcher cannot calculate and interpret differences.
All calculations are permitted on interval data. A set of interval data is often described by calculating the average. No calculations can be performed on the codes of nominal data, because these codes are completely arbitrary. Thus, calculations based on the codes used to store nominal data are meaningless. All that a researcher can do with nominal data is count the occurrences of each category. The only permissible calculations on ordinal data are ones involving a ranking process.
The data types can be placed in order of the permissible calculations. At the top of the list there is the interval data type (because virtually all computations are allowed). At the bottom of the list there is the nominal data type (because no calculations other than determining frequencies are permitted). In between interval and nominal data lies the ordinal data type. Note: higher-level data types may be treated as lower-level ones. For instance, in universities the grades in a course (interval data) can be converted to letter grades (ordinal data). Lower-level data types cannot be treated as higher-level types.
The variables whose observations constitute the data are given the same name as the type of data. Thus, for instance, nominal data are the observations of a nominal variable.
CHAPTER B: GRAPHICAL DESCRIPTIVE TECHNIQUES
The only allowable calculation on nominal data is to count the frequency of each value of the variable. The data can be summarized in a table, called a frequency distribution, that presents the categories and their counts. A relative frequency distribution lists the categories and the proportion with which each occurs. There are two graphical methods which can be used to present a picture of the data:
A bar chart, which is often used to display frequencies; and
a pie chart, which graphically shows relative frequencies.
A bar chart is created by drawing a rectangle representing each category. The height of the rectangle represents the frequency and its base is arbitrary. A pie chart is simply a circle subdivided into slices that represent the categories. It is drawn so that the size of each slice is proportional to the percentage corresponding to that category.
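The counting behind a frequency distribution and a relative frequency distribution can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the marital-status responses are hypothetical and are not taken from the text.

```python
from collections import Counter


def frequency_distribution(observations):
    """Return a dict mapping each category to the number of times it occurs."""
    return dict(Counter(observations))


def relative_frequency_distribution(observations):
    """Return a dict mapping each category to its proportion of all observations."""
    n = len(observations)
    return {category: count / n for category, count in Counter(observations).items()}


# Hypothetical survey answers, invented for illustration only.
responses = ["single", "married", "married", "divorced",
             "single", "married", "widowed", "married"]

print(frequency_distribution(responses))
print(relative_frequency_distribution(responses))
```

The counts are what a bar chart would display as heights, and the proportions are what a pie chart would display as slice sizes.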
There are several graphical methods which are used when the data are interval. The most important of these graphical methods is the histogram – it can be used to summarize interval data or to help explain an important aspect of probability.
A frequency distribution for interval data is created by counting the number of observations that fall into each of a series of intervals (classes) that cover the complete range of observations.
Although the frequency distribution provides information about how the numbers are distributed, the information is more easily understood and imparted by drawing a picture or graph. The graph is called a histogram. A histogram is created by drawing rectangles whose bases are the intervals and whose heights are the frequencies.
The number of class intervals selected depends entirely on the number of observations in the data set. The more observations available, the larger the number of class intervals needed to draw a useful histogram. For instance, for fewer than 50 observations, a researcher would normally create between 5 and 7 classes; for more than 50,000 observations, a researcher would normally use between 17 and 20 classes. (More detailed guidelines are presented in Table 2.6 on page 35.) Alternatively, a researcher can use Sturges' formula, which recommends that the number of class intervals be determined by the following (log denotes the base-10 logarithm):
Number of class intervals = 1 + 3.3 log(n)
For instance, if n = 100, number of class intervals = 1 + 3.3 log(100) = 1 + 3.3(2) = 7.6 (which is rounded to 8).
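The same calculation can be reproduced with a short Python sketch. The function name is hypothetical; the only thing taken from the text is the formula itself, with the logarithm taken to base 10 as the worked example implies.

```python
import math


def sturges_class_count(n):
    """Number of class intervals suggested by Sturges' formula, 1 + 3.3 log10(n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 + 3.3 * math.log10(n)


raw = sturges_class_count(100)
print(raw, round(raw))  # unrounded value and the rounded number of classes
```

The result is a guideline rather than a rule, so rounding to a nearby convenient number of classes is acceptable.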
The approximate width of the classes is determined by subtracting the smallest observation from the largest and dividing the difference by the number of classes. Thus,
Class width = (Largest observation – Smallest observation) / Number of classes
The result is often rounded to some convenient value. Consequently, the class limits are defined by selecting a lower limit for the first class from which all other limits are determined. The only condition to apply is that the first class interval must contain the smallest observation.
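A minimal sketch of these steps in Python follows. The data are hypothetical, and the code assumes that each class includes its lower limit and excludes its upper limit; the text does not fix this convention, so treat it as one possible choice.

```python
def class_width(observations, num_classes):
    """Approximate class width: (largest - smallest) / number of classes."""
    return (max(observations) - min(observations)) / num_classes


def interval_frequencies(observations, lower_limit, width, num_classes):
    """Count observations in each class [lower, lower + width), starting at lower_limit."""
    counts = [0] * num_classes
    for x in observations:
        index = int((x - lower_limit) // width)
        if not 0 <= index < num_classes:
            raise ValueError(f"observation {x} falls outside the classes")
        counts[index] += 1
    return counts


# Hypothetical observations, invented for illustration only.
data = [3, 7, 8, 12, 15, 15, 18, 21, 24, 29]

print(class_width(data, 3))                  # unrounded width, rounded up to 10 below
print(interval_frequencies(data, 0, 10, 3))  # classes [0, 10), [10, 20), [20, 30)
```

Here the unrounded width is rounded up to the convenient value 10, and the lower limit 0 is chosen so that the first class contains the smallest observation. The resulting counts are the heights of the histogram's rectangles.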
The shape of histograms is described on the basis of the following characteristics:
Symmetry - A histogram is said to be symmetric if, when a vertical line down the center of the histogram is drawn, the two sides are identical in shape and size.
Skewness - A skewed histogram is one with a long tail extending to either the right or the left. The one which extends to the right is called positively skewed, and the one which extends to the left is called negatively skewed.
Number of modal classes - A mode is the observation that occurs with the greatest frequency. A modal class is the class with the largest number of observations. A unimodal histogram is one with a single peak. A bimodal histogram is one with two peaks, not necessarily equal in height. Bimodal histograms often indicate that two different distributions are present.
Bell shape - A special type of symmetric unimodal histogram is one that is bell shaped (such as the one presented in Figure 2.10 on page 37).
One of the drawbacks of the histogram is that potentially useful information can be lost by classifying the observations. A stem-and-leaf display is a method which partially overcomes this loss. The first step in developing a stem-and-leaf display is to split each observation into two parts, a stem and a leaf. There are several different ways of doing this. For instance, the number 15.6 can be split so that the stem is 15 and the leaf is 6. In this definition the stem consists of the digits to the left of the decimal point and the leaf is the digit to the right of the decimal point. Another method can define the stem as 1 and the leaf as 5. In this definition the stem is the number of tens and the leaf is the number of ones. The stem-and-leaf display is similar to a histogram turned on its side. The length of each line represents the frequency in the class interval defined by the stems. The advantage of the stem-and-leaf display over the histogram is that the actual observations can be seen.
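The second splitting method (stem = tens, leaf = ones) can be sketched as follows. The function name and the data are hypothetical, and the sketch assumes non-negative whole-number observations.

```python
from collections import defaultdict


def stem_and_leaf(observations):
    """Group non-negative integers by stem (tens), with leaves (ones) in ascending order."""
    display = defaultdict(list)
    for x in sorted(observations):
        stem, leaf = divmod(x, 10)
        display[stem].append(leaf)
    return dict(display)


# Hypothetical observations, invented for illustration only.
data = [15, 22, 27, 31, 18, 24, 12, 39, 20]

for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each printed line is one class, and its length shows the class frequency while the individual observations remain readable.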
The frequency distribution lists the number of observations that fall into each class interval. A relative frequency distribution can also be created by dividing the frequencies by the number of observations. The relative frequency distribution highlights the proportion of the observations that fall into each class. In some situations a researcher may want to determine the proportion of observations that fall below each of the class limits. In such cases a cumulative relative frequency distribution is needed. Another way of presenting this information is the ogive, which is a graphical representation of the cumulative relative frequencies.
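Starting from class frequencies, both distributions can be computed as in the sketch below. The function names and the frequencies are hypothetical.

```python
from itertools import accumulate


def relative_frequencies(frequencies):
    """Divide each class frequency by the total number of observations."""
    total = sum(frequencies)
    return [f / total for f in frequencies]


def cumulative_relative_frequencies(frequencies):
    """Proportion of observations falling in each class or any earlier class."""
    total = sum(frequencies)
    return [running / total for running in accumulate(frequencies)]


# Hypothetical class frequencies for three consecutive classes.
frequencies = [3, 4, 3]

print(relative_frequencies(frequencies))
print(cumulative_relative_frequencies(frequencies))
```

Plotting the cumulative values against the upper class limits would give the ogive.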
Data can be also classified in the following way:
cross-sectional data, which are observations measured at the same point in time;
time-series data, which represent measurements at successive points in time.
Time-series data are often graphically depicted on a line chart, which is a plot of the variable over time. It is created by plotting the value of the variable on the vertical axis and the time periods on the horizontal axis.
Techniques applied to single sets of data are called univariate. When a researcher wants to depict the relationship between variables, bivariate methods are required.
A cross-classification table (also called a cross-tabulation table) is used to describe the relationship between two nominal variables. There are several ways to store the data to be used to produce a table and/or a bar or pie chart.
The data are in two columns where the first column represents the categories of the first nominal variable and the second column stores the categories for the second variable. Each row represents one observation of the two variables. The number of observations in each column must be the same.
The data are stored in two or more columns where each column represents the same variable in a different sample or population. To produce a cross-classification table, the number of observations of each category in each column has to be counted.
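For the first storage layout (two columns, one row per observation), the counts in a cross-classification table can be sketched as follows. The variable names, categories, and data are hypothetical.

```python
from collections import Counter


def cross_classification(first_column, second_column):
    """Count how often each pair of categories occurs, one pair per observation (row)."""
    if len(first_column) != len(second_column):
        raise ValueError("both columns must contain the same number of observations")
    return Counter(zip(first_column, second_column))


# Hypothetical data: one row per respondent, invented for illustration only.
occupation = ["blue collar", "white collar", "white collar", "blue collar", "white collar"]
newspaper = ["Post", "Globe", "Post", "Post", "Globe"]

table = cross_classification(occupation, newspaper)
for (row_category, column_category), count in sorted(table.items()):
    print(f"{row_category:12s} {column_category:6s} {count}")
```

Each printed count is one cell of the table; combinations that never occur have a count of zero.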
Researchers often need to know how two interval variables are related. The technique used to describe the relationship between such variables is called a scatter diagram. To draw a scatter diagram a researcher needs data for two variables. In applications where one variable depends to some degree on the other variable, the dependent variable is labeled Y and the other, called the independent variable, is labeled X. In other cases where there is no dependency evident, the variables are labeled arbitrarily.
To determine the strength of the linear relationship a researcher needs to draw a straight line through the points in such a way that the line represents the relationship. If most of the points fall close to the line it can be said that there is a linear relationship. If most of the points appear to be scattered randomly with only an impression of a straight line, there is no, or at best, a weak linear relationship (however, there may be some other type of relationship, e.g. a quadratic or exponential one).
When one variable tends to increase as the other variable increases, it can be said that there is a positive linear relationship. When the two variables tend to move in opposite directions, the nature of their association is described as a negative linear relationship.
However, if two variables are linearly related, it does not mean that one is causing the other. A causal relationship can never be concluded from an observed linear relationship alone. In short, correlation is not causation.
Graphical excellence is a term which applies to techniques that are informative and concise and that communicate information clearly to their viewers.
Graphical excellence is achieved when the following characteristics apply:
The graph presents large data sets concisely and coherently.
The ideas and concepts the researcher wants to deliver are clearly understood by the viewer.
The graph encourages the viewer to compare two or more variables.
The display induces the viewer to address the substance of the data and not the form of the graph.
There is no distortion of what the data reveal.
Researchers should be aware of possible methods of graphical deception. Firstly, a researcher should be careful about graphs without a scale on one axis. Secondly, he or she should also avoid being influenced by a graph's caption. Additionally, perspective is often distorted if only absolute changes in value, rather than percentage changes, are reported. For instance, 15% growth in revenues can be made to appear more dramatic by stretching the vertical axis, a technique that involves changing the scale on the vertical axis so that a given euro amount is represented by a greater height than before. As a result, the rise in revenues appears to be greater, because the slope of the graph is visually (but not numerically) steeper. The expanded scale is usually accommodated by employing a break in the vertical axis or by shortening the vertical axis so that the vertical scale starts at a point greater than zero. The effect of making slopes appear steeper can also be created by shrinking the horizontal axis, in which case points on the horizontal axis are moved closer together. Just the opposite effect is obtained by stretching the horizontal axis, that is, spreading out the points on the horizontal axis to increase the distance between them so that slopes and trends appear less steep. Similar illusions can be created with bar charts by stretching or shrinking the vertical or horizontal axis. Another method which is used to create distorted impressions with bar charts is to construct the bars so that their widths are proportional to their heights. Lastly, a researcher should also be careful about size distortions, particularly in pictograms, which replace the bars with pictures of objects to enhance the visual appeal.
In general, preparation of a statistical report should contain the following:
Statement of objectives;
Description of the experiment;
Description of the results; and
Discussion of limitations of the statistical techniques.
CHAPTER C: NUMERICAL DESCRIPTIVE TECHNIQUES
Measures of central location are the following:
mean,
median, and
mode.
The arithmetic mean is also referred to as the mean or average. The mean is computed by summing the observations and dividing by the number of observations. The observations in a sample are labeled $x_1, x_2, \ldots, x_n$, where $x_1$ is the first observation, $x_2$ is the second, and so on until $x_n$, where n is the sample size. The sample mean is denoted $\bar{x}$. In a population, the number of observations is labeled N and the population mean is denoted by $\mu$.
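Written out as formulas, the verbal definition above (sum the observations, then divide by their number) reads as follows, using $x_i$ for the i-th observation:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad\qquad
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
```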
The median is calculated by placing all the observations in order (ascending or descending). The observation that falls in the middle is the median. However, when there is an even number of observations, the median is determined by averaging the two observations in the middle. The sample and population medians are computed in the same way. When there is a relatively small number of extreme observations (either very small or very large), the median usually produces a better measure of the central location than the mean.
The mode is the observation that occurs with the greatest frequency. Both the statistic (sample mode) and parameter (population mode) are computed in the same way. However, for populations and large samples, it is preferable to report the modal class. There are two main disadvantages of using the mode as a measure of central location: in a small sample it may not be a very good measure and it may not be unique.
When the data are interval, any of the three measures of central location can be used. However, for ordinal and nominal data the calculation of the mean is not valid. Because the calculation of the median begins by placing the data in order, this statistic is appropriate for ordinal data. The mode, which is determined by counting the frequency of each observation, is appropriate for nominal data.
When to compute the Mean:
Interval Data
Descriptive measurement of central location
When to compute the Median:
Ordinal data or interval data (with extreme observations)
Descriptive measurement of central location
When to compute the Mode:
Nominal, ordinal, or interval data.
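The three measures of central location can be compared on a small data set using Python's standard `statistics` module. The data are hypothetical and were chosen to include one extreme observation.

```python
import statistics

# Hypothetical observations with one extreme value (30), invented for illustration only.
data = [1, 2, 2, 3, 4, 7, 30]

mean = statistics.mean(data)        # sum of the observations / number of observations
median = statistics.median(data)    # middle observation of the ordered data
modes = statistics.multimode(data)  # most frequent value(s); a list, since the mode may not be unique

print(mean, median, modes)
```

The single extreme observation pulls the mean well above most of the data, while the median stays near the bulk of the observations, which illustrates why the median is preferred when extreme observations are present.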
Measures of variability are the following:
Range,
Variance,
Standard deviation, and
Coefficient of variation.
Range = Largest observation – Smallest observation
The advantage of the range is simplicity; however, because the range is calculated from only two observations, it provides no information about the other observations.
To compute the sample variance $s^2$, the sample mean $\bar{x}$ has to be calculated first. Next the difference (also called the deviation) between each observation and the mean has to be calculated. Then, the deviations are squared and summed. Finally, the sum of squared deviations is divided by n – 1.
Note: some of the deviations are positive and some are negative. When added together, the sum is 0. This will always be the case because the sum of the positive deviations will always equal the sum of the negative deviations. Consequently, the deviations are squared to avoid the “cancelling effect.”
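The steps of the sample variance calculation, and the fact that the raw deviations cancel out, can be sketched in Python. The function name and the data are hypothetical.

```python
def sample_variance(observations):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(observations)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = sum(observations) / n
    deviations = [x - mean for x in observations]
    return sum(d ** 2 for d in deviations) / (n - 1)


# Hypothetical observations, invented for illustration only.
data = [8, 4, 9, 11, 3]
mean = sum(data) / len(data)
deviations = [x - mean for x in data]

print(deviations)             # positive and negative deviations
print(sum(deviations))        # the deviations cancel out
print(sample_variance(data))  # squaring first avoids the cancelling effect
```

With data whose mean is not exactly representable as a floating-point number, the printed sum of deviations may be a tiny non-zero value caused by rounding rather than exactly 0.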
The variance provides only a rough idea about the amount of variation in the data. However, this statistic is useful when comparing two or more sets of data of the same type of variable.
If the variance of one data set is larger than that of a second data set, it can be said the observations in the first set display more variation than the observations in the second set.
The standard deviation is simply the positive square root of the variance (note: the unit associated with the standard deviation is the unit of the original data set).
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
Knowing the mean and standard deviation allows a researcher to extract useful bits of information. The information depends on the shape of the histogram. If the histogram is bell shaped, the Empirical Rule can be used.
EMPIRICAL RULE
Approximately 68% of all observations fall within one standard deviation of the mean.
Approximately 95% of all observations fall within two standard deviations of the mean.
Approximately 99.7% of all observations fall within three standard deviations of the mean.
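The Empirical Rule can be checked on simulated bell-shaped data. The sketch below is illustrative only: the function name is hypothetical, and the simulated data (normally distributed with mean 100 and standard deviation 15) are an assumption, not something taken from the text.

```python
import random
import statistics


def proportion_within(observations, k):
    """Proportion of observations within k sample standard deviations of the sample mean."""
    mean = statistics.mean(observations)
    s = statistics.stdev(observations)  # positive square root of the sample variance
    inside = [x for x in observations if abs(x - mean) <= k * s]
    return len(inside) / len(observations)


# Simulated bell-shaped data (assumed for illustration: normal, mean 100, sd 15).
rng = random.Random(42)
simulated = [rng.gauss(100, 15) for _ in range(10_000)]

for k in (1, 2, 3):
    print(k, round(proportion_within(simulated, k), 3))
```

For bell-shaped data of this size the printed proportions should land close to the three percentages listed above; for histograms that are not bell shaped they can differ considerably.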
A more general interpretation of the standard deviation is derived from Chebysheff’s Theorem, which applies to all shapes of histograms.
CHEBYSHEFF’S THEOREM
The proportion of observations in any sample or population that lie within k standard deviations of the mean is at least $1 - 1/k^2$, for k > 1.
For instance, when k = 2, Chebysheff’s Theorem states that at least three-quarters (75%) of all observations lie within two standard deviations of the mean. With k = 3, Chebysheff’s Theorem states that at least eight-ninths of all observations lie within three standard deviations of the mean.
The Empirical Rule provides approximate proportions, whereas Chebysheff’s Theorem provides lower bounds on the proportions contained in the intervals.
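The lower bound given by Chebysheff's Theorem, 1 - 1/k² for k > 1, is easy to tabulate. The function name in this sketch is hypothetical.

```python
def chebysheff_lower_bound(k):
    """Lower bound 1 - 1/k^2 on the proportion within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3):
    print(k, chebysheff_lower_bound(k))
```

For k = 2 and k = 3 this reproduces the three-quarters and eight-ninths quoted above, both noticeably lower than the approximate proportions of the Empirical Rule because the theorem must hold for every shape of histogram.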
The coefficient of variation of a set of observations is the standard deviation of the observations divided by their mean.
Population coefficient of variation: $CV = \sigma / \mu$
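A short sketch of the population coefficient of variation, treating the listed values as a complete population, follows. The function name and both data sets are hypothetical.

```python
import statistics


def population_cv(observations):
    """Population coefficient of variation: population standard deviation / population mean."""
    mu = statistics.mean(observations)
    if mu == 0:
        raise ValueError("the coefficient of variation is undefined when the mean is 0")
    return statistics.pstdev(observations) / mu


# Hypothetical populations: the second is the first with every value multiplied by 10.
small_values = [2, 4, 4, 4, 5, 5, 7, 9]
large_values = [20, 40, 40, 40, 50, 50, 70, 90]

print(population_cv(small_values))
print(population_cv(large_values))
```

The two data sets have very different standard deviations but the same coefficient of variation, which shows that this measure expresses variability relative to the size of the mean.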
The measures of variability described above can be used only for interval data. None of the above can be used to describe the variability of ordinal data. There are no measures of variability for nominal data.
Measures of relative standing are designed to provide information about the position of particular values relative to the entire data set. The median, which is a measure of central location, is also a measure of relative standing.
The Pth percentile is the value for which P percent of the observations are less than that value and (100 – P)% are greater than that value. There are special names for the 25th, 50th, and 75th percentiles. Because these three statistics divide the set of data into quarters, these measures of relative standing are also called quartiles. The first or lower quartile is labeled Q1 and it is equal to the 25th percentile. The second quartile, which is labeled Q2, is equal to the 50th percentile, which is also the median. The third or upper quartile, which is labeled Q3, is equal to the 75th percentile. Besides quartiles, percentiles can also be converted into quintiles and deciles. Quintiles divide the data into fifths, and deciles divide the data into tenths.
The following formula allows approximating the location of any percentile (Lp is the location of the Pth percentile).
Lp = (n+1)(P/100)
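The location formula can be turned into a percentile value as in the sketch below. The text only gives the location; the linear interpolation between the two neighbouring ordered observations when the location is not a whole number is an assumption of this sketch, as are the function name and the data.

```python
def percentile(observations, p):
    """Approximate the Pth percentile using the location Lp = (n + 1) * P / 100."""
    ordered = sorted(observations)
    n = len(ordered)
    location = (n + 1) * p / 100  # 1-based position in the ordered data
    if location <= 1:
        return ordered[0]
    if location >= n:
        return ordered[-1]
    lower_position = int(location)
    fraction = location - lower_position
    lower = ordered[lower_position - 1]
    upper = ordered[lower_position]
    # Assumed convention: interpolate linearly between the neighbouring observations.
    return lower + fraction * (upper - lower)


# Hypothetical observations, invented for illustration only.
data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]

for p in (25, 50, 75):
    print(p, percentile(data, p))
```

With n = 10, the 25th percentile is located at position 2.75, that is, three-quarters of the way between the second and third ordered observations.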
Usually, the shape of the histogram can be predicted from the quartiles. For instance, if the first and second quartiles are closer to each other than are the second and third quartiles, the histogram is positively skewed. If the first and second quartiles are farther apart than the second and third quartiles, the histogram is negatively skewed. If the two differences are approximately equal, the histogram is approximately symmetric. The box plot is particularly useful in this regard.
The quartiles can be used to create another measure of variability, the interquartile range, which is expressed as follows:
Interquartile range = Q3 – Q1
The interquartile range measures the spread of the middle 50% of the observations. Large values of this statistic mean that the first and third quartiles are far apart, indicating a high level of variability.
The box plot, another graphical technique, graphs five statistics: the minimum and maximum observations, and the first, second, and third quartiles. It also depicts other features of a set of data. The box plot is particularly useful when comparing two or more data sets. An example of the box plot is presented in Figure 4.1 on page 119. The three vertical lines of the box are the first, second, and third quartiles. The lines extending to the left and right are called whiskers. Any points that lie outside the whiskers are called outliers. The whiskers extend outward from the box to the smaller of 1.5 times the interquartile range or the most extreme point that is not an outlier.
Outliers are unusually large or small observations. Because an outlier is considerably removed from the main body of the data set, its validity is suspect.
Consequently, outliers should be checked to determine that they are not the result of an error in recording their values. Outliers can also represent unusual observations that should be investigated.
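The numbers behind a box plot can be sketched as follows. This is an illustrative sketch under stated assumptions: quartiles are located with Lp = (n + 1)P/100 and linear interpolation, outliers are taken to be points more than 1.5 times the interquartile range beyond the first or third quartile, and the function names and data are hypothetical.

```python
def percentile(observations, p):
    """Approximate the Pth percentile using the location Lp = (n + 1) * P / 100."""
    ordered = sorted(observations)
    n = len(ordered)
    location = (n + 1) * p / 100
    if location <= 1:
        return ordered[0]
    if location >= n:
        return ordered[-1]
    position = int(location)
    fraction = location - position
    return ordered[position - 1] + fraction * (ordered[position] - ordered[position - 1])


def box_plot_summary(observations):
    """Quartiles, interquartile range, whisker ends, and outliers for a box plot."""
    ordered = sorted(observations)
    q1, q2, q3 = (percentile(ordered, p) for p in (25, 50, 75))
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr  # assumed outlier boundary below the box
    upper_fence = q3 + 1.5 * iqr  # assumed outlier boundary above the box
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical observations with one unusually large value, invented for illustration only.
data = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33, 60]
print(box_plot_summary(data))
```

In this data set the largest observation lies beyond the upper boundary and is flagged as an outlier, so the right whisker stops at the most extreme observation that is not an outlier.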
Because the measures of relative standing are computed by ordering the data, these statistics are appropriate for ordinal as well as for interval data. Furthermore, because the interquartile range is calculated by taking the difference between the upper and lower quartiles, it too can be employed to measure the variability of ordinal data.
Chapter D-I
To be found in the PDF's below.