Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)
Descriptive statistics serve to create an overview or summary of data. There are two kinds of data, quantitative and categorical, and each has its own descriptive statistics.
To create an overview of categorical data, it is easiest to list the categories together with the frequency of each category. To compare the categories, the relative frequencies are listed too. The relative frequency of a category shows how often a subject falls within this category relative to the whole sample. It can be expressed as a proportion or as a percentage. The proportion is the number of observations within a category divided by the total number of observations. The percentage is the proportion multiplied by 100. The sum of all proportions should be 1.00, and the sum of all percentages should be 100.
Frequencies can be shown using a frequency distribution, a list of all possible values of a variable and the number of observations for each value. A relative frequency distribution also shows the share of the sample that each value represents.
Example (relative) frequency distribution:
Gender | Frequency | Proportion | Percentage |
Male | 150 | 0.43 | 43% |
Female | 200 | 0.57 | 57% |
Total | 350 (=n) | 1.00 | 100% |
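The proportions and percentages in the table above can be reproduced with a short calculation. The following is a minimal Python sketch; the function name relative_frequencies is chosen here for illustration and does not come from the book.

```python
def relative_frequencies(counts):
    """Return {category: (frequency, proportion, percentage)} for a dict of counts."""
    n = sum(counts.values())  # total number of observations in the sample
    return {
        category: (freq, freq / n, 100 * freq / n)
        for category, freq in counts.items()
    }


# Counts taken from the example table above.
counts = {"Male": 150, "Female": 200}
table = relative_frequencies(counts)

for category, (freq, prop, pct) in table.items():
    print(f"{category:<7} {freq:>4} {prop:.2f} {pct:.0f}%")
```

Rounded to two decimals, the proportions match the table, and they add up to 1.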
Aside from tables, other visual displays are also used, such as bar graphs, pie charts, histograms and stem-and-leaf plots.
A bar graph is used for categorical variables and uses a bar for each category. The bars are separated to indicate that the graph doesn't display quantitative variables but categorical variables.
A pie chart is also used for categorical variables. Each slice represents a category. When the values are close together, bar graphs show the differences more clearly than pie charts.
Frequency distributions and other visual displays are also used for quantitative variables. In that case, the categories are replaced by intervals. Each interval has a frequency, a proportion and a percentage.
A histogram is a graph of the frequency distribution for a quantitative variable. Each value is represented by a bar. When there are many distinct values, it is easier to divide them into intervals and draw one bar per interval.
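Grouping a quantitative variable into intervals can be sketched as follows. The data, the interval width of 10 and the function name interval_frequencies are assumptions made for this illustration only.

```python
from collections import Counter


def interval_frequencies(values, width):
    """Count observations per interval [lower, lower + width)."""
    # Map each value to the lower bound of the interval it falls in.
    bins = Counter((v // width) * width for v in values)
    n = len(values)
    # One row per non-empty interval: lower bound, upper bound, frequency, proportion.
    return [(lo, lo + width, bins[lo], bins[lo] / n) for lo in sorted(bins)]


# Made-up ages, not taken from the book.
ages = [21, 25, 34, 37, 38, 42, 45, 51, 58, 63]
rows = interval_frequencies(ages, 10)

for lo, hi, freq, prop in rows:
    print(f"{lo}-{hi - 1}: frequency {freq}, proportion {prop:.2f}")
```

Each row corresponds to one bar of a histogram. The bar's height would be the frequency or the proportion of that interval.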
A stem-and-leaf plot represents each observation using a stem and a leaf: two numbers that form the observation when put together. This kind of graph is only useful when there are few observations and you want to show the data quickly.
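As a rough illustration, a stem-and-leaf plot for two-digit numbers can use the tens digit as the stem and the units digit as the leaf. The sketch below assumes non-negative integers. The data and the function name stem_and_leaf are made up for this example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by stem (value // 10) with leaves (value % 10)."""
    stems = defaultdict(list)
    for v in sorted(values):  # sorting keeps the leaves in ascending order
        stems[v // 10].append(v % 10)
    return dict(stems)


# Made-up scores, not taken from the book.
scores = [34, 55, 64, 38, 51, 59, 62, 47]
plot = stem_and_leaf(scores)

for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```

Reading a stem together with one of its leaves gives back an observation, for example stem 5 and leaf 9 form 59.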
When visual displays are given for a population, they are called population distributions. When they are given for a sample, they are called sample data distributions.
The data can be shown using a curve in a graph. The larger the sample, the more closely the sample graph resembles the curve of the population. The shape of a graph contains information on the distribution of the data. The most commonly used shape is the bell shape of the normal distribution, which is symmetrical. If the x-axis indicates the value of a variable, then the y-axis indicates the relative frequency of that value. The highest point is in the middle, so the value in the middle is the most prevalent.
Another possibility is a U-shaped graph. The most prevalent values are then the lowest and the highest scores, which indicates polarization.
The two ends of a curve are called tails. If one tail is longer than the other, the distribution is not symmetrical but skewed: to the right if the right tail is longer, to the left if the left tail is longer.
The average is the best-known measure to describe the center of the data for a frequency distribution of a quantitative variable. The average is also called the mean, and it is calculated as the sum of the observations divided by the total number of observations. For example, if a variable (y) has the values 34 (y1), 55 (y2) and 64 (y3), then the mean (ȳ) is (34 + 55 + 64)/3 = 51. The symbol ȳ is pronounced as y-bar.
The formula for calculating the mean is: ȳ = (∑ yi) / n.
The symbol ∑ is the Greek capital letter sigma and means the sum of what follows it. The index i runs from 1 to n (the sample size). So ∑ yi means y1 + y2 + … + yn (the sum of all observations).
The mean can only be used for quantitative data and is very sensitive to outliers, that is, exceptionally high or low values.
For multiple samples (n1 and n2), multiple means can be found (ȳ1 and ȳ2).
Another way to describe the center is the median. The median is the observation that falls in the middle of the ordered sample. If a variable has the values 1, 3, 5, 8 and 10, then the median is 5. For an even number of observations, such as 1, 3, 8 and 10, the median is the mean of the two middle observations: (3 + 8)/2 = 5.5.
Important rules about the median are:
Apart from quantitative data, the median can also be found for categorical data on an ordinal scale, because the median only requires a certain order in the observations.
For completely symmetrical data the median and the mean are the same.
For a skewed distribution, the mean lies closer to the longer tail than the median does.
The median is not sensitive to outliers. This is both positive and negative. On the one hand, if there is just one outlier in the data, the median doesn't give a distorted picture of the data. On the other hand, very different sets of data with very different variability can still have the same median.
Compared to the mean, the median represents the sample better in case of outliers. The median also gives more information if the distribution is very skewed. However, there are also cases where the median is less suitable for representing the data. When the data is binary (only 0 or 1), the median only shows which of the two values is observed more often, whereas the mean equals the proportion of times that 1 is observed. Also in other cases where the data is highly discrete, the mean represents the data better than the median does.
Another measure is the mode: the value that is most prevalent. The mode is useful for highly discrete variables, mostly categorical data.
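The three measures of center can be compared with Python's standard statistics module. The sketch below reuses the small examples from the text. The outlier value 1000, the binary data and the list of colors are made up here to illustrate the points about sensitivity.

```python
import statistics

# Examples from the text.
mean_example = statistics.mean([34, 55, 64])      # (34 + 55 + 64) / 3
median_odd = statistics.median([1, 3, 5, 8, 10])  # middle observation
median_even = statistics.median([1, 3, 8, 10])    # mean of the two middle observations

# Made-up outlier: the mean shifts strongly, the median barely moves.
values = [1, 3, 5, 8, 10]
with_outlier = values + [1000]
mean_shift = statistics.mean(with_outlier) - statistics.mean(values)
median_shift = statistics.median(with_outlier) - statistics.median(values)

# Made-up binary data: the mean is the proportion of ones,
# the median only shows which value is more common.
binary = [0, 1, 1, 1]
binary_mean = statistics.mean(binary)
binary_median = statistics.median(binary)

# Made-up categorical data: the mode is the most prevalent category.
most_common = statistics.mode(["red", "blue", "blue", "green"])

print(mean_example, median_odd, median_even)
print(mean_shift, median_shift)
print(binary_mean, binary_median, most_common)
```

Adding a single extreme value moves the mean by more than a hundred units, while the median moves by 1.5.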
The variability of data refers to how much the values of a variable differ from each other, for instance how much the incomes of the respondents differ. The variability can be described in several ways.
First, the range can be calculated: the difference between the lowest and the highest observation. For example, for the values 4, 10, 16 and 20 the range is 20 – 4 = 16.
However, the most commonly used measure of variability is the standard deviation (s). A deviation is the difference between a measured value (yi) and the mean of the sample (ȳ), so it is (yi – ȳ). Every observation has its own deviation: positive when the observation is higher than the mean, negative when it is lower than the mean. The standard deviation summarizes all of these deviations in one number for the variable, using the sum of the squared deviations. The formula for the standard deviation is:
s = √( ∑ (yi – ȳ)² / (n – 1) )
The numerator of the formula, ∑ (yi – ȳ)², is called the sum of squares. It squares the deviation of each observation and adds the results. The standard deviation shows how much an observation typically deviates from the mean, so how much the data varies. When the standard deviation is 0, there is no variability at all.
The variance is:
s² = ∑ (yi – ȳ)² / (n – 1)
The variance is approximately the mean of the squared deviations. The standard deviation is used more often as an indication of the variability than the variance, because it is expressed in the same units as the data.
When data is available for the entire population, the population size is used instead of n – 1 for calculating the standard deviation.
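As a worked example, the sketch below computes the sample variance and standard deviation for the values 4, 10, 16 and 20 used earlier for the range. It divides the sum of squares by n – 1 and compares the result with Python's statistics module. The function name sample_variance is chosen here for illustration.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((y - mean) ** 2 for y in values)
    return sum_of_squares / (n - 1)


values = [4, 10, 16, 20]       # mean is 12.5
s2 = sample_variance(values)   # sum of squares 147, divided by 3
s = math.sqrt(s2)

# For a whole population, the divisor is the population size instead of n - 1.
population_variance = statistics.pvariance(values)

print(s2, s, population_variance)
```

For these values the deviations are –8.5, –2.5, 3.5 and 7.5, so the sum of squares is 147, the variance is 49 and the standard deviation is 7.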
For interpreting s, the so-called empirical rule can be used for bell-shaped distributions:
About 68% of the data lies between ȳ – s and ȳ + s.
About 95% of the data lies between ȳ – 2s and ȳ + 2s.
Nearly all observations lie between ȳ – 3s and ȳ + 3s.
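The percentages in the empirical rule can be checked against an exact normal distribution, which is the idealized bell shape. The sketch below uses NormalDist from Python's statistics module. The function name share_within is an assumption made for this example, and real bell-shaped data will only match these figures approximately.

```python
from statistics import NormalDist


def share_within(k, mean=0.0, sd=1.0):
    """Share of a normal distribution within k standard deviations of the mean."""
    dist = NormalDist(mean, sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

The shares do not depend on the particular mean and standard deviation, which is why the rule can be stated in terms of ȳ and s.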
Outliers have a big effect on the standard deviation.
Distributions can be interpreted with several measures of position. One way to divide a distribution into parts is using percentiles. The pth percentile is the point where p% of the observations fall below or at that point and the rest of the observations, (100 – p)%, fall above it. A percentile indicates a point in a graph, not a part of a graph.
Another way is to divide a distribution into four parts. The 25th percentile is then called the lower quartile and the 75th percentile the upper quartile. Half of the data lies between these two quartiles. The difference between the upper and the lower quartile is called the interquartile range (IQR). The median splits this middle half in two parts. The lower quartile is the median of the lower half of the data and the upper quartile is the median of the upper half. An advantage of the IQR compared to the range and the standard deviation is that the IQR is insensitive to outliers.
Five positions are often used to give a summary of a distribution: minimum, lower quartile, median, upper quartile and maximum. The positions can be shown in a boxplot, a graph that indicates the variability of data. The box of a boxplot contains the central 50% of the distribution.
The lines that extend from the box of a boxplot towards the minimum and maximum are called the whiskers. Outliers are indicated with a separate mark outside of the whiskers. An observation is regarded as an outlier when it falls more than 1.5 IQR below the lower quartile or above the upper quartile. A boxplot makes the outliers very explicit, which should prompt the researcher to check again whether the research methods have been applied properly.
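The five-number summary and the 1.5 IQR rule for outliers can be sketched as follows. The quartiles are computed as the medians of the lower and upper half of the ordered data, as described above. For an odd number of observations the sketch leaves the median itself out of both halves, which is one common convention and an assumption here; statistical software may use slightly different quartile definitions. The data and function names are made up for this example.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    # Skip the middle observation when n is odd (an assumed convention).
    upper_half = data[half + 1:] if n % 2 else data[half:]
    return (
        data[0],
        statistics.median(lower_half),
        statistics.median(data),
        statistics.median(upper_half),
        data[-1],
    )


def find_outliers(values):
    """Observations more than 1.5 IQR below the lower or above the upper quartile."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [y for y in values if y < low or y > high]


# Made-up data with one extreme value.
data = [2, 4, 5, 7, 8, 9, 11, 12, 40]
summary = five_number_summary(data)
flagged = find_outliers(data)

print(summary)
print(flagged)
```

Here the quartiles are 4.5 and 11.5, so the IQR is 7 and any observation below –6 or above 22 is flagged. Only the value 40 qualifies.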
Several sorts of graphs help to compare two or more groups, for instance a relative frequency distribution, histogram or two boxplots next to each other.
Another measure of position is the z-score. This is the number of standard deviations that a value lies above or below the mean. The formula is: z = (observation – mean) / standard deviation. Unlike the other measures of position, the z-score gives information about one specific value.
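The z-score formula translates directly into code. In the sketch below, the mean of 50 and the standard deviation of 10 are made-up numbers chosen for easy arithmetic.

```python
def z_score(observation, mean, standard_deviation):
    """Number of standard deviations that an observation lies from the mean."""
    return (observation - mean) / standard_deviation


# Made-up example: mean 50, standard deviation 10.
z_high = z_score(65, 50, 10)  # above the mean, so positive
z_low = z_score(30, 50, 10)   # below the mean, so negative

print(z_high, z_low)
```

A positive z-score means the observation lies above the mean, a negative one means it lies below.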
Statistics is often about the association between two variables; whether one variable has an influence on another. This is called bivariate analysis.
Most often a study examines the effect of an explanatory variable (also called independent variable) on a response variable (also called dependent variable). The outcome of the response variable is regarded as depending on the explanatory variable.
The influence of one variable on another can be portrayed in several ways. A contingency table lists the number of observations for each combination of values of the variables. A scatterplot is a graph with the explanatory variable on the x-axis and the response variable on the y-axis. Each observation is shown as a dot at the point given by its values on both variables. The strength of an association is called the correlation. Regression analysis predicts the value of y for a given value of x. When an association exists between variables, this doesn't necessarily mean that there is causality. For multiple variables, multivariate analysis is used.
In statistics it's important not to lose sight of the difference between a statistic, which describes only the sample, and a parameter, which describes the entire population. Greek letters are used for the population parameters, Roman letters are used for the sample statistics. For a sample, ȳ indicates the mean and s indicates the standard deviation. For a population, μ indicates the population mean and σ the standard deviation of the population. The sample mean and the sample standard deviation can also be regarded as variables, because their values differ from sample to sample. For the population parameters this isn't possible, because there is only one population.
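A contingency table can be built by counting how often each combination of values occurs. The sketch below uses made-up observations of gender and a yes/no answer. The variable names and the data are assumptions for illustration only.

```python
from collections import Counter

# Made-up observations: (explanatory value, response value) per subject.
observations = [
    ("Male", "Yes"), ("Male", "No"), ("Male", "No"),
    ("Female", "Yes"), ("Female", "Yes"), ("Female", "No"),
]

# Count the number of subjects for each combination of values.
cell_counts = Counter(observations)

genders = sorted({gender for gender, _ in observations})
answers = sorted({answer for _, answer in observations})

# Print one row per gender and one column per answer.
print("Gender".ljust(8) + "".join(answer.rjust(5) for answer in answers))
for gender in genders:
    cells = "".join(str(cell_counts[(gender, answer)]).rjust(5) for answer in answers)
    print(gender.ljust(8) + cells)
```

Each cell shows how many subjects fall in that combination, and the cells together add up to the sample size.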