Understanding data: distributions, connections and gatherings

In short: Data

Data is any collection of facts, statistics, or information that can be used for analysis or decision-making. It can be raw or processed, and it can be in the form of numbers, text, images, or sounds.

Understanding data: distributions, connections and gatherings

Understanding and knowing what sort of data you have is essential for conducting succesfull statistical tests. Here you will find general introductiory information about the types of data you will most likely encounter in your researches.

Data: distributions, connections and gatherings

Collecting data
Quantitative versus qualitative data
Associations
Interpreting and displaying raw data
Central tendency
Measuring variability
Variance and standard deviation

Collecting data

Data collection can be subdivided into three groups:

Observational measurements: in this case, behavior is observed directly. This can be done in every study in which the behavior that is to be examined, can be seen directly. Researchers can observe the behavior directly, or they can make audio- or video-recordings, from which information about the participants can be deduced. In observational studies the dependent and independent variable(s) of interest are not manipulated. No claims can be made about cause-and-effect relationships between variables.
Physical measurements: these are used when the researcher is interested in the relation between behavior and not-directly observable physical processes. It refers here to processes of the human body, that often can not be observed by eye. For example, hart rate, sweating, brain activity and hormonal changes.
Self reportage measurements: participants answer questions on questionnaires or interviews themselves. There are three kinds of self reportages: 1) cognitive: these measure what people think 2) affective: these measure what people feel and 3) behavioral: these measure what people do.

Quantitative versus qualitative data

In statistics, a subdivision is made into quantitative and qualitative data. Quantitative data results from a certain measurement, for example the grade on a test, weight or the scores on a scale. A measurement instrument is used to determine how much a certain characteristic is present in an object.

Qualitative data is also called frequency or categorical data. Qualitative data refers to categorizing objects. For example, 15 people are categorized as ‘very anxious’, 33 people are categorized as ‘neutral’ and 12 people are categorized as ‘little anxious’. The data consists of frequencies for each category.

Associations

Most research is conducted to discover or examine associations between variables, for example to examine the relation between sleeping habits and school achievement. The first research technique to examine relations is the correlational method. With the correlation method the researcher observes two variables to discover if there is a relation between them. The experimental method is used when the researcher is interested in the cause-and-effect relation between variables. A change in one variable will cause a change in another variable. This method has two essential characteristics. First, there is a manipulation. This implies that the researcher changes the values of a variable (X). Next, values of a second variable (Y) are measured to see if changes of X influence the values of Y. Second, there is control. This means that the researcher has to keep the research situation constant. When all other variables/conditions are kept constant, the researcher can claim that changes in Y are caused by X and not by another variable. It is important to be aware of the distinction between correlation and causation. A correlation implies that there is a relation between variables, but this does not tell us anything about the direction of the effect. Hence, you can not say that changes in one variable are caused by the other variable. Three conditions have to be met in order to make statements about causality:

Covariance: variables should covary together. A high score on the x-variable should be in accordance with a high score on the y-variable.
Direction: the cause should precede the consequence.
Exclusion of the influence of other variables: it may be the case that a third variable (z) influences both x and y.

Interpreting and displaying raw data

Frequency distributions, proportions and intervals

When participants are being measured, the obtained data are called raw data. These data are difficult to interpret. Therefore, steps have to be taken in order to process these data. Raw data is only a collection of numbers. Structure can be added by, for example, displaying the data in a graph. When reaction times are measured, one can for example make a frequency distribution. In a frequency distribution, you note how often a certain value (here: reaction time) occurred. This helps you to visualize which value (here: reaction time) occurred most frequently. Describing proportions and percentages is also useful in a frequency distribution. A proportion is calculated by dividing the frequency that belongs to a certain X-value by the total amount of participants. For example, if two people, that belong to a class of 20 persons, scored a six (X=6), the proportion for the score six is 2/20 = 0.10. The formula is:

\[p = \frac{f}{N}\]

p: proportion for the score
f: frequency of the score
N: total amount of participations or observations

Because proportions are always calculated in relation with the total amount of participants or observations (N), we call them relative frequencies. Percentages can be obtained by multiplying proportions by hundred. Thus:

\[p_{(100)} = \frac{f}{N}\times 100%\]

p₍₁₀₀₎: percentage of proportion for the score
f: frequency of the score
N: total amount of participations or observations

Sometimes, many different scores are possible. In that case, it is better to make grouped frequency distributions. Here, we make groups of scores instead of only looking at individual values. The groups (or intervals) are called class-intervals. Instead of noting for example each possible length, you make groups of different length-intervals. For example, a group with the interval of 100 to 120 cm and a group with the interval of 121 to 140 cm. You can note the group behind each frequency.

Example:

\[p^{121-140}_{(100)} = \frac{f^{121-140}}{N}\times 100%\]

Graphs

A frequency distribution can be displayed well in a figure. This is called a graph. An example is a histogram. The horizontal axis is called the x-axis, and the vertical axis is called the y-axis. The categories are displayed on the horizontal axis, and the frequencies are displayed on the vertical axis. To make a histogram, bars have to be drawn. The height of each bar is in accordance with the frequency of the category. A bar chart is in principle similar to a histogram, except that the bars are not put directly next to each other. Also the values that differentiate strongly from the other values are displayed. These values are called outliers are often (but not always) not useful. Besides graphs, lines can also be applied to the obtained data. The most frequently used line is the normal curve. This line is highest in the middle of the distribution, and decreases symmetrically at both sides of the middle. The normal distribution is symmetric, but not every distribution looks like this. A bimodal distribution for example, has two peeks. If a distribution has only one peek, it is called a unimodal distribution. A distribution can also be asymmetric, because the distribution is longer on on of the sides. A distribution with a ‘tail’ to the left has a negative skewness, and a distribution with a tail to the right has a positive skewness.
Besides histograms and bar charts, one can also use stem-and-leaf-plots. In such plots, each score is subdivided into two parts. The first number (for example the 1 of 12) is called the stem, and the second number (for example the 2 in 12) is called the stem. When you draw a plot, first note all stems (the first number). Next, note each leaf of each score. A stem-and-leaf-plot offers you the opportunity to quickly find individual scores, which may be useful for calculations. This is not possible with a frequency distribution.

Percentiles

Individual scores are called raw scores. However, these scores do not provide much information. For example, if you tell someone you had a scored of 76 points on your exam, it is not clear to the other person whether this is good or bad. To be able to interpret such a score, it should be clear what the mean of all scores is and how your score relates to the mean. The rank or percentile rank is a number that implies what percentage of all individuals in the distributions scored below a certain value (in this example: 76 points). Such a score is also called a percentile. The percentile rank refers to the percentage, whilst the percentile refers to a score. You might know that you scored 76 points out of 90 on a test. But that figure has no real meaning unless you know what percentile you fall into. If you know that your score is in the 90th percentile, that means you scored better than 90% of people who took the test. To determine percentiles and percentile ranks, it first has to be examined how many individuals score below any value. The result is called cumulative percentages. These percentages show what percentage of individuals score below a certain X-value and add up to 100 for the highest possible value of X. An easy way to use percentile is by means of quartiles. The first quartile (Q1) is 25%, the second quartile (Q2) is 50% (thus, the mean) and the third quartile is 75%. The distance between the first and third quartile is called the interquartile range (IQR). 1.5 times the IQR above Q3 or below Q1 is a criterion to identify possible outliers. All these data can be displayed in a boxplot. The so-called ‘box’ is from the first to the third quartile. In addition, the median is displayed in the box by a horizontal line. In addition, there is a vertical line from the lowest to the highest observations, that also goes through the box. Outliers are displayed with an asterisk above or below the line.

Central tendency

Measurements of the central tendency are measurements that display where on the scale the distribution is centered. There are three ways to do so: the mode, the median and the mean. These manners differ in the amount of data they use.

Mode: is used least frequently and is often least useful. The mode is simply the most frequently occurring score. In case of two adjacent scores, the mean of these two numbers is taken.
Median: the score that corresponds to the point of which 50% of all scores falls below when the data are ordered numerically. Therefore, the median is also called the 50^th percentile. Imagine that we have the scores 4, 6, 8, 9, and 16. Here, the median is 8. In case of an even number of scores, for example 4, 6, 8, 12, 15, and 16, the median falls between 8 and 12. In this case, we take the mean of the two middle scores as median. Thus, the median is 10 in this case. A useful formula to find the median, is that of the median location:

\[Median\:location = \frac{(N+1)}{2}\]

N: number of scores

Mean: this measurement of the central tendency measurements is used most frequently, because all scores of the distributions are included. The mean is the sum of the scores divided by the total amount of scores. A disadvantage of the mean is that it is influenced by extreme scores. Therefore, the ‘trimmed’ mean is sometimes used. For example, ten scores at both ends of the distribution are excluded. As a result, the more extreme results are excluded and the estimation of the mean becomes more stable. Formula of the mean:

\[Mean = \frac{\sum x}{N}\]

x: individual score of x
N: number of scores

Measuring variability

The variability of a distribution refers to the extent to which scores are spread or clustered. Variability provides a quantitative value to the extent of difference between scores. A large value refers to high variability. The aim of measuring variability is twofold:

Describing the distance than can be expected between scores;
Measuring the representativeness of a scores for the whole distribution.

The range of a measurement is the distance between the highest and lowest score. The lowest score should be subtracted from the highest score. However, the range can provide a wrong image when there are extreme values present. Thus, the disadvantage of the range is that it does not account for all values, but only for the extreme values.

Variance and standard deviation

The standard deviation (SD) is the most frequently used and most important measure for spread. This measurement uses the mean of the distribution as comparison point. Moreover, the standard deviation uses the distance between individual scores and the mean of the data set. By using the standard deviation, you can check whether individual scores in general are far away or close to the mean. The standard deviation can be best understood by means of four steps:

First, the deviation of each individual score to the mean has to be calculated. The deviance is the difference between each individual score and the mean of the variable. The formula is:

\[Deviation\: score = x - µ\]

x: individual score of x
μ: mean of the variable

In the next step, calculate the mean of the deviation scores. This can be obtained by adding all deviations scores and dividing the sum by the number of deviation scores (N). The deviation scores are combined always zero. Before computing the mean, each deviation score should be placed between brackets and squared.

\[mean\:of\:the\:deviation\:scores = \frac{\sum{(x-\mu)}}{N}\]

x: individual score of x
μ: mean of the variable
N: number of scores

Next, the mean of the squared sum can be computed. This is called the variance. The formula of the variance is:

\[σ^2= \frac{\sum {(x-μ)^{2}}}{N}\]

σ²: squared sum or variance
x: individual score of x
μ: mean of the variable
N: number of scores

Finally, draw the square root of the variance. The result is the standard deviation. The final formula for the standard deviation is thus:

\[σ= \sqrt {\frac{\sum {(x-μ)^{2}}}{N}}\]

σ: standard deviation
x: individual score of x
μ: mean of the variable
N: number of scores

Often, the variance is a large and unclear number, because it comprises a squared number. It is therefore useful and easier to understand to compute and present the standard deviation.

In a sample with n scores, the first n-1 scores can vary, but the last score is definite. The sample consists of n-1 degrees of freedom (in short: df).

Systematic variance and error variance

The total variance can be subdivided into 1) systematic variance and 2) error variance.

Systematic variance refers to that part of the total variance that can predictably be related to the variables that the researcher examines.
Error variance emerges when the behavior of participants is influenced by variables that the researcher does not examine (did not include in his or her study) or by means of measurement error (errors made during the measurement). For example, if someone scores high on aggression, this may also be explained by his or her bad mood instead of the temperature. This form of variance can not be predicted in the study. The more error variance is present in a data set, the harder it is to determine if the manipulated variables (independent variables) actually are related to the behavior one wants to examine (the dependent variable). Therefore, researchers try to minimize the error variance in their study.

Glossary, practice questions and selected contributions with understanding data

Glossary for Data: distributions, connections and gatherings

Definitions and explanations of relevant terminology generally associated with Data: distributions, connections and gatherings

What are observational, physical and self rapportage measurements?

Observational measurements: behavior is observed directly.
Physical measurements: processes of the human body are observed that often can not be seen by eye. For example, hart rate, sweating, brain activity and hormonal changes.
Self reportage measurements: participants answer questions on questionnaires or interviews themselves.

What is the correlational method?

In the realm of research methodology, the correlational method is a powerful tool for investigating relationships between two or more variables. However, it's crucial to remember it doesn't establish cause-and-effect connections.

Think of it like searching for patterns and connections between things, but not necessarily proving one makes the other happen. It's like observing that people who sleep more tend to score higher on tests, but you can't definitively say that getting more sleep causes higher scores because other factors might also play a role.

Here are some key features of the correlational method:

No manipulation of variables: Unlike experiments where researchers actively change things, the correlational method observes naturally occurring relationships between variables.
Focus on measurement: Both variables are carefully measured using various methods like surveys, observations, or tests.
Quantitative data: The analysis primarily relies on numerical data to assess the strength and direction of the relationship.
Types of correlations: The relationship can be positive (both variables increase or decrease together), negative (one increases while the other decreases), or nonexistent (no clear pattern).

Here are some examples of when the correlational method is useful:

Exploring potential links between variables: Studying the relationship between exercise and heart disease, screen time and mental health, or income and educational attainment.
Developing hypotheses for further research: Observing correlations can trigger further investigations to determine causal relationships through experiments.
Understanding complex phenomena: When manipulating variables is impractical or unethical, correlations can provide insights into naturally occurring connections.

Limitations of the correlational method:

Cannot establish causation: Just because two things are correlated doesn't mean one causes the other. Alternative explanations or even coincidence can play a role.
Third-variable problem: Other unmeasured factors might influence both variables, leading to misleading correlations.

While the correlational method doesn't provide definitive answers, it's a valuable tool for exploring relationships and informing further research. Always remember to interpret correlations cautiously and consider alternative explanations.

What is the experimental method?

In the world of research, the experimental method reigns supreme when it comes to establishing cause-and-effect relationships. Unlike observational methods like surveys or correlational studies, experiments actively manipulate variables to see how one truly influences the other. It's like conducting a controlled experiment in your kitchen to see if adding a specific ingredient changes the outcome of your recipe.

Here are the key features of the experimental method:

Manipulation of variables: The researcher actively changes the independent variable (the presumed cause) to observe its effect on the dependent variable (the outcome).
Control groups: Experiments often involve one or more control groups that don't experience the manipulation, providing a baseline for comparison and helping to isolate the effect of the independent variable.
Randomization: Ideally, participants are randomly assigned to groups to control for any other factors that might influence the results, ensuring a fair and unbiased comparison.
Quantitative data: The analysis focuses on numerical data to measure and compare the effects of the manipulation.

Here are some types of experimental designs:

True experiment: Considered the "gold standard" with a control group, random assignment, and manipulation of variables.
Quasi-experiment: Similar to a true experiment but lacks random assignment due to practical limitations.
Pre-test/post-test design: Measures the dependent variable before and after the manipulation, but lacks a control group.

Here are some examples of when the experimental method is useful:

Testing the effectiveness of a new drug or treatment: Compare groups receiving the drug with a control group receiving a placebo.
Examining the impact of an educational intervention: Compare students exposed to the intervention with a similar group not exposed.
Investigating the effects of environmental factors: Manipulate an environmental variable (e.g., temperature) and observe its impact on plant growth.

While powerful, experimental research also has limitations:

Artificial environments: May not perfectly reflect real-world conditions.
Ethical considerations: Manipulating variables may have unintended consequences.
Cost and time: Can be expensive and time-consuming to conduct.

Despite these limitations, experimental research designs provide the strongest evidence for cause-and-effect relationships, making them crucial for testing hypotheses and advancing scientific knowledge.

What three conditions have to be met in order to make statements about causality?

While establishing causality is a cornerstone of scientific research, it's crucial to remember that it's not always a straightforward process. Although no single condition guarantees definitive proof, there are three key criteria that, when met together, strengthen the evidence for a causal relationship:

1. Covariance: This means that the two variables you're studying must change together in a predictable way. For example, if you're investigating the potential link between exercise and heart health, you'd need to observe that people who exercise more tend to have lower heart disease risk compared to those who exercise less.

2. Temporal precedence: The presumed cause (independent variable) must occur before the observed effect (dependent variable). In simpler terms, the change in the independent variable needs to happen before the change in the dependent variable. For example, if you want to claim that exercising regularly lowers heart disease risk, you need to ensure that the increase in exercise frequency precedes the decrease in heart disease risk, and not vice versa.

3. Elimination of alternative explanations: This is arguably the most challenging criterion. Even if you observe a covariance and temporal precedence, other factors (besides the independent variable) could be influencing the dependent variable. Researchers need to carefully consider and rule out these alternative explanations as much as possible to strengthen the case for causality. For example, in the exercise and heart disease example, factors like diet, genetics, and socioeconomic status might also play a role in heart health, so these would need to be controlled for or accounted for in the analysis.

Additional considerations:

Strength of the association: A strong covariance between variables doesn't automatically imply a causal relationship. The strength of the association (e.g., the magnitude of change in the dependent variable for a given change in the independent variable) is also important to consider.
Replication: Ideally, the findings should be replicated in different contexts and by different researchers to increase confidence in the causal claim.

Remember: Establishing causality requires careful research design, rigorous analysis, and a critical evaluation of all potential explanations. While the three criteria mentioned above are crucial, it's important to interpret causal claims cautiously and consider the limitations of any research study.

What are the percentile and percentile rank?

The terms percentile and percentile rank are sometimes used interchangeably, but they actually have slightly different meanings:

Percentile:

A percentile represents a score that a certain percentage of individuals in a given dataset score at or below. For example, the 25th percentile means that 25% of individuals scored at or below that particular score.
Imagine ordering all the scores in a list, from lowest to highest. The 25th percentile would be the score where 25% of the scores fall below it and 75% fall above it.
Percentiles are often used to describe the distribution of scores in a dataset, providing an idea of how scores are spread out.

Percentile rank:

A percentile rank, on the other hand, tells you where a specific individual's score falls within the distribution of scores. It is expressed as a percentage and indicates the percentage of individuals who scored lower than that particular individual.
For example, a percentile rank of 80 means that the individual scored higher than 80% of the other individuals in the dataset.
Percentile ranks are often used to compare an individual's score to the performance of others in the same group.

Here's an analogy to help understand the difference:

Think of a classroom where students have taken a test.
The 25th percentile might be a score of 70. This means that 25% of the students scored 70 or lower on the test.
If a particular student scored 85, their percentile rank would be 80. This means that 80% of the students scored lower than 85 on the test.

Key points to remember:

Percentiles and percentile ranks are both useful for understanding the distribution of scores in a dataset.
Percentiles describe the overall spread of scores, while percentile ranks describe the relative position of an individual's score within the distribution.
When interpreting percentiles or percentile ranks, it's important to consider the context and the specific dataset they are based on.

What is an outlier?

In statistics, an outlier is a data point that significantly deviates from the rest of the data in a dataset. Think of it as a lone sheep standing apart from the rest of the flock. These values can occur due to various reasons, such as:

Errors in data collection or measurement: Mistakes during data entry, instrument malfunction, or human error can lead to unexpected values.
Natural variation: In some datasets, even without errors, there might be inherent variability, and some points may fall outside the typical range.
Anomalous events: Unusual occurrences or rare phenomena can lead to data points that differ significantly from the majority.

Whether an outlier is considered "interesting" or "problematic" depends on the context of your analysis.

Identifying outliers:

Several methods can help identify outliers. These include:

Visual inspection: Plotting the data on a graph can reveal points that fall far away from the main cluster.
Statistical tests: Techniques like z-scores and interquartile ranges (IQRs) can identify points that deviate significantly from the expected distribution.

Dealing with outliers:

Once you identify outliers, you have several options:

Investigate the cause: If the outlier seems due to an error, try to correct it or remove the data point if justified.
Leave it as is: Sometimes, outliers represent genuine phenomena and should be included in the analysis, especially if they are relevant to your research question.
Use robust statistical methods: These methods are less sensitive to the influence of outliers and can provide more reliable results.

Important points to remember:

Not all unusual data points are outliers. Consider the context and potential explanations before labeling something as an outlier.
Outliers can sometimes offer valuable insights, so don't automatically discard them without careful consideration.
Always document your approach to handling outliers in your analysis to ensure transparency and reproducibility.

What is a histogram?

A histogram is a bar graph that shows the frequency distribution of a continuous variable. It divides the range of the variable into a number of intervals (bins) and then counts the number of data points that fall into each bin. The height of each bar in the histogram represents the number of data points that fall into that particular bin.

The x-axis of the histogram shows the value of the random numbers, and the y-axis shows the frequency of each value. For example, the bar at x = 0.5 has a height of about 50, which means that there are about 50 random numbers in the dataset that have a value of around 0.5.

Histograms are a useful tool for visually exploring the distribution of a dataset. They can help you to see if the data is normally distributed, if there are any outliers, and if there are any other interesting patterns in the data.

Here's an example:

Imagine you have a bunch of socks of different colors, and you want to understand how many of each color you have. You could count them individually, but a quicker way is to group them by color and then count each pile. A histogram works similarly, but for numerical data.

Here's a breakdown:

1. Grouping Numbers:

Imagine a bunch of data points representing things like heights, test scores, or reaction times.
A histogram takes this data and divides it into ranges, like grouping socks by color. These ranges are called "bins."

2. Counting Within Bins:

Just like counting the number of socks in each pile, a histogram counts how many data points fall within each bin.

3. Visualizing the Distribution:

Instead of just numbers, a histogram uses bars to represent the counts for each bin. The higher the bar, the more data points fall within that range.

4. Understanding the Data:

By looking at the histogram, you can see how the data is spread out. Is it mostly clustered in the middle, or are there many extreme values (outliers)?
It's like having a quick snapshot of the overall pattern in your data, similar to how seeing the piles of socks helps you understand their color distribution.

Key things to remember:

Histograms are for continuous data, like heights or test scores, not categories like colors.
The number and size of bins can affect the shape of the histogram, so it's important to choose them carefully.
Histograms are a great way to get a quick overview of your data and identify any interesting patterns or outliers.

What is a bar chart?

A bar chart is a way to visually represent data, but it's specifically designed for categorical data. Imagine you have a collection of objects sorted into different groups, like the colors of your socks or the flavors of ice cream in a carton. A bar chart helps you see how many objects belong to each group.

Here's a breakdown:

1. Categories on the Bottom:

The bottom of the chart shows the different categories your data belongs to, like "red socks," "blue socks," etc. These categories are often represented by labels or short descriptions.

2. Bars for Each Category:

Above each category, a bar extends vertically. The height of each bar represents the count or frequency of items within that category. For example, a high bar for "red socks" means you have many red socks compared to other colors.

3. Comparing Categories:

The main purpose of a bar chart is to compare the values across different categories. By looking at the heights of the bars, you can easily see which category has the most, the least, or how they compare in general.

4. Simple and Effective:

Bar charts are a simple and effective way to present data that is easy to understand, even for people unfamiliar with complex charts.

Key things to remember:

Bar charts are for categorical data, not continuous data like heights or ages.
The length of the bars represents the count or frequency, not the size or value of the items.
Bar charts are great for comparing categories and identifying patterns or trends in your data.

What are measurements of the central tendency?

In statistics, measures of central tendency are numerical values that aim to summarize the "center" or "typical" value of a dataset. They provide a single point of reference to represent the overall data, helping us understand how the data points are clustered around a particular value. Here are the three most common measures of central tendency:

1. Mean: Also known as the average, the mean is calculated by adding up the values of all data points and then dividing by the total number of points. It's a good choice for normally distributed data (bell-shaped curve) without extreme values.

2. Median: The median is the middle value when all data points are arranged in ascending or descending order. It's less sensitive to outliers (extreme values) compared to the mean and is preferred for skewed distributions where the mean might not accurately reflect the typical value.

3. Mode: The mode is the most frequent value in the dataset. It's useful for identifying the most common category in categorical data or the most frequently occurring value in continuous data, but it doesn't necessarily represent the "center" of the data.

Here's a table summarizing these measures and their strengths/weaknesses:

Measure	Description	Strengths	Weaknesses
Mean	Sum of all values divided by number of points	Simple to calculate, reflects all values	Sensitive to outliers, skewed distributions
Median	Middle value after sorting data	Less sensitive to outliers, robust for skewed distributions	Not as informative as mean for normally distributed data
Mode	Most frequent value	Useful for identifying common categories/values	Doesn't represent the "center" of the data, can have multiple modes

Choosing the most appropriate measure of central tendency depends on the specific characteristics and type of your data (categorical or continuous), the presence of outliers, and the distribution of the data points. Each measure offers a different perspective on the "center" of your data, so consider the context and research question when making your selection.

What is the variability of a distribution?

Variability in a distribution refers to how spread out the data points are, essentially indicating how much the values differ from each other. Unlike measures of central tendency that pinpoint a typical value, variability measures describe the "scatter" or "dispersion" of data around the center.

Here are some key points about variability:

Importance: Understanding variability is crucial for interpreting data accurately. It helps you assess how reliable a central tendency measure is and identify potential outliers or patterns in the data.
Different measures: There are various ways to quantify variability, each with its strengths and weaknesses depending on the data type and distribution. Common measures include:
- Range: The difference between the highest and lowest values. Simple but can be influenced by outliers.
- Interquartile Range (IQR): The range between the 25th and 75th percentiles, less sensitive to outliers than the range.
- Variance: The average squared deviation from the mean. Sensitive to extreme values.
- Standard deviation: The square root of the variance, measured in the same units as the data, making it easier to interpret.
Visual Representation: Visualizations like boxplots and histograms can effectively depict the variability in a distribution.

Here's an analogy: Imagine you have a bunch of marbles scattered on the floor. The variability tells you how spread out they are. If they are all clustered together near one spot, the variability is low. If they are scattered all over the room, the variability is high.

Remember, choosing the appropriate measure of variability depends on your specific data and research question. Consider factors like the type of data (continuous or categorical), the presence of outliers, and the desired level of detail about the spread.

What is the range of a measurement?

In the world of measurements, the range refers to the difference between the highest and lowest values observed. It's a simple way to express the spread or extent of a particular measurement. Think of it like the distance between the two ends of a measuring tape – it tells you how much space the measurement covers.

Here are some key points about the range:

Applicable to continuous data: The range is typically used for continuous data, where values can fall anywhere within a specific interval. It wouldn't be meaningful for categorical data like colors or types of fruits.
Easy to calculate: Calculating the range is straightforward. Simply subtract the lowest value from the highest value in your dataset.
Limitations: While easy to calculate, the range has limitations. It only considers the two extreme values and doesn't provide information about how the remaining data points are distributed within that range. It can be easily influenced by outliers (extreme values).

Here are some examples of how the range is used:

Temperature: The range of temperature in a city over a month might be calculated as the difference between the highest and lowest recorded temperatures.
Test scores: The range of scores on an exam could be the difference between the highest and lowest score achieved by students.
Product dimensions: The range of sizes for a particular type of clothing could be the difference between the smallest and largest sizes available.

While the range offers a basic understanding of the spread of data, other measures like the interquartile range (IQR) and standard deviation provide more nuanced information about the distribution and variability within the data.

What is a standard deviation?

A standard deviation (SD) is a statistical measure that quantifies the amount of variation or spread of data points around the mean (average) in a dataset. It expresses how much, on average, each data point deviates from the mean, providing a more informative understanding of data dispersion compared to the simple range.

Formula of the standard deviation:

\[ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2} . \]

where:

s represents the standard deviation
xi is the value of the $i$th data point
xˉ is the mean of the dataset
N is the total number of data points

Key points:

Unit: The standard deviation is measured in the same units as the original data, making it easier to interpret compared to the variance (which is squared).
Interpretation: A larger standard deviation indicates greater spread, meaning data points are further away from the mean on average. Conversely, a smaller standard deviation suggests data points are clustered closer to the mean.
Applications: Standard deviation is used in various fields to analyze data variability, assess normality of distributions, compare groups, and perform statistical tests.

Advantages over the range:

Considers all data points: Unlike the range, which only focuses on the extremes, the standard deviation takes into account every value in the dataset, providing a more comprehensive picture of variability.
Less sensitive to outliers: While outliers can still influence the standard deviation, they have less impact compared to the range, making it a more robust measure.

Remember:

The standard deviation is just one measure of variability, and it's essential to consider other factors like the shape of the data distribution when interpreting its meaning.
Choosing the appropriate measure of variability depends on your specific data and research question.

Understanding data: distributions, connections and gatherings

In short: Data

Data is any collection of facts, statistics, or information that can be used for analysis or decision-making. It can be raw or processed, and it can be in the form of numbers, text, images, or sounds.

3318 reads

Practice Questions for Data: distributions, connections and gatherings

Questions

1. What are the three conditions (provisions) for causality?

2. Immediately after the exam M&T, we determined of 11 randomly chosen students how many of the 40 questions they answered correctly. The results are presented in the stemplot below.

0	2	6	6
1	0	8	9
2	1
3	4	6	8	9

Determine the median.

3. What is the median of the scores 4-6-8-10-18?

4. What is the median of the following numbers: 8, 9, 14, 15?

5. The following five terms are used frequently to summarize the characteristics of a statistical variable: minimum, maximum, first quartile, third quartile, median. What is the right order, from small to large?

6. What does ‘error variance’ mean?

7. How can the standard deviation be computed from the variance?

8. To provide insight in the association between number of cigarettes smoked per day and the time needed to run 2 kilometers, you make a chart based on the data for a number of participants. Which variable do you put on the x-axis?

Answers

1. What are the three conditions (provisions) for causality?

The variables should covary together
The cause should precede the consequence
The influence of other variables should be eliminated

2. Immediately after the exam M&T, we determined of 11 randomly chosen students how many of the 40 questions they answered correctly. The results are presented in the stemplot below.

0	2	6	6
1	0	8	9
2	1
3	4	6	8	9

Determine the median.

3. What is the median of the scores 4-6-8-10-18?

4. What is the median of the following numbers: 8, 9, 14, 15?

11.5

Minimum, first quartile, median, third quartile, maximum

6. What does ‘error variance’ mean?

Error variance is the variance that can not be explained by

Access:

Public

2413 reads

Video for understanding Data distributions and gatherings

Extra clarification with basic concepts of data distibutions and gatherings

Range, variance and standard deviation as measures of dispersion | Khan Academy

Visit the author's profile page.

2142 reads

Selected contributions for Data: distributions, connections and gatherings

Selected contributions of other WorldSupporters on the topic of Data: distributions, connections and gatherings

Which kinds of samples and variables are possible? – Chapter 2

2.1 Which kinds of variables can be measured?
2.2 How does randomization work?
2.3 How do you control variability and bias?
2.4 Which methods can be used for probability sampling?

2.1 Which kinds of variables can be measured?

All characteristics of a subject that can be measured are variables. These characteristics can vary between different subjects within a sample or within a population (like income, sex, opinion). The use of variables is to indicate the variability of a value. As as example, the number of beers consumed per week by students. The values of a variable constitute the measurement scale. Several measurement scales, or ways to differ variables, are possible.

The most important divide is that between quantitative and categorical variables. Quantitative variables are measured in numerical values, such as age, numbers of brothers and sisters, income. Categorical variables (also called qualitative variables) are measured in categories, such as sex, marital status, religion. The measurement scales are tied to statistical analyses: for quantitative variables it is possible to calculate the mean (i.e. the average age), but for categorical variables this isn't possible (i.e. there is no average sex.

Also there are four measurement scales: nominal, ordinal, interval and ratio. Categorical variables have nominal or ordinal scales.

The nominal scale is purely descriptive. For instance with sex as a variable, the possible values are man and woman. There is no order or hierarchy, one value isn't higher than the other.

The ordinal scale on the other hand assumes a certain order. For instance happiness. If the possible values are unhappy, considerably unhappy, neutral, considerably happy and ecstatic, then there is a certain order. If a respondent indicates to be neutral, this is happier than considerably unhappy, which in turn is happier than unhappy. Important is that the distances between the values cannot be measured, this is the difference between ordinal and interval.

Quantitative variables have an interval or ratio scale. Interval means that there are measurable differences between the values. For instance temperate in Celcius. There is an order (30 degrees is more than 20) and the difference is clearly measurable and consistent.

The difference between interval and ratio is that for an interval scale the value can't be zero, but for a ratio scale it can be. So the ratio scale has numerical values, with a certain order, with measurable differences and with zero as a possible value. Examples are percentage or income.

Furthermore there are discrete and continuous variables. A variable is discrete when the possible values can only be limited, separate numbers. A variable is continuous when the values can be anything possible. For instance the number of brothers and sisters is discrete, because it's not possible to have 2.43 brother/sister. And for instance weight is continuous, because it's possible to weigh 70 kilo

Access:

Public

What are the main measures and graphs of descriptive statistics? - Chapter 3

3.1 Which tables and graphs display data?
3.2 How do you describe the center of data using mean, median and mode?
3.3 How can you measure the variability of data?
3.4 How can you measure quartiles and other positions on a distribution?
3.5 How do you call statistics for multiple variables?
3.6 Which letters are used in formulas to mark the difference between the sample and the population?

3.1 Which tables and graphs display data?

Descriptive statistics serves to create an overview or summary of data. There are two kinds of data, quantitative and categorical, each has different descriptive statistics.

To create an overview of categorical data, it's easiest if the categories are in a list including the frequence for each category. To compare the categories, the relative frequencies are listed too. The relative frequency of a category shows how often a subject falls within this category compared to the sample. This can be calculated as a percentage or a proportion. The percentage is the total number of observations within a certain category, divided by the total number of observations * 100. Calculating a proportion works largely similar, but then the number isn't multiplied by 100. The sum of all proportions should be 1.00, the sum of all percentages should be 100.

Frequencies can be shown using a frequency distribution, a list of all possible values of a variable and the number of observations for each value. A relative frequency distributions also shows the comparisons with the sample.

Example (relative) frequency distribution:

Gender	Frequence	Proportion	Percentage
Male	150	0.43	43%
Female	200	0.57	57%
Total	350 (=n)	1.00	100%

Aside from tables also other visual displays are used, such as bar graphs, pie charts, histograms and stem-and-leaf plots.

A bar graph is used for categorical variables and uses a bar for each category. The bars are separated to indicate that the graph doesn't display quantitative variables but categorical variables.

A pie chart is also used for categorical variables. Each slice represents a category. When the values are close together, bar graphs show the differences more clearly than pie charts.

Frequency distributions and other visual displays are also used for quantitative variables. In that case, the categories are replaced by intervals. Each interval has a frequence, a proportion and a percentage.

A histogram is a graph of the frequency distribution for a quantitative variable. Each value is represented by a bar, except when there are many values, then it's easier to divide them into intervals.

A stem-and-leaf plot

Access:

Public

What role do probability distributions play in statistical inference? – Chapter 4

4.1 What are the basic rules of probability?
4.2 What is the difference in probability distributions for discrete and continuous variables?
4.3 How does the normal distribution work exactly?
4.4 What is the difference between sample distributions and sampling distributions?
4.5 How do you create the sampling distribution for a sample mean?
4.6 What is the connection between the population, the sample data and the sampling distribution?

4.1 What are the basic rules of probability?

Randomization is important for collecting data, the idea that the possible observations are known but it's yet unknown which possibility will prevail. What will happen, depends on probability. The probability is the proportion of the number of times that a certain observation is prevalent in a long sequence of similar observations. The fact that the sequence is long, is important, because the longer the sequence, the more accurate the probability. Then the sample proportion becomes more like the population proportion. Probabilities can also be measured in percentages (such as 70%) instead of proportions (such as 0.7). A specific branch within statistics deals with subjective probabilities, called Bayesian statistics. However, most of statistics is about regular probabilities.

A probability is written like P(A), where P is the probability and A is an outcome. If two outcomes are possible and they exclude each other, then the chance that B happens is 1- P(A).

Imagine research about people's favorite colors, whether this is mostly red and blue. Again the assumption is made that the possibilities exclude each other without overlapping. The chance that someone's favorite color is red (A) or blue (B), is P(A of B) = P (A) + P (B).

Next, imagine research that encompasses multiple questions. The research seeks to investigate how many married people have kids. Then you can multiply the chance that someone is married (A) with the chance that someone has kids (B).The formula for this is: P(A and B) = P(A) * P(B if also A). Because there is a connection between A and B, this is called a conditional probability.

Now, imagine researching multiple possibilities that are not connected. The chance that a random person likes to wear sweaters (A) and the chance that another random person likes to wear sweaters (B), is P (A and B) = P (A) x P (B). These are independent probabilities.

4.2 What is the difference in probability distributions for discrete and continuous variables?

A random variable means that the outcome differs for each observation, but mostly this is just referred to as a variable. While a discrete variable has set possible values, a continuous variable can assume any value. Because a probability distribution

Access:

Public

Call to action: Do you have statistical knowledge and skills and do you enjoy helping others while expanding your international network?

People who share their statistical knowledge and skills can contact WorldSupporter Statistics for more exposure to a larger audience. Relevant contributions to specific WorldSupporter Statistics Topics are highlighted per topic so that users who are interested in certain statistical topics can broaden their theoretical perspective and international network.

Do you have statistical knowledge and skills and do you enjoy helping others while expanding your international network? Would you like to cooperate with WorldSupporter Statistics? Please contact WorldSupporter with some basics (Where do you live? What's your (statistical) background? How are you helping others at the moment? And how do you see that in relation to WorldSupporter Statistics?)

Understanding data: distributions, connections and gatherings

In short: Data

Data is any collection of facts, statistics, or information that can be used for analysis or decision-making. It can be raw or processed, and it can be in the form of numbers, text, images, or sounds.

1836 reads

Knowledge and assistance for understanding data