Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Book summary
USING DATA TO ANSWER STATISTICAL QUESTIONS
The information we gather with experiments and surveys is collectively called data. Statistics is the art and science of learning from data. Statistical problem solving consists of four steps: formulate a statistical question, collect data, analyze the data, and interpret the results.
The three main components of statistics for answering a statistical question are design (planning how to obtain the data), description (summarizing the data) and inference (making decisions and predictions about the population based on the data).
Probability is a framework for quantifying how likely various possible outcomes are.
SAMPLE VERSUS POPULATION
The entities that are measured in a study are called the subjects. This usually means people, but it can also be schools, countries or days. The population is the set of all the subjects of interest. In practice, we usually have data for only some of the subjects who belong to that population. These subjects are called a sample.
Descriptive statistics refers to methods for summarizing the collected data. The summaries usually consist of graphs and numbers such as averages and percentages. Inferential statistics are used when data are available from a sample only, but we want to make a decision or prediction about the entire population. Inferential statistics refers to methods of making decisions or predictions about a population, based on data obtained from a sample of that population.
A parameter is a numerical summary of the population. A statistic is a numerical summary of a sample taken from the population. The true parameter values are almost always unknown, thus we use sample statistics to estimate the parameter values.
A sample is random when everyone in the population has the same chance of being included in the sample. Random sampling allows us to make powerful inferences about populations. The margin of error is a measure of the expected variability from one random sample to the next random sample.
The formula for calculating the approximate margin of error is 1/√n (expressed as a proportion; multiply by 100 for a percentage). In this formula, ‘n’ is the number of subjects in the sample.
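The approximate margin of error 1/√n can be sketched in a few lines of Python (the poll size of 1,000 is an illustrative value, not from the text):

```python
import math

def approx_margin_of_error(n):
    """Approximate margin of error for a random sample of size n,
    expressed as a proportion (multiply by 100 for a percentage)."""
    return 1 / math.sqrt(n)

# A poll of 1,000 people has a margin of error of roughly 3 percentage points.
moe = approx_margin_of_error(1000)
print(round(moe * 100, 1))  # 3.2
```

Note how quadrupling the sample size only halves the margin of error, since n enters through a square root.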
DIFFERENT TYPES OF DATA
A variable is any characteristic observed in a study. The data values that we observe for a variable are called observations. A variable can be categorical or quantitative.
Key features for describing quantitative variables are the centre and the variability (spread) of the data (e.g: the average number of hours spent watching TV every day). The key feature for describing categorical variables is the relative number of observations in the various categories (e.g: the percentage of days in a year that were sunny).
Quantitative variables can be discrete or continuous. A quantitative variable is discrete if its possible values form a set of separate numbers, such as 0, 1, 2, 3 (e.g: the number of pets in a household). A quantitative variable is continuous if its possible values form an interval, such as 0.16, 0.13, 2.32 (e.g: weight: 68.3 kg).
The distribution of a variable describes how the observations fall (are distributed) across the range of possible values. The modal category is the category with the largest frequency.
A frequency table is a listing of possible values for a variable, together with the number of observations for each value.
Category   | A     | B     | C
Frequency  | 17    | 23    | 9
Proportion | 0.347 | 0.469 | 0.184
Percentage | 34.7% | 46.9% | 18.4%
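The proportion and percentage rows of a frequency table follow directly from the counts, as this short Python sketch shows (the counts 17, 23 and 9 match the example table above):

```python
# Derive proportions and percentages from the raw frequency counts.
counts = {"A": 17, "B": 23, "C": 9}
total = sum(counts.values())  # 49

proportions = {cat: n / total for cat, n in counts.items()}
percentages = {cat: round(100 * p, 1) for cat, p in proportions.items()}

print(percentages)  # {'A': 34.7, 'B': 46.9, 'C': 18.4}
```

The proportions always sum to 1 (and the percentages to 100), which is a quick sanity check on any frequency table.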
THE ASSOCIATION BETWEEN TWO CATEGORICAL VARIABLES
When analysing data, the first step is to distinguish between the response variable and the explanatory variable. The response variable is the outcome variable on which comparisons are made. If the explanatory variable is categorical, it defines the groups to be compared with respect to values of the response variable. If the explanatory variable is quantitative, it defines the different numerical values to be compared with respect to values of the response variable. The explanatory variable should explain the response variable (e.g: survival status is the response variable and smoking status is the explanatory variable).
An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.
A contingency table is a display for two categorical variables. A conditional proportion is a proportion computed conditional on a particular value of the other variable (e.g: the proportion of ‘yes’ responses within one group); it can also be reported as a percentage. A proportion computed from the row or column totals (e.g: the percentage of the total that answered ‘no’) is called a marginal proportion.
A clear explanatory/response relationship dictates which way we compute the conditional proportions: within categories of the explanatory variable. Comparing these conditional proportions is useful for determining whether there is an association. If the conditional proportions are the same, one variable is independent of the other.
THE ASSOCIATION BETWEEN TWO QUANTITATIVE VARIABLES
We examine a scatterplot to study the association between two quantitative variables. There is a difference between a positive association and a negative association. With a positive association, y tends to go up as x goes up. With a negative association, y tends to go down as x goes up.
Correlation describes the strength of the linear association. The correlation (r) summarizes the direction of the association between two quantitative variables and the strength of its linear trend. It can take a value between -1 and +1. A positive value for r indicates a positive association and a negative value for r indicates a negative association. The closer r is to ±1, the closer the data points fall to a straight line and the stronger the linear association is. The closer r is to 0, the weaker the linear association is.
The properties of the correlation: it always falls between -1 and +1; it takes the same value regardless of which variable is treated as the response; it has no unit of measurement; and its value does not change when the units of the variables change.
The correlation r can be calculated as follows:
r = [1/(n − 1)] Σ [(x − x̄)/s_x] [(y − ȳ)/s_y]
Here n is the number of points, x̄ and ȳ are the sample means, and s_x and s_y are the sample standard deviations of x and y.
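The correlation formula r = [1/(n − 1)] Σ (z_x · z_y) translates directly into Python (the data points are illustrative):

```python
import math

def correlation(xs, ys):
    """Sample correlation: average product of z-scores,
    r = (1/(n-1)) * sum of ((x - xbar)/s_x) * ((y - ybar)/s_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return sum((x - mx) / sx * (y - my) / sy
               for x, y in zip(xs, ys)) / (n - 1)

print(correlation([1, 2, 3], [2, 4, 6]))  # 1.0 (perfect positive linear trend)
print(correlation([1, 2, 3], [6, 4, 2]))  # -1.0 (perfect negative linear trend)
```

Because the formula standardizes both variables, the result is unitless and symmetric in x and y.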
HOW PROBABILITY QUANTIFIES RANDOMNESS
Probability is the way we quantify uncertainty. It measures the chances of the possible outcomes for random phenomena. A random phenomenon is an everyday occurrence for which the outcome is uncertain. With random phenomena, the proportion of times that something happens is highly random and variable in the short run, but very predictable in the long run. The law of large numbers states that as the number of trials increases, the proportion of occurrences of any outcome approaches a given number. The probability of a particular outcome is the proportion of times that the outcome would occur in a long run of observations.
Different trials of a random phenomenon are independent if the outcome of any one trial is not affected by the outcome of any other trial (e.g: if you have three children who are all boys, the chance that the next child is a girl is not higher, but still ½).
In the subjective definition of probability, the probability of an outcome is a personal probability, based on subjective information rather than objective data alone. The branch of statistics that uses subjective probability is called Bayesian statistics.
FINDING PROBABILITIES
The sample space is the set of all possible outcomes (e.g: for the sex of a baby, the sample space is {boy, girl}). An event is a subset of the sample space, corresponding to a particular outcome or a group of possible outcomes. When all outcomes are equally likely, the probability of an event is:
P(event) = (number of outcomes in the event) / (number of outcomes in the sample space)
For example, the probability of throwing a 6 with a fair die is P(6) = 1/6.
The rest of the sample space for event A is called the complement of A. The complement of an event consists of all outcomes in the sample space that are not in the event.
Two events, A and B, are disjoint if they do not share any outcomes in common. The event that both A and B occur is called the intersection of A and B. The event that A or B occurs is called the union of A and B.
There are three general rules for calculating probabilities: the complement rule, P(not A) = 1 − P(A); the addition rule, P(A or B) = P(A) + P(B) − P(A and B); and, for independent events, the multiplication rule, P(A and B) = P(A) × P(B).
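The standard complement and addition rules can be checked concretely with a fair six-sided die (the specific events are illustrative; exact fractions avoid floating-point noise):

```python
# Probability rules illustrated with a fair six-sided die.
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

def p(event):
    """P(event) = outcomes in event / outcomes in sample space."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}
high = {5, 6}

# Complement rule: P(not A) = 1 - P(A)
assert 1 - p(even) == p(sample_space - even)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
assert p(even | high) == p(even) + p(high) - p(even & high)

print(p(even | high))  # 2/3
```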
SUMMARIZING POSSIBLE OUTCOMES AND THEIR PROBABILITIES
All possible outcomes and their probabilities are summarized in a probability distribution. Two important probability distributions are the normal distribution and the binomial distribution. A random variable is a numerical measurement of the outcome of a random phenomenon. The probability distribution of a discrete random variable assigns a probability to each possible value. Numerical summaries of the population are called parameters, and a population distribution is a type of probability distribution: the one that applies when selecting a subject at random from a population.
The formula for the mean of a probability distribution for a discrete random variable is:
μ= ΣxP(x)
It is also called a weighted average, because some outcomes are likelier to occur than others, so a regular mean would be insufficient here. The mean of a probability distribution of random variable X is also called the expected value of X. The standard deviation of a probability distribution measures the variability from the mean. It describes how far values of the random variable fall, on the average, from the expected value of the distribution. A continuous variable is measured in a discrete manner, because of rounding. A probability distribution for a continuous random variable is used to approximate the probability distribution for the possible rounded values.
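The weighted-average formula μ = ΣxP(x), and the matching standard deviation, can be sketched in Python (the distribution of heads in two fair coin flips is an illustrative example):

```python
import math

# Number of heads in two fair coin flips: P(0)=0.25, P(1)=0.50, P(2)=0.25.
dist = {0: 0.25, 1: 0.50, 2: 0.25}

# Mean (expected value): mu = sum of x * P(x) -- a weighted average.
mu = sum(x * p for x, p in dist.items())

# Standard deviation: spread of the values around the expected value.
sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in dist.items()))

print(mu)                 # 1.0
print(round(sigma, 4))    # 0.7071
```

Weighting by P(x) is what makes this differ from an ordinary mean: likelier outcomes pull the expected value toward themselves.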
PROBABILITIES FOR BELL-SHAPED DISTRIBUTIONS
The z-score for a value x of a random variable is the number of standard deviations that x falls from the mean. It is calculated as: z = (x − μ)/σ.
The standard normal distribution is the normal distribution with mean 0 and standard deviation 1. It is the distribution of z-scores.
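The z-score formula z = (x − μ)/σ is a one-liner in Python (the IQ-style mean of 100 and standard deviation of 15 are illustrative values, not from the text):

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x falls from the mean."""
    return (x - mu) / sigma

# A score of 130 on a scale with mean 100 and sd 15:
print(z_score(130, 100, 15))  # 2.0 (two standard deviations above the mean)
print(z_score(85, 100, 15))   # -1.0 (one standard deviation below the mean)
```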
PROBABILITIES WHEN EACH OBSERVATION HAS TWO POSSIBLE OUTCOMES
An observation is binary if it has one of two possible outcomes (e.g: accept or decline, yes or no). A random variable X that counts the number of observations of a particular type has a probability distribution called the binomial distribution. The conditions for a binomial distribution are: each of the n trials has two possible outcomes (‘success’ or ‘failure’); each trial has the same probability of success p; and the trials are independent.
The formula for the binomial probabilities for any n is: P(x) = [n!/(x!(n − x)!)] p^x (1 − p)^(n − x), for x = 0, 1, 2, …, n.
The binomial distribution is valid if the sample size is less than 10% of the population. The mean and standard deviation of the binomial distribution are: μ = np and σ = √(np(1 − p)).
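The binomial probability formula and its mean and standard deviation can be computed with the standard library alone (n = 10 coin flips is an illustrative example):

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p^x * (1-p)^(n-x)."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.5
mean = n * p                     # mu = n*p
sd = math.sqrt(n * p * (1 - p))  # sigma = sqrt(n*p*(1-p))

print(binomial_pmf(5, n, p))     # 0.24609375
print(mean, round(sd, 3))        # 5.0 1.581
```

`math.comb(n, x)` computes the binomial coefficient n!/(x!(n − x)!), the number of ways x successes can be arranged among n trials.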
HOW SAMPLE PROPORTIONS VARY AROUND THE POPULATION PROPORTION
The sampling distribution of a statistic is the probability distribution that specifies probabilities for the possible values the statistic can take. The population distribution is the distribution from which we take the sample; the values of its parameters are fixed but usually unknown. The data distribution is the distribution of the sample data. The sampling distribution is the distribution of a sample statistic, such as a sample proportion. Sampling distributions describe the variability that occurs from sample to sample.
For a random sample of size n from a population with proportion p of outcomes in a particular category, the sampling distribution of the sample proportion in that category has: mean p and standard deviation √(p(1 − p)/n).
For a large sample size n, the binomial distribution is approximately normal. The central limit theorem states that the sampling distribution of the sample mean x̄ has approximately a normal distribution for large n. This result applies no matter what the shape of the population distribution from which the samples are taken. The standard deviation of this sampling distribution is σ/√n, where σ is the population standard deviation.
The larger the sample, the closer the sample mean tends to fall to the population mean.
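The two standard-deviation formulas for sampling distributions, √(p(1 − p)/n) for a proportion and σ/√n for a mean, are easy to sketch (the parameter values are illustrative):

```python
import math

def se_proportion(p, n):
    """Standard deviation of the sampling distribution of a
    sample proportion: sqrt(p*(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

def se_mean(sigma, n):
    """Standard deviation of the sampling distribution of a
    sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

print(se_proportion(0.5, 100))  # 0.05
print(se_mean(10, 25))          # 2.0
```

Both formulas shrink with √n, which is why larger samples give sample statistics that fall closer to the population parameter.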
POINT AND INTERVAL ESTIMATES OF POPULATION PARAMETERS
A point estimate is a single number that is our best guess for the parameter (e.g: 25% of all Dutch people are taller than 1.80 m). An interval estimate is an interval of numbers within which the parameter value is believed to fall (e.g: between 20% and 30% of Dutch people are taller than 1.80 m). The margin of error determines the lower and upper bounds of the interval.
A good estimator of a parameter has two properties: it is unbiased (its sampling distribution is centred at the parameter) and it is precise (its standard error is as small as possible).
An interval estimate is designed to contain the parameter with some chosen probability, such as 0.95. Confidence intervals are interval estimates that contain the parameter with a certain degree of confidence. A confidence interval is an interval containing the most believable values for a parameter. The probability that this method produces an interval that contains the parameter is called the confidence level. A sampling distribution of a sample proportion gives the possible values for the sample proportion and their probabilities and is a normal distribution if np is larger than 15 and n(1-p) is larger than 15. The margin of error measures how accurate the point estimate is likely to be in estimating a parameter.
CONSTRUCTING A CONFIDENCE INTERVAL TO ESTIMATE A POPULATION PROPORTION
The point estimate of the population proportion is the sample proportion. The standard error is the estimated standard deviation of a sampling distribution. The formula for the standard error of a sample proportion is: se = √(p̂(1 − p̂)/n).
The greater the confidence level, the greater the interval. The margin of error decreases with bigger samples, because the standard error decreases with bigger samples. The larger the sample, the narrower the interval. If using a 95% confidence interval over time, then 95% of the intervals would give correct results, containing the population proportion.
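A 95% confidence interval for a proportion, p̂ ± 1.96 × √(p̂(1 − p̂)/n), can be sketched as follows (the sample proportion 0.55 and n = 1000 are illustrative values):

```python
import math

def ci_proportion(p_hat, n, z=1.96):
    """Confidence interval p_hat +/- z * se,
    with se = sqrt(p_hat*(1-p_hat)/n); z=1.96 gives 95% confidence."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

low, high = ci_proportion(0.55, 1000)
print(round(low, 3), round(high, 3))  # 0.519 0.581
```

Rerunning with n = 4000 roughly halves the width of the interval, matching the point that larger samples give narrower intervals.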
CONSTRUCTING A CONFIDENCE INTERVAL TO ESTIMATE A POPULATION MEAN
The standard error for estimating the population mean has the following formula: se = s/√n, where s is the sample standard deviation.
The t-score is like a z-score, but a bit larger, and comes from a bell-shaped distribution that has slightly thicker tails than a normal distribution. The distribution that uses the t-score and the standard error, rather than the z-score and the standard deviation, is called the t-distribution. The standard deviation of the t-distribution is a bit larger than 1, with the precise value depending on what is called the degrees of freedom. For inference about a mean, the t-distribution has df = n − 1 degrees of freedom.
STEPS FOR PERFORMING A SIGNIFICANCE TEST
A hypothesis is a statement about the population. A significance test is a method for using data to summarize the evidence about a hypothesis. The null hypothesis (H0) is a statement that the parameter takes a particular value (e.g: probability of getting a baby girl: p = 0.482). The alternative hypothesis (Ha) states that the parameter falls in some alternative range of values. A significance test has five steps: (1) assumptions, (2) hypotheses, (3) test statistic, (4) P-value and (5) conclusion.
SIGNIFICANCE TESTS ABOUT PROPORTIONS
The steps of a significance test are the same for proportions. The main assumption is that the sample size is large enough that the sampling distribution is approximately normal. The hypotheses for a significance test about a proportion are:
H0: p = p0 and Ha: p > p0 or Ha: p < p0
This is called a one-sided alternative hypothesis, because its values fall only on one side of the null hypothesis value. A two-sided alternative hypothesis has the form:
Ha: p ≠ p0
The test statistic of a significance test about a proportion is: z = (p̂ − p0)/se0, where se0 = √(p0(1 − p0)/n) is the standard error computed under the null hypothesis.
The P-value of a test statistic of a significance test about proportions is the left- or right-tail probability of a test statistic value even more extreme than observed. Smaller P-values indicate stronger evidence against the null hypothesis, because the data would be more unusual if the null hypothesis were true. In a two-sided test, the P-value is the probability of a single tail doubled. The significance level is a number such that we reject H0 if the P-value is less than or equal to that number. The most common significance level is 0.05. If the data provide evidence to reject H0 and accept Ha, the result is called statistically significant. If H0 is not rejected, this does not mean that H0 is true; it only means that the data do not provide strong evidence against it.
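The one-proportion z test, including the "single tail doubled" P-value, can be sketched with the standard library (the sample values p̂ = 0.55, p0 = 0.50 and n = 400 are illustrative; the normal CDF is built from `math.erf`):

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def z_test_proportion(p_hat, p0, n):
    """z = (p_hat - p0)/se0, with se0 computed under H0,
    and the two-sided P-value (single tail doubled)."""
    se0 = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se0
    p_two_sided = 2 * (1 - normal_cdf(abs(z)))
    return z, p_two_sided

z, p_value = z_test_proportion(0.55, 0.50, 400)
print(round(z, 2), round(p_value, 3))  # 2.0 0.046
```

Here the P-value falls just below 0.05, so at the usual significance level the result would be called statistically significant.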
CATEGORICAL RESPONSE: COMPARING TWO PROPORTIONS
Bivariate methods are the general category of statistical methods used when we have two variables. The outcome variable on which comparisons are made is called the response variable; the binary variable that specifies the groups is the explanatory variable. With independent samples, the observations in one sample are independent of those in the other sample. If two samples have the same subjects, they are dependent. If each subject in one sample is matched with a subject in the other sample, the data consist of matched pairs and are dependent as well.
The formula for the standard error for comparing two proportions is: se = √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2).
A 95% confidence interval for the difference between two population proportions has the following formula: (p̂1 − p̂2) ± 1.96 × se.
The proportion (p̂) is called a pooled estimate, since it pools the total number of successes and total number of observations from the two samples. This uses the presumption p1 = p2. The test statistic is: z = (p̂1 − p̂2)/se0.
The standard error for this test statistic is: se0 = √(p̂(1 − p̂)(1/n1 + 1/n2)), where p̂ is the pooled estimate.
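The pooled two-proportion z test can be sketched in Python (the counts 60/100 and 40/100 are illustrative):

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """z test statistic for H0: p1 = p2, using the pooled estimate
    p_hat = (x1 + x2)/(n1 + n2) in the null standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se0 = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se0

z = two_prop_z(60, 100, 40, 100)
print(round(z, 2))  # 2.83
```

Pooling is appropriate only for the test, where H0 presumes p1 = p2; the confidence interval uses the unpooled standard error instead.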
QUANTITATIVE RESPONSE: COMPARING TWO MEANS
The standard error for comparing two means has the following formula: se = √(s1²/n1 + s2²/n2).
A 95% confidence interval for the difference between two population means has the following formula: (x̄1 − x̄2) ± t.025 × se.
The confidence interval for the difference between two population means uses the t-distribution, not the z-distribution. Interpreting a confidence interval for a difference of means uses the following criteria: if the interval contains 0, it is plausible that the population means are equal; if all values in the interval are positive (or all negative), we can infer which population mean is larger.
The test statistic of a significance test comparing two population means uses the following formula: t = (x̄1 − x̄2 − 0)/se. It subtracts zero because the null hypothesis states that there is no difference between the groups.
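The two-sample t statistic with se = √(s1²/n1 + s2²/n2) can be sketched directly from the summary statistics (the means, standard deviations and group sizes below are illustrative):

```python
import math

def two_sample_t(mean1, s1, n1, mean2, s2, n2):
    """t = (x1bar - x2bar - 0) / se, with
    se = sqrt(s1^2/n1 + s2^2/n2)."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return (mean1 - mean2 - 0) / se

t = two_sample_t(mean1=7.0, s1=2.0, n1=25, mean2=6.0, s2=2.0, n2=25)
print(round(t, 2))  # 1.77
```

The "- 0" is kept explicit to mirror the null hypothesis value of zero difference between the two population means.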
OTHER WAYS OF COMPARING MEANS AND COMPARING PROPORTIONS
If it is reasonable to expect that the variability is similar in the two groups, a pooled estimate of the common standard deviation can be used in the test and the confidence interval.
INDEPENDENCE AND DEPENDENCE (ASSOCIATION)
Conditional percentages refer to a sample data distribution conditional on a category; together they form the conditional distribution. If the conditional distributions of the response are the same in every category of the other variable, the variables are independent. If they differ, the variables are dependent. Independence and dependence refer to the population, so even if the conditional proportions differ slightly in a sample, the variables could still be independent in the population.
TESTING CATEGORICAL VARIABLES FOR INDEPENDENCE
The expected cell count is the mean of the distribution for the count in any particular cell. The formula for the expected cell count is: expected cell count = (row total × column total)/total sample size.
The chi-squared statistic summarizes how far the observed cell counts in a contingency table fall from the expected cell counts for a null hypothesis. It is the test statistic for the test of independence. The formula for the chi-squared statistic is: X² = Σ (observed count − expected count)²/expected count.
The sampling distribution using the chi-squared statistic is called the chi-squared probability distribution. The chi-squared probability distribution has several properties: it falls on the nonnegative part of the line only; its mean equals its degrees of freedom; and it is skewed to the right, with the skew decreasing as the degrees of freedom increase.
The degrees of freedom in a table with r rows and c columns can be calculated as: df = (r − 1) × (c − 1).
If a response variable is identified and the population conditional distributions are identical, they are said to be homogeneous. The chi-squared test is then referred to as a test of homogeneity. The degrees of freedom value in a chi-squared test indicates how many parameters are needed to determine all the comparisons for describing the contingency table. The chi-squared test can test for independence, but it cannot provide information about the strength and the direction of the associations and provide information about the practical significance, only about the statistical significance. When testing particular proportion values for a categorical variable, the chi-squared statistic is referred to as a goodness-of-fit statistic. The statistic summarizes how well the hypothesized values predict what happens with the observed data.
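The expected-count and chi-squared formulas can be sketched for a small table with the standard library alone (the 2×2 counts are illustrative):

```python
# Chi-squared statistic for a contingency table:
# expected = (row total * column total) / grand total,
# X^2 = sum of (observed - expected)^2 / expected.
observed = [[30, 20],
            [10, 40]]

row_totals = [sum(row) for row in observed]        # [50, 50]
col_totals = [sum(col) for col in zip(*observed)]  # [40, 60]
grand = sum(row_totals)                            # 100

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / grand
        chi2 += (obs - exp) ** 2 / exp

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r-1)(c-1)
print(round(chi2, 2), df)  # 16.67 1
```

A value this far above the df of 1 (the mean of the chi-squared distribution under H0) would give a very small P-value, so the hypothesis of independence would be rejected here.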
DETERMINING THE STRENGTH OF THE ASSOCIATION
A measure of association is a statistic or a parameter that summarizes the strength of the dependence between two variables. The association can be measured by the difference of two proportions: p1 − p2.
The ratio of two proportions is also a measure of association. This is called the relative risk: relative risk = p1/p2.
The relative risk has several properties: it is always nonnegative; a relative risk of 1 means the variables are independent; and the further it falls from 1, the stronger the association.
MODEL HOW TWO VARIABLES ARE RELATED
A regression line is a straight line that predicts the value of a response variable ‘y’ from the value of an explanatory variable ‘x’. The correlation is a summary measure of association. The regression line uses the following formula: ŷ = a + bx, where a is the y-intercept and b is the slope.
The data are plotted before a regression line is fitted, because the line can be strongly influenced by outliers. The regression equation is often called a prediction equation. The difference y − ŷ between an observed outcome y and its predicted value ŷ is the prediction error, called the residual. The average of the residuals is zero. The regression line has a smaller sum of squared residuals than any other line; it is therefore called the least squares line. The population regression equation has the following formula: μ_y = α + βx.
This formula is a model. A model is a simple approximation for how variables relate in a population. The probability distribution of y values at a fixed value of x is a conditional distribution (e.g: the mean annual income for people with 12 years of education).
DESCRIBE STRENGTH OF ASSOCIATION
Correlation does not differentiate between response and explanatory variables. The formula for the slope uses the correlation and can be calculated as: b = r(s_y/s_x).
Using this formula, the y-intercept can be calculated: a = ȳ − b x̄.
The slope can’t be used to determine the strength of the association, because it depends on the units of measurement. The correlation is a standardized version of the slope. It can be recovered from the slope as: r = b(s_x/s_y).
A property of the correlation is that at any particular x value, the predicted value of y is relatively closer to its mean than x is to its mean. If a particular ‘x’ value falls 2.0 standard deviations from the mean with a correlation of 0.80, then the predicted ‘y’ is ‘r’ times that many standard deviations from its mean, so the predicted ‘y’ would be 0.80 × 2.0 = 1.6 standard deviations from its mean. The predicted ‘y’ is relatively closer to its mean than ‘x’ is to its mean. This is regression toward the mean: if the first observation is extreme, the second observation will tend to be less extreme, closer to the mean.
The summary of prediction error when predicting ‘y’ using ‘x’ with the regression equation is called the residual sum of squares: Σ(y − ŷ)².
The measure r squared is interpreted as the proportional reduction in error (e.g: if r squared = 0.40, the error using ŷ to predict y is 40% smaller than the error using ȳ to predict y). The formula for r squared is: r² = [Σ(y − ȳ)² − Σ(y − ŷ)²]/Σ(y − ȳ)².
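The least squares line and r² as a proportional reduction in error can be sketched together (the four data points are illustrative):

```python
def least_squares(xs, ys):
    """Least squares line yhat = a + b*x, plus r^2 computed as the
    proportional reduction in error: (ss_tot - ss_res) / ss_tot."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    ss_tot = sum((y - my) ** 2 for y in ys)              # error using ybar
    ss_res = sum((y - (a + b * x)) ** 2                  # error using yhat
                 for x, y in zip(xs, ys))
    return a, b, (ss_tot - ss_res) / ss_tot

a, b, r2 = least_squares([1, 2, 3, 4], [2, 3, 5, 6])
print(round(a, 2), round(b, 2), round(r2, 2))  # 0.5 1.4 0.98
```

An r² of 0.98 means the prediction error using the regression line is 98% smaller than the error from always predicting the mean of y.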
ONE-WAY ANOVA: COMPARING SEVERAL MEANS
The inferential method for comparing means of several groups is called analysis of variance, also called ANOVA. Categorical explanatory variables in multiple regression and in ANOVA are referred to as factors, also known as independent variables. An ANOVA with only one independent variable is called a one-way ANOVA.
Evidence against the null hypothesis in an ANOVA test is stronger when the variability within each sample is smaller or when the variability between groups is larger. The formula for the F (ANOVA) test statistic is: F = between-groups estimate of variance / within-groups estimate of variance.
When the null hypothesis is true, the mean of the F-distribution is approximately 1. If the null hypothesis is wrong, then F > 1, and F also tends to increase as the sample size increases. The larger the F-statistic, the smaller the P-value. The F-distribution has two degrees of freedom values: df1 = g − 1 (with g the number of groups) and df2 = N − g (with N the total sample size).
The ANOVA test has five steps: assumptions, hypotheses (H0: all the population means are equal; Ha: at least two of the population means differ), test statistic, P-value and conclusion.
If the sample sizes are equal, the within-groups estimate of the variance is the mean of the g sample variances for the g groups: s² = (s1² + s2² + … + s_g²)/g.
If the sample sizes are equal, the between-groups estimate of the variance is: n Σ(x̄_i − x̄)²/(g − 1), where n is the common group size and x̄ is the overall sample mean.
The ANOVA F-test is robust to violations of the normality assumption if the sample size is large enough. If the group sample sizes are not equal, the F-test still works quite well as long as the largest group standard deviation is no more than about twice the smallest. A disadvantage of the F-test is that it tells us whether the groups differ, but not which groups differ.
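For equal-sized groups, the F statistic is the ratio of the between-groups variance estimate to the within-groups estimate; a minimal sketch (the three small groups are illustrative):

```python
def anova_f(groups):
    """One-way ANOVA F statistic, assuming equal-sized groups.
    within  = mean of the g sample variances
    between = n * sum((group mean - grand mean)^2) / (g - 1)"""
    g = len(groups)
    n = len(groups[0])  # common group size
    means = [sum(grp) / n for grp in groups]
    grand = sum(means) / g
    within = sum(sum((x - m) ** 2 for x in grp) / (n - 1)
                 for grp, m in zip(groups, means)) / g
    between = n * sum((m - grand) ** 2 for m in means) / (g - 1)
    return between / within

f = anova_f([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
print(round(f, 2))  # 7.0
```

An F of 7.0 is well above the value of about 1 expected under the null hypothesis, so the group means would not be considered plausibly equal here.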
ESTIMATING DIFFERENCES IN GROUPS FOR A SINGLE FACTOR
The F-test only tells us whether groups differ, not how much and which groups differ; confidence intervals can. A confidence interval for comparing two of the means uses the following formula: (x̄i − x̄j) ± t.025 × s√(1/ni + 1/nj), where s is the square root of the within-groups variance estimate.
The degrees of freedom for this confidence interval is df = N − g.
If the confidence interval does not contain 0, we can infer that the population means are different. Methods that control the probability that all confidence intervals will contain the true differences in means are called multiple comparison methods. Multiple comparison methods compare pairs of means with a confidence level that applies simultaneously to the entire set of comparisons, rather than to each comparison separately.
COMPARE TWO GROUPS BY RANKING
Nonparametric statistical methods are inferential methods that do not assume a particular form of distribution (e.g: the assumption of a normal distribution) for the population distribution. The Wilcoxon test is the best known nonparametric method. Nonparametric methods are useful when the data are ranked and when the assumption of normality is inappropriate.
The Wilcoxon test sets up a distribution using the probability of each difference of the mean rank. This test has five steps:
The sum of the ranks can also be used, instead of the mean of the ranks. When conducting the Wilcoxon test, a z-test can also be conducted if the sample is large enough. This z-test has the following formula:
A Wilcoxon test can also be conducted by converting quantitative observations to ranks. The Wilcoxon test is not affected by outliers (e.g: an extreme outlier simply gets the lowest or highest rank, no matter how far it lies from the next value). The difference between the population medians can also be used if the distribution is highly skewed, but this requires the extra assumption that the population distributions of the two groups have the same shape. The point estimate of the difference between two medians equals the median of the differences between the two groups. A sample proportion can also be used, by checking the proportion of observations in group one that are better than those in group two. If the proportion is 0.50, there is no effect; the closer the proportion gets to 0 or 1, the greater the difference between the two groups.
NONPARAMETRIC METHODS FOR SEVERAL GROUPS AND FOR MATCHED PAIRS
The test for comparing mean ranks of more than two groups is called the Kruskal-Wallis test. This test has five steps:
.....read moreUSING DATA TO ANSWER STATISTICAL QUESTIONS
The information we gather with experiments and surveys is collectively called data. Statistics is the art and science of learning from data. Statistical problem solving consists of four things:
The three main components of statistics for answering a statistical question are:
Probability is a framework for quantifying how likely various possible outcomes are.
SAMPLE VERSUS POPULATION
The entities that are measured in a study are called the subjects. This usually means people, but it can also be schools, countries or days. The population is the set of all the subjects of interest. In practice, we usually have data for only some of the subjects who belong to that population. These subjects are called a sample.
Descriptive statistics refers to methods for summarizing the collected data. The summaries usually consist of graphs and numbers such as averages and percentages. Inferential statistics are used when data are available from a sample only, but we want to make a decision or prediction about the entire population. Inferential statistics refers to methods of making decisions or predictions about a population, based on data obtained from a sample of that population.
A parameter is a numerical summary of the population. A statistic is a numerical summary of a sample taken from the population. The true parameter values are almost always unknown, thus we use sample statistics to estimate the parameter values.
A sample is random when everyone in the population has the same chance of being included in the sample. Random sampling allows us to make powerful inferences about populations. The margin of error is a measure of the expected variability from one random sample to the next random sample.
The formula for calculating the approximate margin of error is: . In this case, ‘n’ is the number of subjects.
DIFFERENT TYPES OF DATA
A variable is any characteristic observed in a study. The data values that we observe for a variable are called observations. A variable can be categorical and quantitative.
Key features to describe quantitative variables are the centre and the variability (spread) of the data (e.g: average amount of hours spent watching TV every day). Key feature to describe categorical variables is the relative number of observations in various categories. (e.g: the percentage of days in a year that it was sunny)
Quantitative variables can be discrete and continuous. A quantitative variable is discrete if its possible values form a set of separate numbers, such as 0, 1, 2, 3 (e.g: the number of pets in a household). A quantitative variable is continuous if its possible values form an interval, such as 0.16, 0,13, 2,32 (e.g: weight: 68,3 kg).
The distribution of a variable describes how the observations fall (are distributed) across the range of possible values. The modal category is the category with the largest frequency.
A frequency table is a listing of possible values for a variable, together with the number of observations for each value.
Category | A | B | C |
Frequency | 17 | 23 | 9 |
Proportion | 0.347 | 0.469 | 0.184 |
Percentage | 34.7% | 46.9% |
THE ASSOCIATION BETWEEN TWO CATEGORICAL VARIABLES
When analysing data the first step is to distinguish between the response variable and the explanatory variable. The response variable is the outcome variable on which comparisons are made. If the explanatory variable is categorical, it defines the groups to be compared with respect to values for the response variable. If the explanatory variable is quantitative, it defines the change in different numerical values to be compared with respect to values for the response variable. The explanatory variable should explain the response variable (e.g: survival status is a response variable and smoking status is the explanatory variable).
An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.
A contingency table is a display for two categorical variables. A conditional proportion is a proportion that is formed conditional on a given value of the other variable; it is always conditional on something, and it can also be expressed as a percentage. A proportion of the totals (e.g. the percentage of 'no' answers among all subjects) is called a marginal proportion.
An association between two variables is likely when there is a clear explanatory/response relationship; that relationship dictates which way we compute the conditional proportions. Conditional proportions are useful for determining whether there is an association. Two variables can also be independent of each other.
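A sketch of conditional and marginal proportions from a contingency table; the smoking/survival counts below are invented purely for illustration:

```python
# Rows: explanatory variable (smoking status); columns: response (survival status).
# All counts are made up for illustration.
table = {
    "smoker":     {"survived": 139, "died": 100},
    "non-smoker": {"survived": 230, "died": 120},
}

# Conditional proportions: proportion of survivors, conditional on smoking status.
for group, counts in table.items():
    row_total = sum(counts.values())
    print(f"P(survived | {group}) = {counts['survived'] / row_total:.3f}")

# Marginal proportion: proportion of survivors among all subjects, ignoring smoking.
grand_total = sum(sum(counts.values()) for counts in table.values())
marginal = sum(counts["survived"] for counts in table.values()) / grand_total
print(f"P(survived) = {marginal:.3f}")
```

If the conditional proportions differ clearly between the groups, that suggests an association between the two variables; if they are (nearly) equal, the variables may be independent.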
THE ASSOCIATION BETWEEN TWO QUANTITATIVE VARIABLES
We examine a scatterplot to study association. There is a difference between a positive association and a negative association. With a positive association, y tends to go up as x goes up. With a negative association, y tends to go down as x goes up.
Correlation describes the strength of the linear association. Correlation (r) summarizes the direction of the association between two quantitative variables and the strength of its linear trend. It can take a value between -1 and 1. A positive value for r indicates a positive association and a negative value for r indicates a negative association. The closer r is to 1, the closer the data points fall to a straight line and the stronger the linear association is. The closer r is to 0, the weaker the linear association is.
The properties of the correlation:
The correlation r can be calculated as follows:

r = (1 / (n − 1)) × Σ [ ((xᵢ − x̄) / sₓ) × ((yᵢ − ȳ) / s_y) ]

Here n is the number of points, x̄ and ȳ are the means of x and y, and sₓ and s_y are their standard deviations.
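Correlation can be computed directly from the sums of products and squared deviations; a minimal sketch without external libraries:

```python
import math

def correlation(xs, ys):
    """Pearson correlation r between two quantitative variables."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    sx = math.sqrt(sum((x - x_bar) ** 2 for x in xs))
    sy = math.sqrt(sum((y - y_bar) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfect positive linear trend gives r = 1.
print(round(correlation([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # 1.0
```

Because both variables are standardized by their own spread, r is unit-free and stays the same if x and y are swapped.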
HOW PROBABILITY QUANTIFIES RANDOMNESS
Probability is the way we quantify uncertainty. It measures the chances of the possible outcomes of random phenomena. A random phenomenon is an everyday occurrence for which the outcome is uncertain. With random phenomena, the proportion of times that something happens is highly variable in the short run, but very predictable in the long run. The law of large numbers states that as the number of trials increases, the proportion of occurrences of any outcome approaches a given number. The probability of a particular outcome is the proportion of times that the outcome would occur in a long run of observations.
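The law of large numbers can be illustrated with a quick coin-flip simulation (a sketch; the exact proportions depend on the random seed):

```python
import random

random.seed(42)  # fix the seed so the run is reproducible

# Flip a fair coin n times and report the proportion of heads.
for n in (10, 1_000, 100_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(f"n = {n:>7}: proportion of heads = {heads / n:.3f}")
```

In the short run (n = 10) the proportion can be far from ½, but as n grows it settles ever closer to the true probability of 0.5.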
Different trials of a random phenomenon are independent if the outcome of any one trial is not affected by the outcome of any other trial (e.g. if you have three children who are all boys, the chance that the next child is a girl is not higher; it is still ½).
In the subjective definition of probability, the probability is not based on objective data, but rather subjective information. The probability of an outcome is defined to be a personal probability. This is called Bayesian statistics.
FINDING PROBABILITIES
The sample space is the set of all possible outcomes (e.g. for the sex of a baby, the sample space is {boy, girl}). An event is a subset of the sample space: a particular outcome or a group of possible outcomes. When all outcomes are equally likely, the probability of an event is:

P(event) = (number of outcomes in the event) / (number of outcomes in the sample space)

For example, the probability of throwing a 6 with a fair die is 1/6, because exactly one of the six equally likely outcomes is in the event.
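For equally likely outcomes, computing a probability is just a counting exercise; a minimal sketch:

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # all outcomes of a fair die
event = {6}                        # the event "throwing a six"

# Probability = outcomes in the event / outcomes in the sample space.
probability = Fraction(len(event), len(sample_space))
print(probability)  # 1/6
```

Using `Fraction` keeps the answer exact (1/6) instead of a rounded decimal.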
The rest of the sample space for event A is called the complement of A. The complement of an event consists of all outcomes in the sample space that are not in the event.
Two events, A and B, are disjoint if they do not share any outcomes in common. The event that both A and B occur is called the intersection of A and B. The event that the outcome is in A or in B is called the union of A and B.
There are three general rules for calculating probabilities: the complement rule, P(not A) = 1 − P(A); the addition rule, P(A or B) = P(A) + P(B) − P(A and B); and the multiplication rule, P(A and B) = P(A) × P(B | A), which for independent events reduces to P(A) × P(B).
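Complement, intersection, and union can be sketched with sets of equally likely outcomes; the die and the events A and B below are our own examples:

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # a fair die
A = {2, 4, 6}                      # event: an even outcome
B = {4, 5, 6}                      # event: an outcome of at least 4

def prob(event):
    """Probability of an event with equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

print(prob(sample_space - A))  # complement of A: 1/2
print(prob(A & B))             # intersection of A and B: 1/3
print(prob(A | B))             # union of A and B: 2/3

# The addition rule: P(A or B) = P(A) + P(B) - P(A and B).
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)
```

Subtracting P(A and B) in the addition rule avoids double-counting the outcomes (here 4 and 6) that lie in both events.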
It is important to both produce and consume research. Being a good research consumer matters because, to effectively know something or to put a theory or treatment into use, it is imperative that the research consumer knows the evidence behind the evidence-based treatment. It is important to be able to decide how valuable and useful a piece of research really is.
Both research producers and research consumers share an interest in psychological phenomena, such as behaviour or emotion. They also both share a commitment to the practice of empiricism: to answer psychological questions with systematic observations.
The cupboard theory is the idea that young animals (including your dog) cling to the caregiver because the caregiver provides food. The contact comfort theory is the idea that young animals cling to the caregiver because the caregiver provides warmth and contact comfort. Both theories have been tested following the empirical cycle.
THE EMPIRICAL CYCLE
The empirical cycle always starts with an observation.
Induction -> Theory -> Deduction -> Prediction -> Testing -> Results -> Evaluation -> Observation -> Induction
Data are a set of observations. Depending on whether the data are consistent with hypotheses based on a theory, the data may either support or challenge that theory. The best theories are supported by data from studies, parsimonious, and falsifiable.
Basic research is used to enhance the general body of knowledge. Applied research is done with a practical problem in mind. Translational research is the dynamic bridge between basic and applied research. E.g. basic research studies schizophrenia itself, translational research is used to develop a new treatment for schizophrenia, and applied research is used to see how people diagnosed with schizophrenia can fit better into today's society.
EXPERIENCE
Experience is not a reliable source of information, because it has no comparison group. A comparison group in research is a group that is not exposed to the controlled independent variable, which makes it possible to determine whether the independent variable really has the effect people think it has.
E.g. doctors used to take blood from ill people because they believed it cured the illness. Some patients recovered, and the doctors concluded that the recovery was due to the bloodletting. This conclusion is based on experience: they had experienced that some patients recovered, but without a comparison group they had no way of knowing whether the recovery was really because of the bleeding. To establish that, they should have had a group of patients who were ill but were not bled, to see what would have happened.
When we use personal experience to determine whether something works, we do not have a comparison group either. "My knee feels better with this tape", but you don't know how it would have felt without the tape. Without a comparison group, it is not possible to give a conclusive answer based on empirical evidence.
In real-world situations, there are several possible explanations for an outcome. In research, these alternative explanations are called confounds. Experience is confounded, because you do not know the cause of an effect, although you might think you do. When you use tape to lessen the pain in your knee, you don't know whether the tape caused the pain to diminish. A researcher can see the situation from the outside, but you can only see one condition, and all you have is your experience.
Behavioural research is probabilistic. This means that its findings are not expected to explain all cases all the time. The conclusions of research are meant to explain a certain proportion of the cases. The two big problems with using experience as a source of information are that there is no comparison group and that experience is confounded.
INTUITION
People use their intuition to make decisions, although it is not a reliable source of information, because intuition is biased. There are several ways in which our intuition is biased: