Wednesday 29 October 2008

M ED 1.12 Reliability in Research Testing

Reliability

In statistics, reliability is the consistency of a set of measurements or of a measuring instrument, often used to describe a test. This can refer to whether repeated measurements with the same instrument give, or are likely to give, the same result (test-retest reliability), or, in the case of more subjective instruments such as personality or trait inventories, whether two independent assessors give similar scores (inter-rater reliability). Reliability is inversely related to random error.

Reliability does not imply validity. That is, a reliable measure is measuring something consistently, but not necessarily what it is supposed to be measuring. For example, while there are many reliable tests of specific abilities, not all of them would be valid for predicting, say, job performance. In terms of accuracy and precision, reliability is precision, while validity is accuracy.

In experimental sciences, reliability is the extent to which the measurements of a test remain consistent over repeated tests of the same subject under identical conditions. An experiment is reliable if it yields consistent results of the same measure. It is unreliable if repeated measurements give different results. It can also be interpreted as the lack of random error in measurement.[1]

In engineering, reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time. It is often reported in terms of a probability. Evaluations of reliability involve the use of many statistical tools. See Reliability engineering for further discussion.

An often-used example to elucidate the difference between reliability and validity in the experimental sciences is a common bathroom scale. If someone who weighs 200 lbs. steps on the scale 10 times and it reads "200" each time, then the measurement is both reliable and valid. If the scale consistently reads "150", then it is not valid, but it is still reliable because the measurement is very consistent. If the scale readings varied a lot around 200 (190, 205, 192, 209, etc.), then the scale could be considered valid on average but not reliable.

Estimation

Reliability may be estimated through a variety of methods that fall into two broad types: single-administration and multiple-administration. Multiple-administration methods require that two assessments be administered. In the test-retest method, reliability is estimated as the Pearson product-moment correlation coefficient between two administrations of the same measure. In the alternate-forms method, reliability is estimated as the Pearson product-moment correlation coefficient between two different forms of a measure, usually administered together. Single-administration methods include split-half and internal consistency. The split-half method treats the two halves of a measure as alternate forms. This "halves" reliability estimate is then stepped up to the full test length using the Spearman-Brown prediction formula. The most common internal consistency measure is Cronbach's alpha, which is usually interpreted as the mean of all possible split-half coefficients.[2] Cronbach's alpha is a generalization of an earlier form of estimating internal consistency, Kuder-Richardson Formula 20.[2]
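
As a rough illustration of the methods above (not part of the original text), the following sketch estimates test-retest reliability as a Pearson correlation and steps a split-half correlation up to full length with the Spearman-Brown formula. The score vectors are purely hypothetical.

  import numpy as np

  def pearson_r(x, y):
      """Pearson product-moment correlation between two score vectors."""
      return float(np.corrcoef(x, y)[0, 1])

  def spearman_brown(half_r):
      """Step a half-test reliability up to the full (doubled) test length."""
      return 2 * half_r / (1 + half_r)

  # Hypothetical data: six people tested twice, plus their odd- and even-half scores
  test1 = [12, 15, 11, 18, 14, 16]
  test2 = [13, 14, 12, 17, 15, 17]
  odd_half = [6, 8, 5, 9, 7, 8]
  even_half = [6, 7, 6, 9, 7, 8]

  print(pearson_r(test1, test2))                          # test-retest estimate
  print(spearman_brown(pearson_r(odd_half, even_half)))   # split-half, stepped up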

Each of these estimation methods is sensitive to different sources of error, so the estimates should not be expected to be equal. Also, reliability is a property of the scores of a measure rather than of the measure itself, and is thus said to be sample dependent. Reliability estimates from one sample might differ from those of a second sample (beyond what might be expected due to sampling variation) if the second sample is drawn from a different population, because the true reliability is different in this second population. (This is true of measures of all types--yardsticks might measure houses well yet have poor reliability when used to measure the lengths of insects.)

Reliability may be improved by clarity of expression (for written assessments), lengthening the measure,[2] and other informal means. However, formal psychometric analysis, called the item analysis, is considered the most effective way to increase reliability. This analysis consists of computation of item difficulties and item discrimination indices, the latter index involving computation of correlations between the items and sum of the item scores of the entire test. If items that are too difficult, too easy, and/or have near-zero or negative discrimination are replaced with better items, the reliability of the measure will increase.

In reliability engineering, the reliability function R(t) gives the probability that a system is still functioning at time t; for example:

  • R(t) = 1 − F(t), where F(t) is the cumulative distribution function of the time to failure.
  • R(t) = exp(−λt), where λ is a constant failure rate (the exponential model).
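
For the engineering formulas just listed, a minimal sketch (with a purely hypothetical constant failure rate) is:

  import math

  failure_rate = 1e-4          # hypothetical constant failure rate, per hour

  def reliability(t):
      """R(t) = exp(-lambda * t): probability of surviving to time t without failure."""
      return math.exp(-failure_rate * t)

  print(reliability(1000))     # ~0.905: about 90% of units still working after 1000 hours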

Classical test theory

In classical test theory, reliability is defined mathematically as the ratio of the variance of the true score to the variance of the observed score, or, equivalently, one minus the ratio of the variance of the error score to the variance of the observed score:

\rho_{xx'}=\frac{\sigma^2_T}{\sigma^2_X}=1-\frac{\sigma^2_E}{\sigma^2_X}

where \rho_{xx'} is the symbol for the reliability of the observed score, X; \sigma^2_X, \sigma^2_T, and \sigma^2_E are the variances of the observed, true, and error scores respectively. Unfortunately, there is no way to directly observe or calculate the true score, so a variety of methods are used to estimate the reliability of a test.
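
Since the true score cannot be observed, the definition is easiest to see in a simulation. The sketch below is an illustration only, not an estimation method: it generates hypothetical true scores and independent error, and checks that the variance ratio matches the "one minus error variance" form.

  import numpy as np

  rng = np.random.default_rng(0)
  true = rng.normal(100, 15, size=10_000)    # hypothetical true scores, sigma_T = 15
  error = rng.normal(0, 10, size=10_000)     # independent measurement error, sigma_E = 10
  observed = true + error

  print(true.var() / observed.var())         # ~ 15**2 / (15**2 + 10**2), about 0.69
  print(1 - error.var() / observed.var())    # the equivalent 1 - error-variance form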

Some examples of the methods to estimate reliability include test-retest reliability, internal consistency reliability, and parallel-test reliability. Each method comes at the problem of figuring out the source of error in the test somewhat differently.

Item response theory

It was well known to classical test theorists that measurement precision is not uniform across the scale of measurement. Tests tend to distinguish better among test-takers with moderate trait levels and worse among high- and low-scoring test-takers. Item response theory extends the concept of reliability from a single index to a function called the information function. The IRT information function is inversely related to the conditional standard error of measurement at any given test score. Higher levels of IRT information indicate higher precision and thus greater reliability.
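
As an illustration of the information function (the model choice below is an assumption, not something stated in the text), this sketch uses a two-parameter logistic model, for which item information is a^2 P(θ)(1 − P(θ)); test information is the sum over items, and the conditional standard error is 1/√information, so precision peaks at moderate trait levels.

  import math

  def p_2pl(theta, a, b):
      """Probability of a correct response under a two-parameter logistic (2PL) model."""
      return 1.0 / (1.0 + math.exp(-a * (theta - b)))

  def item_information(theta, a, b):
      p = p_2pl(theta, a, b)
      return a**2 * p * (1 - p)

  items = [(1.2, -1.0), (1.0, 0.0), (1.5, 1.0)]    # hypothetical (discrimination, difficulty)
  for theta in (-2, 0, 2):
      info = sum(item_information(theta, a, b) for a, b in items)
      print(theta, round(info, 3), round(1 / math.sqrt(info), 3))  # information and conditional SE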

References

  1. Rudner, L.M., & Shafer, W.D. (2001). Reliability. ERIC Digest. College Park, MD: ERIC Clearinghouse on Assessment and Evaluation. [1]
  2. Cortina, J.M., (1993). What Is Coefficient Alpha? An Examination of Theory and Applications. Journal of Applied Psychology, 78(1), 98-104.

Accuracy and precision

From Wikipedia, the free encyclopedia

In the fields of science, engineering, industry and statistics, accuracy is the degree of closeness of a measured or calculated quantity to its actual (true) value. Accuracy is closely related to precision, also called reproducibility or repeatability, the degree to which further measurements or calculations show the same or similar results. The results of calculations or a measurement can be accurate but not precise; precise but not accurate; neither; or both. A measurement system or computational method is called valid if it is both accurate and precise. The related terms are bias (non-random or directed effects caused by a factor or factors unrelated to the independent variable) and error (random variability), respectively.

Accuracy versus precision — the target analogy

[Target diagrams: one showing high accuracy but low precision, the other high precision but low accuracy.]

Accuracy is the degree of veracity while precision is the degree of reproducibility. The analogy used here to explain the difference between accuracy and precision is the target comparison. In this analogy, repeated measurements are compared to arrows that are fired at a target. Accuracy describes the closeness of the arrows to the bullseye at the target center. Arrows that strike closer to the bullseye are considered more accurate. The closer a system's measurements are to the accepted value, the more accurate the system is considered to be.

To continue the analogy, if a large number of arrows are fired, precision would be the size of the arrow cluster. (When only one arrow is fired, precision is the size of the cluster one would expect if this were repeated many times under the same conditions.) When all arrows are grouped tightly together, the cluster is considered precise since they all struck close to the same spot, if not necessarily near the bullseye. The measurements are precise, though not necessarily accurate.

However, it is not possible to reliably achieve accuracy in individual measurements without precision — if the arrows are not grouped close to one another, they cannot all be close to the bullseye. (Their average position might be an accurate estimation of the bullseye, but the individual arrows are inaccurate.) See also Circular error probable for application of precision to the science of ballistics.

Quantifying accuracy and precision

Ideally a measurement device is both accurate and precise, with measurements all close to and tightly clustered around the known value. The accuracy and precision of a measurement process is usually established by repeatedly measuring some traceable reference standard. Such standards are defined in the International System of Units and maintained by national standards organizations such as the National Institute of Standards and Technology.

In many cases precision can be characterised in terms of the standard deviation of the measurements, sometimes incorrectly called the measurement process's standard error. The smaller the standard deviation, the higher the precision. In some literature, precision is defined as the reciprocal of variance, while many others still confuse precision with the confidence interval. The interval defined by the standard deviation is the 68.3% ("one sigma") confidence interval of the measurements. If enough measurements have been made to accurately estimate the standard deviation of the process, and if the measurement process produces normally distributed errors, then 68.3% of the time the true value of the measured property will lie within one standard deviation of the measured value, 95.4% of the time within two standard deviations, and 99.7% of the time within three standard deviations.

This also applies when measurements are repeated and averaged. In that case, the term standard error is properly applied: the precision of the average is equal to the known standard deviation of the process divided by the square root of the number of measurements averaged. Further, the central limit theorem shows that the probability distribution of the averaged measurements will be closer to a normal distribution than that of individual measurements.
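
A minimal sketch of these two quantities, using made-up repeated readings:

  import math
  import statistics

  readings = [842.9, 843.1, 843.3, 842.8, 843.4]    # hypothetical repeated measurements (m)
  sd = statistics.stdev(readings)                   # precision of a single measurement
  sem = sd / math.sqrt(len(readings))               # standard error of the averaged value
  print(statistics.mean(readings), sd, sem)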

With regard to accuracy we can distinguish:

  • the difference between the mean of the measurements and the reference value, the bias. Establishing and correcting for bias is necessary for calibration.
  • the combined effect of that and precision.

A common convention in science and engineering is to express accuracy and/or precision implicitly by means of significant figures. Here, when not explicitly stated, the margin of error is understood to be one-half the value of the last significant place. For instance, a recording of 843.6 m, or 843.0 m, or 800.0 m would imply a margin of 0.05 m (the last significant place is the tenths place), while a recording of 8436 m would imply a margin of error of 0.5 m (the last significant digits are the units).

A reading of 8000 m, with trailing zeroes and no decimal point, is ambiguous; the trailing zeroes may or may not be intended as significant figures. To avoid this ambiguity, the number could be represented in scientific notation: '8.0 × 10^3 m' indicates that the first zero is significant (hence a margin of 50 m) while '8.000 × 10^3 m' indicates that all three zeroes are significant, giving a margin of 0.5 m. Similarly, it is possible to use a multiple of the basic measurement unit: '8.0 km' is equivalent to '8.0 × 10^3 m'. In fact, it indicates a margin of 0.05 km (50 m). However, reliance on this convention can lead to false precision errors when accepting data from sources that do not obey it.
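
The implied margin of error described above can be computed mechanically from the position of the last significant digit. The following sketch uses Python's decimal module to do so for the example readings, treating '8.0E+3' as the scientific-notation form of 8.0 × 10^3 m.

  from decimal import Decimal

  def implied_margin(reading):
      """Half the value of the last significant place of a reported reading."""
      exponent = Decimal(reading).as_tuple().exponent   # position of the last significant digit
      return Decimal(5) * Decimal(10) ** (exponent - 1)

  print(implied_margin("843.6"))     # 0.05
  print(implied_margin("8436"))      # 0.5
  print(implied_margin("8.0E+3"))    # 50   (8.0 x 10^3 m)
  print(implied_margin("8.000E+3"))  # 0.5  (8.000 x 10^3 m)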

Looking at this in another way, a value of 8 would mean that the measurement has been made with a precision of '1' (the measuring instrument was able to measure only up to 1's place) whereas a value of 8.0 (though mathematically equal to 8) would mean that the value at the first decimal place was measured and was found to be zero. (The measuring instrument was able to measure the first decimal place.) The second value is more precise. Neither of the measured values may be accurate (the actual value could be 9.5 but measured inaccurately as 8 in both instances). Thus, accuracy can be said to be the 'correctness' of a measurement, while precision could be identified as the ability to resolve smaller differences.

Precision is sometimes stratified into:

  • Repeatability — the variation arising when all efforts are made to keep conditions constant by using the same instrument and operator, and repeating during a short time period; and
  • Reproducibility — the variation arising using the same measurement process among different instruments and operators, and over longer time periods.

A common way to statistically measure precision is a Six Sigma tool called ANOVA Gage R&R. As stated before, you can be both accurate and precise. For instance, if all your arrows hit the bull's eye of the target, they are all both near the "true value" (accurate) and near one another (precise).

Accuracy in binary classification

"Accuracy" is also used as a statistical measure of how well a binary classification test correctly identifies or excludes a condition.

                           Condition (e.g. Disease)
                           As determined by "Gold" standard
                           True                  False
  Test      Positive       True Positive         False Positive     → Positive Predictive Value
  outcome   Negative       False Negative        True Negative      → Negative Predictive Value
                           Sensitivity           Specificity          Accuracy

That is, the accuracy is the proportion of true results (both true positives and true negatives) in the population. Unlike sensitivity and specificity, it depends on the prevalence of the condition in the population being tested.

\text{accuracy}=\frac{\text{number of true positives}+\text{number of true negatives}}{\text{number of true positives}+\text{number of false positives}+\text{number of false negatives}+\text{number of true negatives}}

An accuracy of 100% means that the test identifies all sick and well people correctly.

Also see Sensitivity and specificity.

Accuracy may be determined from Sensitivity and Specificity, provided Prevalence is known, using the equation:

accuracy = (sensitivity)(prevalence) + (specificity)(1 − prevalence)
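
Both the count-based formula and the prevalence-based formula above are one-liners; the sketch below checks that they agree on a small hypothetical screening table (90 true positives, 40 false positives, 10 false negatives, 860 true negatives).

  def accuracy_from_counts(tp, fp, fn, tn):
      return (tp + tn) / (tp + fp + fn + tn)

  def accuracy_from_rates(sensitivity, specificity, prevalence):
      return sensitivity * prevalence + specificity * (1 - prevalence)

  # Hypothetical screening results for 1000 people
  tp, fp, fn, tn = 90, 40, 10, 860
  sensitivity = tp / (tp + fn)          # 0.90
  specificity = tn / (tn + fp)          # ~0.956
  prevalence = (tp + fn) / 1000         # 0.10

  print(accuracy_from_counts(tp, fp, fn, tn))                       # 0.95
  print(accuracy_from_rates(sensitivity, specificity, prevalence))  # 0.95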

The accuracy paradox for predictive analytics states that predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy. It may be better to avoid the accuracy metric in favor of other metrics such as precision and recall.

Accuracy and precision in psychometrics

In psychometrics the terms accuracy and precision are used interchangeably with validity and reliability respectively. Validity of a measurement instrument or psychological test is established through experiment or correlation with behavior. Reliability is established with a variety of statistical techniques (classically Cronbach's alpha).

Bayesian inference

From Wikipedia, the free encyclopedia

Bayesian inference is statistical inference in which evidence or observations are used to update or to newly infer the probability that a hypothesis may be true. The name "Bayesian" comes from the frequent use of Bayes' theorem in the inference process. Bayes' theorem was derived from the work of the Reverend Thomas Bayes.[1]

Evidence and changing beliefs

Bayesian inference uses aspects of the scientific method, which involves collecting evidence that is meant to be consistent or inconsistent with a given hypothesis. As evidence accumulates, the degree of belief in a hypothesis ought to change. With enough evidence, it should become very high or very low. Thus, proponents of Bayesian inference say that it can be used to discriminate between conflicting hypotheses: hypotheses with very high support should be accepted as true and those with very low support should be rejected as false. However, detractors say that this inference method may be biased due to initial beliefs that one holds before any evidence is ever collected. (This is a form of inductive bias).

Bayesian inference uses a numerical estimate of the degree of belief in a hypothesis before evidence has been observed and calculates a numerical estimate of the degree of belief in the hypothesis after evidence has been observed. (This process is repeated when additional evidence is obtained.) Bayesian inference usually relies on degrees of belief, or subjective probabilities, in the induction process and does not necessarily claim to provide an objective method of induction. Nonetheless, some Bayesian statisticians believe probabilities can have an objective value and therefore Bayesian inference can provide an objective method of induction. See scientific method.

Bayes' theorem adjusts probabilities given new evidence in the following way:

P(H|E) = \frac{P(E|H)\;P(H)}{P(E)}

where

  • H represents a specific hypothesis, which may or may not be some null hypothesis.
  • P(H) is called the prior probability of H that was inferred before new evidence, E, became available.
  • P(E | H) is called the conditional probability of seeing the evidence E if the hypothesis H happens to be true. It is also called a likelihood function when it is considered as a function of H for fixed E.
  • P(E) is called the marginal probability of E: the a priori probability of witnessing the new evidence E under all possible hypotheses. It can be calculated as the sum, over any complete set of mutually exclusive hypotheses, of the product of each prior probability and the corresponding conditional probability: P(E) = \sum P(E|H_i)P(H_i).
  • P(H | E) is called the posterior probability of H given E.

The factor P(E | H) / P(E) represents the impact that the evidence has on the belief in the hypothesis. If it is likely that the evidence E would be observed when the hypothesis under consideration is true, but unlikely that E would have been the outcome of the observation, then this factor will be large. Multiplying the prior probability of the hypothesis by this factor would result in a larger posterior probability of the hypothesis given the evidence. Conversely, if it is unlikely that the evidence E would be observed if the hypothesis under consideration is true, but a priori likely that E would be observed, then the factor would reduce the posterior probability for H. Under Bayesian inference, Bayes' theorem therefore measures how much new evidence should alter a belief in a hypothesis.
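
A minimal sketch of this update rule for a finite set of mutually exclusive, exhaustive hypotheses (the function name and data layout are illustrative choices, not part of the original text):

  def bayes_update(priors, likelihoods):
      """Return posterior P(H_i | E) given priors P(H_i) and likelihoods P(E | H_i)."""
      marginal = sum(priors[h] * likelihoods[h] for h in priors)   # P(E)
      return {h: priors[h] * likelihoods[h] / marginal for h in priors}

  # Example: two hypotheses with equal priors but different likelihoods for the evidence
  print(bayes_update({"H1": 0.5, "H2": 0.5}, {"H1": 0.8, "H2": 0.2}))
  # {'H1': 0.8, 'H2': 0.2}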

Bayesian statisticians argue that even when people have very different prior subjective probabilities, new evidence from repeated observations will tend to bring their posterior subjective probabilities closer together. However, others argue that when people hold widely different prior subjective probabilities their posterior subjective probabilities may never converge even with repeated collection of evidence. These critics argue that worldviews which are completely different initially can remain completely different over time despite a large accumulation of evidence.[citation needed]

Multiplying the prior probability P(H) by the factor P(E | H) / P(E) will never yield a probability that is greater than 1, since P(E) is at least as great as P(E \cap H) (where \cap denotes "and"), which equals P(E|H)\,P(H) (see joint probability).

The probability of E given H, P(E | H), can be represented as a function of its second argument with its first argument held fixed. Such a function is called a likelihood function; it is a function of H alone, with E treated as a parameter. A ratio of two likelihood functions is called a likelihood ratio, Λ. For example,

\Lambda_E = \frac{L(H|E)}{L(\mathrm{not}\,H|E)} = \frac{P(E|H)}{P(E|\mathrm{not}\,H)} ,

where the dependence of ΛE on H is suppressed for simplicity (as E might have been, except we will need to use that parameter below).

Since H and not-H are mutually exclusive and span all possibilities, the sum previously given for the marginal probability reduces to P(E) = P(E|H)\,P(H)+P(E|\mathrm{not}\,H)\,P(\mathrm{not}\,H) . As a result, we can rewrite Bayes' theorem as

P(H|E) = \frac{P(E|H)\,P(H)}{P(E|H)\,P(H)+ P(E|\mathrm{not}\,H)\,P(\mathrm{not}\,H)} = \frac{\Lambda_E P(H)}{\Lambda_E P(H) +P(\mathrm{not}\,H)}.

We could then exploit the identity P(\mathrm{not}\,H) = 1 - P(H) to exhibit P(H | E) as a function of just P(H) (and ΛE, which is computed directly from the evidence).
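
In code, the likelihood-ratio form above amounts to a one-line update; this sketch simply transcribes the last equation.

  def posterior_from_likelihood_ratio(prior, likelihood_ratio):
      """P(H | E) from P(H) and the likelihood ratio Lambda_E = P(E|H) / P(E|not H)."""
      return likelihood_ratio * prior / (likelihood_ratio * prior + (1 - prior))

  print(posterior_from_likelihood_ratio(0.5, 1.5))   # 0.6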

With two independent pieces of evidence E1 and E2, Bayesian inference can be applied iteratively. We could use the first piece of evidence to calculate an initial posterior probability, and then use that posterior probability as a new prior probability to calculate a second posterior probability given the second piece of evidence. Bayes' theorem applied iteratively yields

P(H|E_1 \cap E_2) = \frac{P(E_2|H)\;P(E_1|H)\,P(H)}{P(E_2)\;P(E_1)}

Using likelihood ratios, we find that

P(H|E_1 \cap E_2) = \frac{\Lambda_1 \Lambda_2 P(H)}{[\Lambda_1 P(H) + P(\mathrm{not}\,H)]\;[\Lambda_2 P(H) + P(\mathrm{not}\,H)]}.

This iteration of Bayesian inference could be extended with more independent pieces of evidence.

Bayesian inference is used to calculate probabilities for decision making under uncertainty. Besides the probabilities, a loss function should be evaluated to take into account the relative impact of the alternatives.

Simple examples of Bayesian inference

From which bowl is the cookie?

To illustrate, suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?

Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let H1 correspond to bowl #1, and H2 to bowl #2. It is given that the bowls are identical from Fred's point of view, thus P(H1) = P(H2), and the two must add up to 1, so both are equal to 0.5. The event E is the observation of a plain cookie. From the contents of the bowls, we know that P(E | H1) = 30 / 40 = 0.75 and P(E | H2) = 20 / 40 = 0.5. Bayes' formula then yields

\begin{matrix} P(H_1|E) &=& \frac{P(E|H_1)\,P(H_1)}{P(E|H_1)\,P(H_1)\;+\;P(E|H_2)\,P(H_2)} \\  \\  \ & =& \frac{0.75 \times 0.5}{0.75 \times 0.5 + 0.5 \times 0.5} \\  \\  \ & =& 0.6 \end{matrix}

Before we observed the cookie, the probability we assigned for Fred having chosen bowl #1 was the prior probability, P(H1), which was 0.5. After observing the cookie, we must revise the probability to P(H1 | E), which is 0.6.
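
The arithmetic of the cookie example can be checked directly (a sketch; the numbers are those given above):

  p_h1 = p_h2 = 0.5                 # identical bowls, so equal priors
  p_plain_given_h1 = 30 / 40        # bowl #1: 30 plain out of 40 cookies
  p_plain_given_h2 = 20 / 40        # bowl #2: 20 plain out of 40 cookies

  posterior_h1 = (p_plain_given_h1 * p_h1) / (
      p_plain_given_h1 * p_h1 + p_plain_given_h2 * p_h2)
  print(posterior_h1)               # 0.6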

False positives in a medical test

False positives result when a test falsely or incorrectly reports a positive result. For example, a medical test for a disease may return a positive result indicating that the patient has the disease even when the patient does not have the disease. We can use Bayes' theorem to determine the probability that a positive result is in fact a false positive. We find that if a disease is rare, then the majority of positive results may be false positives, even if the test is accurate.

Suppose that a test for a disease generates the following results:

  • If a tested patient has the disease, the test returns a positive result 99% of the time, or with probability 0.99.
  • If a tested patient does not have the disease, the test returns a positive result 5% of the time, or with probability 0.05.

Naively, one might think that only 5% of positive test results are false, but that is quite wrong, as we shall see.

Suppose that only 0.1% of the population has that disease, so that a randomly selected patient has a 0.001 prior probability of having the disease.

We can use Bayes' theorem to calculate the probability that a positive test result is a false positive.

Let A represent the condition in which the patient has the disease, and B represent the evidence of a positive test result. Then, the probability that the patient actually has the disease given the positive test result is

\begin{matrix} P(A | B) &=& \frac{P(B | A) P(A)}{P(B | A)P(A) + P(B |\mathrm{not}\,A)P(\mathrm{not}\,A)} \\ \\   &= &\frac{0.99\times 0.001}{0.99 \times 0.001 + 0.05\times 0.999}  \\ ~\\ &\approx &0.019 .\end{matrix}

and hence the probability that a positive result is a false positive is about 1 − 0.019 = 0.98, or 98%.

Despite the apparent high accuracy of the test, the incidence of the disease is so low that the vast majority of patients who test positive do not have the disease. Nonetheless, the fraction of patients who test positive who do have the disease (.019) is 19 times the fraction of people who have not yet taken the test who have the disease (.001). Thus the test is not useless, and re-testing may improve the reliability of the result.

In order to reduce the problem of false positives, a test should be very accurate in reporting a negative result when the patient does not have the disease. If the test reported a negative result in patients without the disease with probability 0.999, then

P(A|B) = \frac{0.99\times 0.001}{0.99 \times 0.001 + 0.001\times 0.999} \approx 0.5 ,

so that 1 − 0.5 = 0.5 now is the probability of a false positive.

On the other hand, false negatives result when a test falsely or incorrectly reports a negative result. For example, a medical test for a disease may return a negative result indicating that the patient does not have the disease even though the patient actually has the disease. We can also use Bayes' theorem to calculate the probability of a false negative. In the first example above,

\begin{matrix} P(A |\mathrm{not}\,B) &=& \frac{P(\mathrm{not}\,B | A) P(A)}{P(\mathrm{not}\,B | A)P(A) + P(\mathrm{not}\,B |\mathrm{not}\,A)P(\mathrm{not}\,A)} \\ \\   &= &\frac{0.01\times 0.001}{0.01 \times 0.001 + 0.95\times 0.999}\, ,\\ ~\\ &\approx &0.0000105\, .\end{matrix}

The probability that a negative result is a false negative is about 0.0000105 or 0.00105%. When a disease is rare, false negatives will not be a major problem with the test.

But if 60% of the population had the disease, then the probability of a false negative would be greater. With the above test, the probability of a false negative would be

\begin{matrix} P(A |\mathrm{not}\,B) &=& \frac{P(\mathrm{not}\,B | A) P(A)}{P(\mathrm{not}\,B | A)P(A) + P(\mathrm{not}\,B |\mathrm{not}\,A)P(\mathrm{not}\,A)} \\ \\   &= &\frac{0.01\times 0.6}{0.01 \times 0.6 + 0.95\times 0.4}\, ,\\ ~\\ &\approx &0.0155\, .\end{matrix}

The probability that a negative result is a false negative rises to 0.0155 or 1.55%.
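
All of the figures in this worked example (the 0.019 posterior, the roughly 98% false-positive probability, and the two false-negative probabilities) come from the same formula, so a short sketch reproduces them:

  def posterior(p_e_given_h, p_e_given_not_h, prior):
      """P(H | E) by Bayes' theorem for a single binary hypothesis H."""
      return (p_e_given_h * prior) / (
          p_e_given_h * prior + p_e_given_not_h * (1 - prior))

  p_pos = posterior(0.99, 0.05, 0.001)   # P(disease | positive)      ~ 0.019
  print(p_pos, 1 - p_pos)                # false-positive probability ~ 0.98
  print(posterior(0.01, 0.95, 0.001))    # P(disease | negative)      ~ 0.0000105
  print(posterior(0.01, 0.95, 0.60))     # same test, 60% prevalence  ~ 0.0155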

In the courtroom

Bayesian inference can be used in a court setting by an individual juror to coherently accumulate the evidence for and against the guilt of the defendant, and to see whether, in totality, it meets their personal threshold for 'beyond a reasonable doubt'.

  • Let G denote the event that the defendant is guilty.
  • Let E denote the event that the defendant's DNA matches DNA found at the crime scene.
  • Let P(E | G) denote the probability of seeing event E if the defendant actually is guilty. (Usually this would be taken to be near unity.)
  • Let P(G | E) denote the probability that the defendant is guilty assuming the DNA match (event E).
  • Let P(G) denote the juror's personal estimate of the probability that the defendant is guilty, based on the evidence other than the DNA match. This could be based on his responses under questioning, or previously presented evidence.

Bayesian inference tells us that if we can assign a probability p(G) to the defendant's guilt before we take the DNA evidence into account, then we can revise this probability to the conditional probability P(G | E), since

P(G | E) = \frac{P(G) P(E | G)}{P(E)}.

Suppose, on the basis of other evidence, a juror decides that there is a 30% chance that the defendant is guilty. Suppose also that the forensic testimony was that the probability that a person chosen at random would have DNA that matched that at the crime scene is 1 in a million, or 10−6.

The event E can occur in two ways. Either the defendant is guilty (with prior probability 0.3) and thus his DNA is present with probability 1, or he is innocent (with prior probability 0.7) and he is unlucky enough to be one of the 1 in a million matching people.

Thus the juror could coherently revise his opinion to take into account the DNA evidence as follows:

P(G | E) = (0.3 \times 1.0) /(0.3 \times 1.0 + 0.7 \times 10^{-6}) = 0.99999766667.

The benefit of adopting a Bayesian approach is that it gives the juror a formal mechanism for combining the evidence presented. The approach can be applied successively to all the pieces of evidence presented in court, with the posterior from one stage becoming the prior for the next.

The juror would still have to have a prior estimate for the guilt probability before the first piece of evidence is considered. It has been suggested that this could reasonably be the guilt probability of a random person taken from the qualifying population. Thus, for a crime known to have been committed by an adult male living in a town containing 50,000 adult males, the appropriate initial prior probability might be 1/50,000.

[Figure: adding up evidence.]

For the purpose of explaining Bayes' theorem to jurors, it will usually be appropriate to give it in the form of betting odds rather than probabilities, as these are more widely understood. In this form Bayes' theorem states that

Posterior odds = prior odds × Bayes factor

In the example above, the juror who has a prior probability of 0.3 for the defendant being guilty would now express that in the form of odds of 3:7 in favour of the defendant being guilty, the Bayes factor is one million, and the resulting posterior odds are 3 million to 7 or about 429,000 to one in favour of guilt.
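
In the odds form, the juror's calculation above is just a multiplication; the following sketch reproduces the figures quoted (a prior probability of 0.3 and a one-in-a-million random-match probability).

  prior_prob = 0.30
  prior_odds = prior_prob / (1 - prior_prob)    # 3 : 7 in favour of guilt
  bayes_factor = 1.0 / 1e-6                     # P(match | guilty) / P(match | innocent)

  posterior_odds = prior_odds * bayes_factor    # ~428,571 to 1
  posterior_prob = posterior_odds / (1 + posterior_odds)
  print(posterior_odds, posterior_prob)         # ~428571.4, ~0.9999977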

A logarithmic approach which replaces multiplication with addition and reduces the range of the numbers involved might be easier for a jury to handle. This approach, developed by Alan Turing during World War II and later promoted by I. J. Good and E. T. Jaynes among others, amounts to the use of information entropy.

In the United Kingdom, Bayes' theorem was explained to the jury in the odds form by a statistician expert witness in the rape case of Regina versus Denis John Adams. A conviction was secured but the case went to Appeal, as no means of accumulating evidence had been provided for those jurors who did not want to use Bayes' theorem. The Court of Appeal upheld the conviction, but also gave their opinion that "To introduce Bayes' Theorem, or any similar method, into a criminal trial plunges the Jury into inappropriate and unnecessary realms of theory and complexity, deflecting them from their proper task." No further appeal was allowed and the issue of Bayesian assessment of forensic DNA data remains controversial.

Gardner-Medwin argues that the criterion on which a verdict in a criminal trial should be based is not the probability of guilt, but rather the probability of the evidence, given that the defendant is innocent (akin to a frequentist p-value). He argues that if the posterior probability of guilt is to be computed by Bayes' theorem, the prior probability of guilt must be known. This will depend on the incidence of the crime, which is an unusual piece of evidence to consider in a criminal trial. Consider the following three propositions:

A: The known facts and testimony could have arisen if the defendant is guilty,

B: The known facts and testimony could have arisen if the defendant is innocent,

C: The defendant is guilty.

Gardner-Medwin argues that the jury should believe both A and not-B in order to convict. A and not-B implies the truth of C, but the reverse is not true. It is possible that B and C are both true, but in this case he argues that a jury should acquit, even though they know that they will be letting some guilty people go free. See also Lindley's paradox.

Other court cases in which probabilistic arguments played some role were the Howland will forgery trial, the Sally Clark case, and the Lucia de Berk case.

Search theory

Main article: Bayesian search theory

In May 1968 the US nuclear submarine Scorpion (SSN-589) failed to arrive as expected at her home port of Norfolk, Virginia. The US Navy was convinced that the vessel had been lost off the Eastern seaboard but an extensive search failed to discover the wreck. The US Navy's deep water expert, John Craven USN, believed that it was elsewhere and he organised a search south west of the Azores based on a controversial approximate triangulation by hydrophones. He was allocated only a single ship, the Mizar, and he took advice from a firm of consultant mathematicians in order to maximise his resources. A Bayesian search methodology was adopted. Experienced submarine commanders were interviewed to construct hypotheses about what could have caused the loss of the Scorpion.

The sea area was divided up into grid squares and a probability assigned to each square, under each of the hypotheses, to give a number of probability grids, one for each hypothesis. These were then added together to produce an overall probability grid. The probability attached to each square was then the probability that the wreck was in that square. A second grid was constructed with probabilities that represented the probability of successfully finding the wreck if that square were to be searched and the wreck were to be actually there. This was a known function of water depth. The result of combining this grid with the previous grid is a grid which gives the probability of finding the wreck in each grid square of the sea if it were to be searched.

This sea grid was systematically searched in a manner which started with the high probability regions first and worked down to the low probability regions last. Each time a grid square was searched and found to be empty its probability was reassessed using Bayes' theorem. This then forced the probabilities of all the other grid squares to be reassessed (upwards), also by Bayes' theorem. The use of this approach was a major computational challenge for the time but it was eventually successful and the Scorpion was found about 740 kilometers southwest of the Azores in October of that year.

Suppose a grid square has a probability p of containing the wreck and that the probability of successfully detecting the wreck if it is there is q. If the square is searched and no wreck is found, then, by Bayes' theorem, the revised probability of the wreck being in the square is given by

  p' = \frac{p(1-q)}{(1-p)+p(1-q)}.
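
A small sketch of the search update (the grid, the detection probability and the searched square are all hypothetical): after an unsuccessful search of one square, that square's probability is multiplied by (1 − q) and the whole grid is renormalised, which is the formula above applied to every square at once.

  import numpy as np

  # Hypothetical 3x3 prior grid of P(wreck in square); it sums to 1
  p = np.array([[0.05, 0.10, 0.05],
                [0.10, 0.30, 0.10],
                [0.05, 0.20, 0.05]])
  q = 0.8                                 # chance of spotting the wreck if it is really there

  def search_and_miss(p, q, i, j):
      """Bayes update of the whole grid after square (i, j) is searched without success."""
      posterior = p.copy()
      posterior[i, j] *= (1 - q)          # the searched square is down-weighted
      return posterior / posterior.sum()  # every other square rises on renormalisation

  p = search_and_miss(p, q, 1, 1)         # search the most probable square first
  print(p.round(3))                       # centre square drops from 0.30 to about 0.079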

More mathematical examples

Naive Bayes classifier

See naive Bayes classifier.

Posterior distribution of the binomial parameter

In this example we consider the computation of the posterior distribution for the binomial parameter. This is the same problem considered by Bayes in Proposition 9 of his essay.

We are given m observed successes and n observed failures in a binomial experiment. The experiment may be tossing a coin, drawing a ball from an urn, or asking someone their opinion, among many other possibilities. What we know about the parameter (let's call it a) is stated as the prior distribution, p(a).

For a given value of a, the probability of m successes in m+n trials is

 p(m,n|a) = \begin{pmatrix} n+m \\ m \end{pmatrix} a^m (1-a)^n.

Since m and n are fixed, and a is unknown, this is a likelihood function for a. From the continuous form of the law of total probability we have

 p(a|m,n) = \frac{p(m,n|a)\,p(a)}{\int_0^1 p(m,n|a)\,p(a)\,da}      = \frac{\begin{pmatrix} n+m \\ m \end{pmatrix} a^m (1-a)^n\,p(a)}          {\int_0^1 \begin{pmatrix} n+m \\ m \end{pmatrix} a^m (1-a)^n\,p(a)\,da}.

For some special choices of the prior distribution p(a), the integral can be solved and the posterior takes a convenient form. In particular, if p(a) is a beta distribution with parameters m0 and n0, then the posterior is also a beta distribution with parameters m+m0 and n+n0.
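
A sketch of the conjugate update described above, using SciPy's beta distribution (the prior parameters and data are hypothetical; m0 = n0 = 1 gives the uniform prior of Bayes' billiard-table setup):

  from scipy.stats import beta

  m0, n0 = 1, 1          # Beta(1, 1) = uniform prior on the parameter a
  m, n = 7, 3            # hypothetical observed successes and failures

  posterior = beta(m + m0, n + n0)       # posterior is Beta(m + m0, n + n0)
  print(posterior.mean())                # (m + m0) / (m + m0 + n + n0) = 8/12, about 0.667
  print(posterior.interval(0.95))        # central 95% credible interval for a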

A conjugate prior is a prior distribution, such as the beta distribution in the above example, which has the property that the posterior is the same type of distribution.

What is "Bayesian" about Proposition 9 is that Bayes presented it as a probability for the parameter a. That is, not only can one compute probabilities for experimental outcomes, but also for the parameter which governs them, and the same algebra is used to make inferences of either kind. Interestingly, Bayes actually states his question in a way that might make the idea of assigning a probability distribution to a parameter palatable to a frequentist. He supposes that a billiard ball is thrown at random onto a billiard table, and that the probabilities p and q are the probabilities that subsequent billiard balls will fall above or below the first ball. By making the binomial parameter a depend on a random event, he cleverly escapes a philosophical quagmire that was an issue he most likely was not even aware of.

Computer applications

Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition techniques since the late 1950s. There is also an ever-growing connection between Bayesian methods and simulation-based Monte Carlo techniques, since complex models cannot be processed in closed form by a Bayesian analysis, while the graphical model structure inherent in statistical models may allow for efficient simulation algorithms such as Gibbs sampling and other Metropolis-Hastings schemes. Recently, Bayesian inference has gained popularity in the phylogenetics community for these reasons; applications such as BEAST, MrBayes and P4 allow many demographic and evolutionary parameters to be estimated simultaneously.

As applied to statistical classification, Bayesian inference has been used in recent years to develop algorithms for identifying unsolicited bulk e-mail spam. Applications which make use of Bayesian inference for spam filtering include DSPAM, Bogofilter, SpamAssassin, InBoxer, and Mozilla. Spam classification is treated in more detail in the article on the naive Bayes classifier.

In some applications fuzzy logic is an alternative to Bayesian inference. Fuzzy logic and Bayesian inference, however, are mathematically and semantically not compatible: You cannot, in general, understand the degree of truth in fuzzy logic as probability and vice versa.

References

  1. ^ Douglas Hubbard "How to Measure Anything: Finding the Value of Intangibles in Business" pg. 46, John Wiley & Sons, 2007
  • On-line textbook: Information Theory, Inference, and Learning Algorithms, by David MacKay, has chapters on Bayesian methods, including examples; arguments in favour of Bayesian methods (in the style of Edwin Jaynes); modern Monte Carlo methods, message-passing methods, and variational methods; and examples illustrating the connections between Bayesian inference and data compression.
  • Berger, J.O. (1999) Statistical Decision Theory and Bayesian Analysis. Second Edition. Springer Verlag, New York. ISBN 0-387-96098-8 and also ISBN 3-540-96098-8.
  • Bolstad, William M. (2004) Introduction to Bayesian Statistics, John Wiley ISBN 0-471-27020-2
  • Bretthorst, G. Larry, 1988, Bayesian Spectrum Analysis and Parameter Estimation in Lecture Notes in Statistics, 48, Springer-Verlag, New York, New York
  • Carlin, B.P. and Louis, T.A. (2008) Bayesian Methods for Data Analysis, Third Edition. Chapman & Hall/CRC. [1]
  • Dawid, A.P. and Mortera, J. (1996) Coherent analysis of forensic identification evidence. Journal of the Royal Statistical Society, Series B, 58,425-443.
  • Foreman, L.A; Smith, A.F.M. and Evett, I.W. (1997). Bayesian analysis of deoxyribonucleic acid profiling data in forensic identification applications (with discussion). Journal of the Royal Statistical Society, Series A, 160, 429-469.
  • Gardner-Medwin, A. What probability should the jury address?. Significance. Volume 2, Issue 1, March 2005
  • Gelman, A., Carlin, J., Stern, H., and Rubin, D.B. (2003). Bayesian Data Analysis. Second Edition. Chapman & Hall/CRC, Boca Raton, Florida. [2] ISBN 1-58488-388-X.
  • Gelman, A. and Meng, X.L. (2004). Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives: an essential journey with Donald Rubin's statistical family. John Wiley & Sons, Chichester, UK. ISBN 0-470-09043-X
  • Giffin, A. and Caticha, A. (2007) Updating Probabilities with Data and Moments
  • Jaynes, E.T. (1998) Probability Theory: The Logic of Science.
  • Lee, Peter M. Bayesian Statistics: An Introduction. Second Edition. (1997). ISBN 0-340-67785-6.
  • Loredo, Thomas J. (1992) "Promise of Bayesian Inference in Astrophysics" in Statistical Challenges in Modern Astronomy, ed. Feigelson & Babu.
  • O'Hagan, A. and Forster, J. (2003) Kendall's Advanced Theory of Statistics, Volume 2B: Bayesian Inference. Arnold, New York. ISBN 0-340-52922-9.
  • Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems, San Mateo, CA: Morgan Kaufmann.
  • Robert, C.P. (2001) The Bayesian Choice. Springer Verlag, New York.
  • Robertson, B. and Vignaux, G.A. (1995) Interpreting Evidence: Evaluating Forensic Science in the Courtroom. John Wiley and Sons. Chichester.
  • Winkler, Robert L, Introduction to Bayesian Inference and Decision, 2nd Edition (2003) Probabilistic. ISBN 0-9647938-4-9

Internal consistency

From Wikipedia, the free encyclopedia

In statistics and research, internal consistency is a measure based on the correlations between different items on the same test (or the same subscale on a larger test). It measures whether several items that propose to measure the same general construct produce similar scores. For example, if a respondent expressed agreement with the statements "I like to ride bicycles" and "I've enjoyed riding bicycles in the past", and disagreement with the statement "I hate bicycles", this would be indicative of good internal consistency of the test.

Internal consistency is usually measured with Cronbach's alpha, a statistic calculated from the pairwise correlations between items. See the article on Cronbach's alpha for extensive details and applications. Internal consistency as measured by Cronbach's alpha typically ranges between zero and one. A commonly accepted rule of thumb is that an α of 0.6-0.7 indicates acceptable reliability, and 0.8 or higher indicates good reliability. Very high reliabilities (0.95 or higher) are not necessarily desirable, as they may indicate that the items are entirely redundant. The goal in designing a reliable instrument is for scores on similar items to be related (internally consistent), but for each to contribute some unique information as well.

Cronbach's alpha

From Wikipedia, the free encyclopedia

Cronbach's α (alpha) is a statistic. It has an important use as a measure of the reliability of a psychometric instrument. It was first named alpha by Cronbach (1951), as he had intended to continue with further coefficients. It is an extension of an earlier version, the Kuder-Richardson Formula 20 (often shortened to KR-20), which is the equivalent for dichotomous items, and Guttman (1945) had derived the same quantity under the name lambda-3.

Definition

Cronbach's α is defined as

\alpha = { { {N} \over{N-1} } \left(1 - {{\sum_{i=1}^N \sigma^{2}_{Y_i}}\over{\sigma^{2}_{X}}}\right) }

where N is the number of components (items or testlets), \sigma^{2}_{X} is the variance of the observed total test scores, and \sigma^{2}_{Y_i} is the variance of component i.

Alternatively, the standardized Cronbach's α can also be defined as

\alpha = {N\cdot\bar c \over (\bar v + (N-1)\cdot\bar c)}

where N is the number of components (items or testlets), \bar v equals the average variance and \bar c is the average of all covariances between the components.
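
A direct transcription of the first formula above (a sketch; the respondent-by-item score matrix is made up):

  import numpy as np

  def cronbach_alpha(items):
      """items: 2-D array-like, rows = respondents, columns = components (items)."""
      items = np.asarray(items, dtype=float)
      n = items.shape[1]                                # N, the number of components
      item_variances = items.var(axis=0, ddof=1)        # sigma^2_{Y_i}
      total_variance = items.sum(axis=1).var(ddof=1)    # sigma^2_X of the summed score
      return (n / (n - 1)) * (1 - item_variances.sum() / total_variance)

  # Hypothetical scores: 5 respondents x 3 items
  scores = [[2, 3, 3],
            [4, 4, 5],
            [1, 2, 2],
            [3, 3, 4],
            [5, 4, 5]]
  print(round(cronbach_alpha(scores), 3))   # ~0.956 for this made-up data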

Cronbach's alpha and internal consistency

Cronbach's alpha will generally increase when the correlations between the items increase. For this reason the coefficient is also called the internal consistency or the internal consistency reliability of the test.

Cronbach's alpha in classical test theory

Alpha is an unbiased estimator of reliability if and only if the components are essentially τ-equivalent (Lord & Novick, 1968[1]). Under this condition the components can have different means and different variances, but their covariances should all be equal - which implies that they have 1 common factor in a factor analysis. One special case of essential τ-equivalence is that the components are parallel. Although the assumption of essential τ-equivalence may sometimes be met (at least approximately) by testlets, when applied to items it is probably never true. This is caused by the facts that (1) most test developers invariably include items with a range of difficulties (or stimuli that vary in their standing on the latent trait, in the case of personality, attitude or other non-cognitive instruments), and (2) the item scores are usually bounded from above and below. These circumstances make it unlikely that the items have a linear regression on a common factor. A factor analysis may then produce artificial factors that are related to the differential skewnesses of the components. When the assumption of essential τ-equivalence of the components is violated, alpha is not an unbiased estimator of reliability. Instead, it is a lower bound on reliability.

α can take values between negative infinity and 1 (although only positive values make sense). Some professionals, as a rule of thumb, require a reliability of 0.70 or higher (obtained on a substantial sample) before they will use an instrument. Obviously, this rule should be applied with caution when α has been computed from items that systematically violate its assumptions. Further, the appropriate degree of reliability depends upon the use of the instrument, e.g., an instrument designed to be used as part of a battery may be intentionally designed to be as short as possible (and thus somewhat less reliable). Other situations may require extremely precise measures (with very high reliabilities).

Cronbach's α is related conceptually to the Spearman-Brown prediction formula. Both arise from the basic classical test theory result that the reliability of test scores can be expressed as the ratio of the true score and total score (error and true score) variances:

\rho_{XX}= { {\sigma^2_T}\over{\sigma_X^2} }

Alpha is most appropriately used when the items measure different substantive areas within a single construct. Conversely, alpha (and other internal consistency estimates of reliability) are inappropriate for estimating the reliability of an intentionally heterogeneous instrument (such as a screening device like a biodata inventory or the original MMPI). Also, α can be artificially inflated by making scales which consist of superficial changes to the wording within a set of items or by analyzing speeded tests.

Cronbach's alpha in generalizability theory

Cronbach and others generalized some basic assumptions of classical test theory in their generalizability theory. If this theory is applied to test construction, then it is assumed that the items that constitute the test are a random sample from a larger universe of items. The expected score of a person in the universe is called the universe score, analogous to a true score. The generalizability is defined analogously as the variance of the universe scores divided by the variance of the observable scores, analogous to the concept of reliability in classical test theory. In this theory, Cronbach's alpha is an unbiased estimate of the generalizability. For this to be true the assumptions of essential τ-equivalence or parallelness are not needed. Consequently, Cronbach's alpha can be viewed as a measure of how well the sum score on the selected items captures the expected score in the entire domain, even if that domain is heterogeneous.

Cronbach's alpha and the intra-class correlation

Cronbach's alpha is equal to the stepped-up consistency version of the Intra-class correlation coefficient, which is commonly used in observational studies. This can be viewed as another application of generalizability theory, where the items are replaced by raters or observers who are randomly drawn from a population. Cronbach's alpha will then estimate how strongly the score obtained from the actual panel of raters correlates with the score that would have been obtained by another random sample of raters.

Cronbach's alpha and factor analysis

As stated in the section about its relation with classical test theory, Cronbach's alpha has a theoretical relation with factor analysis. There is also a more empirical relation: selecting items so that they optimize Cronbach's alpha will often result in a test that is homogeneous, in that the items (very roughly) satisfy a factor analysis with one common factor. The reason for this is that Cronbach's alpha increases with the average correlation between items, so optimizing it tends to select items that have correlations of similar size with most other items. It should be stressed that, although unidimensionality (i.e. fit to the one-factor model) is a necessary condition for alpha to be an unbiased estimator of reliability, the value of alpha is not related to the factorial homogeneity. The reason is that the value of alpha depends on the size of the average inter-item covariance, while unidimensionality depends on the pattern of the inter-item covariances.

Cronbach's alpha and other disciplines

Although this description of the use of α is given in terms of psychology, the statistic can be used in any discipline.

Construct creation

Combining two (or more) different variables with a high Cronbach's alpha into a construct for regression use is simple. Dividing each variable by its mean results in a percentage value for the respective case. After all variables have been re-calculated in percentage terms, they can easily be summed to create the new construct.

References

  1. ^ Lord, F. M. & Novick, M. R. (1968). Statistical theories of mental test scores. Reading MA: Addison-Wesley Publishing Company.
  • Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.
  • Allen, M.J., & Yen, W. M. (2002). Introduction to Measurement Theory. Long Grove, IL: Waveland Press.

Consistency

From Wikipedia, the free encyclopedia

In logic, a theory is consistent if it does not contain a contradiction. The lack of contradiction can be defined in either semantic or syntactic terms. The semantic definition states that a theory is consistent if it has a model; this is the sense used in traditional Aristotelian logic, although in contemporary mathematical logic the term satisfiable is used instead. The syntactic definition states that a theory is consistent if there is no formula P such that both P and its negation are provable from the axioms of the theory under its associated deductive system.

If these semantic and syntactic definitions are equivalent for a particular logic, the logic is complete. The completeness of sentential calculus was proved by Paul Bernays in 1918 and Emil Post in 1921, while the completeness of predicate calculus was proved by Kurt Gödel in 1930. Stronger logics, such as second-order logic, are not complete.

A consistency proof is a mathematical proof that a particular theory is consistent. The early development of mathematical proof theory was driven by the desire to provide finitary consistency proofs for all of mathematics as part of Hilbert's program. Hilbert's program was strongly impacted by Gödel's incompleteness theorems, which showed that sufficiently strong proof theories cannot prove their own consistency.

Although consistency can be proved by means of model theory, it is often done in a purely syntactical way, without any need to reference some model of the logic. The cut-elimination (or equivalently the normalization of the underlying calculus if there is one) implies the consistency of the calculus: since there is obviously no cut-free proof of falsity, there is no contradiction in general.

Consistency and completeness

The fundamental results relating consistency and completeness were proven by Kurt Gödel: the completeness theorem and the two incompleteness theorems.

By applying these ideas, we see that we can find first-order theories of the following four kinds:

  1. Inconsistent theories, which have no models;
  2. Theories which cannot talk about their own provability relation, such as Tarski's axiomatisation of point and line geometry, and Presburger arithmetic. Since these theories are satisfactorily described by the model we obtain from the completeness theorem, such systems are complete;
  3. Theories which can talk about their own consistency, and which include the negation of the sentence asserting their own consistency. Such theories are complete with respect to the model one obtains from the completeness theorem, but contain as a theorem the derivability of a contradiction, in contradiction to the fact that they are consistent;
  4. Essentially incomplete theories.

In addition, it has recently been discovered that there is a fifth class of theory, the self-verifying theories, which are strong enough to talk about their own provability relation, but are too weak to carry out Gödelian diagonalisation, and so which can consistently prove their own consistency. However as with any theory, a theory proving its own consistency provides us with no interesting information, since inconsistent theories also prove their own consistency.

Formulas

A set of formulas Φ in first-order logic is consistent (written ConΦ) if and only if there is no formula φ such that \Phi \vdash \phi and \Phi \vdash \lnot\phi. Otherwise Φ is inconsistent and is written IncΦ.

Φ is said to be simply consistent iff for no formula φ of Φ are both φ and the negation of φ theorems of Φ.

Φ is said to be absolutely consistent or Post consistent iff at least one formula of Φ is not a theorem of Φ.

Φ is said to be maximally consistent if and only if for every formula φ, if Con \Phi \cup \{\phi\} then \phi \in \Phi.

Φ is said to contain witnesses if and only if for every formula of the form \exists x \phi there exists a term t such that (\exists x \phi \to \phi {t \over x}) \in \Phi. See First-order logic.

Basic results

1. The following are equivalent:

(a) IncΦ

(b) For all \phi,\; \Phi \vdash \phi.

2. Every satisfiable set of formulas is consistent, where a set of formulas Φ is satisfiable if and only if there exists a model \mathfrak{I} such that \mathfrak{I} \vDash \Phi.

3. For all Φ and φ:

(a) if not  \Phi \vdash \phi, then Con \Phi \cup \{\lnot\phi\};

(b) if Con Φ and \Phi \vdash \phi, then Con \Phi \cup \{\phi\};

(c) if Con Φ, then Con \Phi \cup \{\phi\} or Con \Phi \cup \{\lnot \phi\}.

4. Let Φ be a maximally consistent set of formulas and contain witnesses. For all φ and ψ:

(a) if  \Phi \vdash \phi, then \phi \in \Phi,

(b) either \phi \in \Phi or \lnot \phi \in \Phi,

(c) (\phi \lor \psi) \in \Phi if and only if \phi \in \Phi or \psi \in \Phi,

(d) if (\phi \to \psi) \in \Phi and \phi \in \Phi, then \psi \in \Phi,

(e) \exists x \phi \in \Phi if and only if there is a term t such that \phi{t \over x} \in \Phi.

Henkin's theorem

Let Φ be a maximally consistent set of formulas containing witnesses.

Define a binary relation \sim on the set of S-terms by t_0 \sim t_1 if and only if t_0 = t_1 \in \Phi; let \overline{t} denote the equivalence class of terms containing t; and let T_{\Phi} := \{ \overline{t} \mid t \in T^S \}, where T^S is the set of terms based on the symbol set S.

Define the S-structure \mathfrak{T}_{\Phi} over T_{\Phi}, the term-structure corresponding to Φ, by:

(1) For n-ary R \in S, R^{\mathfrak{T}_{\Phi}} \overline{t_0} \ldots \overline{t_{n-1}} if and only if R t_0 \ldots t_{n-1} \in \Phi,

(2) For n-ary f \in S, f^{\mathfrak{T}_{\Phi}} (\overline{t_0} \ldots \overline{t_{n-1}}) := \overline{f t_0 \ldots t_{n-1}},

(3) For c \in S, c^{\mathfrak{T}_{\Phi}} := \overline{c}.

Let \mathfrak{I}_{\Phi} := (\mathfrak{T}_{\Phi}, \beta_{\Phi}) be the term interpretation associated with Φ, where \beta_{\Phi}(x) := \overline{x}.

(*) For all φ, \mathfrak{I}_{\Phi} \vDash \phi if and only if \phi \in \Phi.

Sketch of proof

There are several things to verify. First, that \sim is an equivalence relation. Then, it needs to be verified that (1), (2), and (3) are well defined. This falls out of the fact that \sim is an equivalence relation and also requires a proof that (1) and (2) are independent of the choice of t_0, \ldots, t_{n-1} class representatives. Finally, \mathfrak{I}_{\Phi} \vDash \Phi can be verified by induction on formulas.
