Little Paul (that is me when I was a kid) listened to my father preach and teach many passages from the Bible. One passage that made a lasting impression on me was on King Solomon and his wise judgement when two women came to him, each claiming the other had stolen her child (I Kings 3:16-28). Being a busy potentate and one not all that interested in hearing the two mothers bicker back and forth all day, King Solomon suggested cutting the baby in half and letting them each take a half. Lo and behold, one woman saw this as a reasonable solution while the other was mortified at the suggestion. King Solomon's verdict: the true mother is the one whose immediate response is to protect the baby. The King was a wise judge; he elicited disparate responses from the two women and ruled based on a good hunch of how a true mother behaves. There is a very small probability that he awarded the baby to the non-biological mother (but even with such an erroneous judgement he would have given the baby to a woman who wanted to protect it and not cut it in half).
This case study in primitive law gets at the center of Fisher's p-value. Fisher, working largely in English agriculture, used probabilities (i.e., p-values) to calculate how likely disparate responses would be among different crop samples grown under separate conditions. To do this he had to precede the statistical analysis with assumptions (or hunches) in order to draw conclusions (or judgments). For Fisher, the foundation was not math, but a philosophical commitment to a body of previously learned knowledge that led to a priori presuppositions (beliefs held prior to investigation, through which results are interpreted) and a posteriori inductions. Below I will touch lightly on the mathematical reasoning behind these commitments, which are described by two terms: level of significance (the a priori commitment) and p-value (the a posteriori commitment).
Level of significance is a statistical term that reflects the probability of drawing one particular kind of false conclusion: deciding that groups differ when they do not. The closer the number is to zero, the lower the probability of drawing that false conclusion; the closer the number is to one, the higher the probability. A 0.05 level of significance indicates that there is a 5% chance of observing a difference between two groups when in reality no difference exists (referred to as a type I error); a 0.10 level of significance means a 10% chance of a type I error, etc. This value can also be thought of as the probability of a false positive, as demonstrated by the 2 x 2 contingency table below of hypothetical results versus hypothetical reality:
| | Hypothetical Reality #1: Drug A = Drug B | Hypothetical Reality #2: Drug A ≠ Drug B |
|---|---|---|
| Hypothetical Result #1: Drug A ≠ Drug B | False Positive (This is termed “Level of Significance” and is synonymous with a Type I error: observing a difference when no difference truly exists.) | True Positive (This is termed “Power” because it is the power to find a true difference.) |
| Hypothetical Result #2: Drug A = Drug B | True Negative (This is termed “Level of Confidence”.) | False Negative (This is the definition of a Type II error: failing to find a difference when one truly exists.) |
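To make the false-positive cell of the table concrete, here is a minimal simulation sketch. It assumes Hypothetical Reality #1 is true (both groups are drawn from the very same distribution) and counts how often a simple z-test, of the kind walked through later in this post, still declares a difference at the 0.05 level. The function name, the group size, the number of trials, and the standard deviation of 10 are all made-up values for illustration, not figures from any real study.

```python
import random
from statistics import NormalDist


def simulate_false_positive_rate(n_trials=2000, n_per_group=50,
                                 alpha=0.05, sigma=10.0, seed=0):
    """Estimate how often 'no true difference' is wrongly called a difference.

    All parameters are hypothetical illustration values.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of the difference between two group means,
    # assuming both groups share the same known standard deviation.
    std_error = sigma * (2.0 / n_per_group) ** 0.5

    false_positives = 0
    for _ in range(n_trials):
        # Both groups come from the SAME distribution: Drug A = Drug B.
        group_a = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        observed_diff = sum(group_a) / n_per_group - sum(group_b) / n_per_group
        z = observed_diff / std_error
        # Two-sided p-value: area in both tails beyond |z|.
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
        if p_value < alpha:
            false_positives += 1
    return false_positives / n_trials


rate = simulate_false_positive_rate()
print(f"Share of trials wrongly declared significant: {rate:.3f}")
```

Under these assumptions the printed share should land near the chosen level of significance, which is the sense in which 0.05 is a false-positive rate.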
Note that this is all hypothetical. The researcher sets the level of significance prior to the experiment; typically it is set at 0.05. Then the researcher calculates a test statistic and, from it, a probability called a p-value. Any analysis that yields a p-value of less than 0.05 is considered significant and adds to the evidence that a true difference exists between groups. The p-value can only be obtained by assuming that the condition in the first column of the table (Hypothetical Reality #1, no true difference) holds. That is, a p-value is calculated based on the assumption that in truth there is no difference between groups who have received different treatments.
In other words, if two groups are treated with different diuretic drugs for high blood pressure, the researcher has a hunch that one drug is better than the other. To weigh the evidence that his drug is superior to the other, he either must have access to the truth (which is not available in this kind of research design) or he must assume that there is no difference between the drugs and then test the observed difference against that assumption of no difference. So if he observes that his drug lowers blood pressure 10 points in the experimental group and the other drug lowers blood pressure only 5 points in the control group, then the difference between the two drugs would be five (that is: 10 - 5 = 5). Assuming that there truly is no difference (the first column of the above contingency table), he then subtracts the assumed difference (zero) from the observed difference (five) and divides this number by the standard error of the difference between the two group means. Remember, standard error reflects the standard deviation of the experimental/control differences that would be observed if the experiment were repeated with random sampling from the population a large number of times.
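Written out, the calculation just described looks roughly like the following. The symbols are my own shorthand and not standard to any one textbook: the two x-bars stand for the mean drop in blood pressure in the experimental and control groups, the zero is the assumed difference, and SE is the standard error of the difference.

```latex
z \;=\; \frac{(\bar{x}_{\text{exp}} - \bar{x}_{\text{ctrl}}) - 0}{SE}
  \;=\; \frac{(10 - 5) - 0}{SE}
  \;=\; \frac{5}{SE}
```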
If the observed difference (5 in our example) is much larger than the standard error (let's say the standard error is 1 in our example), then the statistic will be large. The statistic can then be used to calculate the probability of observing a difference of 5 or more when the hypothetical reality of no difference is true. This is done by finding the test value on a normal bell curve and determining the area under the curve beyond that value, out in the tail away from the center (for a two-sided test, the matching area in the opposite tail is added as well). This area is then divided by the entire area under the curve, which for a standard normal curve is one. The result is the probability of finding a difference at least this large between the two groups, assuming that in truth there is no difference.
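For readers who would rather see the arithmetic than read it, here is a small sketch of the example using only the Python standard library. It assumes the standard normal curve is the right reference curve (with small samples a t-distribution would usually be used instead), and the function names are hypothetical ones chosen for this post.

```python
from statistics import NormalDist


def z_statistic(observed_diff, std_error, assumed_diff=0.0):
    """(observed difference - assumed difference) / standard error."""
    return (observed_diff - assumed_diff) / std_error


def p_values(z):
    """Return (one-sided, two-sided) tail areas under the standard normal curve."""
    std_normal = NormalDist()  # mean 0, standard deviation 1, total area 1
    # Area in the upper tail beyond z: 'his drug is better' in one direction.
    one_sided = 1.0 - std_normal.cdf(z)
    # Area in both tails beyond |z|: a difference in either direction.
    two_sided = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    return one_sided, two_sided


# Numbers from the example: difference of 5 points, standard error of 1.
z = z_statistic(observed_diff=5.0, std_error=1.0)
one_sided, two_sided = p_values(z)
print(f"z = {z}, one-sided p = {one_sided:.2e}, two-sided p = {two_sided:.2e}")
```

With a statistic of 5, both tail areas come out on the order of one in a million or smaller, far below a 0.05 level of significance.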
If any of you are still reading at this point, no doubt you are waiting for the point. Bottom line: the p-value, when smaller than the pre-set level of significance, gives the researcher evidence that drug A truly is more effective than drug B. This is not a mathematical proof of the truth (only because the true situation is unknown), but rather mathematical support (or evidence) for one drug being better than the other. There may be Bayesian tricks to guess at the true situation (termed prior probability) and then compare the results to this prior probability; however, even prior probability is based on an a priori determination of values and so does not escape some personal judgement in the analysis. Fisher's p-value adds evidence one study at a time to slowly, over time, produce a reliable body of knowledge. No one study creates definitive knowledge (diagnostic studies excluded). The best stewards of science recognize the valid role of personal judgement in arriving at true knowledge (see Michael Polanyi).