Talk:P-value/Archive 2

Lead paragraph

The lead paragraph is unintelligible. drh (talk) 17:40, 13 April 2015 (UTC)

How about starting with something along these lines? In statistics, a p-value is a measure of how likely an observed result is to have occurred by random chance alone (under certain assumptions). Thus low p values indicate unlikely results that suggest a need for further explanations (beyond random chance). Tayste (edits) 23:22, 13 April 2015 (UTC)

The problem with that definition is that it is wrong. — Preceding unsigned comment added by 108.184.177.134 (talk) 15:50, 11 May 2015 (UTC)

How is it wrong? I'm a statistics layperson, so I could be missing something, but it's not clear what the objection is. When P < .05, that means that there's less than a 5% chance that the null hypothesis is true. Which means there's less than a 5% chance that the result would have happened anyway regardless of the correlations that the experimenters think/hope/believe/suspect are involved. Please explain the allegation of wrongness. Quercus solaris (talk) 21:55, 12 May 2015 (UTC)
The "probability that significance will occur when the null hypothesis is true" is not the same thing as the "probability that the null hypothesis is true when significance has occurred," just as the probability of someone dying if they are struck by lightning (presumably quite high) is not the same as the probability that someone was struck by lightning if they are dead (presumably quite low). More generally, the probability of A given B is not the same as the probability of B given A. Consider a null hypothesis that is almost certainly true, such as "The color of my underwear on a given day does not affect the temperature on the surface of Mars." If you do a study to test that null hypothesis, 1 in 20 times the p-value will be less than .05. When that happens, by your logic we should conclude that there is a greater than 95% chance that the color of your underwear affects the temperature on the surface of Mars! I hope that answers your question. — Preceding unsigned comment added by 108.184.177.134 (talk) 14:35, 13 May 2015 (UTC)
I like the approach of thinking critically about a null hypothesis that is almost certainly true (more like "damn well has to be true"). But the way you frame it does not match how P is usually used in journal articles nor what was described at the start of the thread. P is never "the probability of significance". P is the probability of the null hypothesis being true. (Full stop.) Regarding "the probability that the null hypothesis is true when significance has occurred," that value is, by definition of significance, always less than the significance level (which is usually 5%). Significance is a yes/no (significant or not) dichotomy with the threshold set at a value of P (at heart an arbitrary value, although a sensibly chosen one). Therefore, when you mention "a greater than 95% chance that the color of your underwear affects the temperature on the surface of Mars", which means "a greater than 95% chance that the null hypothesis is false", it has nothing to do with any logic that I or User:Tayste mentioned. When a journal article reporting a medical RCT says that "significance was set at P < .05", what it is saying is that "a result was considered significant only if there was a less than 5% chance of it occurring regardless of the treatment." So your thought experiment doesn't gibe with how P is normally used, but I like your critical approach. Quercus solaris (talk) 01:03, 14 May 2015 (UTC)

The statement that "P is the probability of the null hypothesis being true" is unequivocally incorrect (though it may be a common misconception). If any journal article interprets p-values that way, that is their error. You stated that my reference to "a greater than 95% chance that the null hypothesis is false" "has nothing to do with any logic that I or User:Tayste mentioned." On the contrary, you referred to "a 5% chance that the null hypothesis is true," which is exactly equivalent to "a 95% chance that the null hypothesis is false," just as a 95% chance of it raining is equivalent to a 5% chance of it not raining. Or more generally, P(A) = 1 - P(not A). These are very elementary concepts of probability that should be understood before attempting to debate about the meaning of p-values. — Preceding unsigned comment added by 99.47.244.244 (talk) 21:25, 14 May 2015 (UTC)

I'm sorry, I see now how I was wrong and I see that your underlying math/logic points are correct, even though I still think the words they were phrased in, particularly how significance is mentioned, are not saying it right (correct intent but wording not right). Even though I was correct in the portion "Regarding "the probability that the null hypothesis is true when significance has occurred," that value is, by definition of significance, always less than the significance level (which is usually 5%). Significance is a yes/no (significant or not) dichotomy with the threshold set at a value of P (at heart an arbitrary value, although a sensibly chosen one)."—even though that portion is correct, I was wrong elsewhere. I see that the American Medical Association's Glossary of Statistical Terms says that P is the "probability of obtaining the observed data (or data that are more extreme) if the null hypothesis were exactly true.44(p206)" So one can say: "I choose 5% as my threshold. Assume the null hypothesis *is* exactly true. Then P = .05 means that the probability of getting this data distribution is 5%." But, as you correctly pointed out with your "1 in 20 times", if you run the experiment 100 times, you should expect to get that data distribution around 5 times anyway. So you can't look at one of those runs and find out anything about the null hypothesis's truth. But my brain was jumbling it into something like "If your data isn't junk (i.e., if your methods were valid) and you got an extreme-ish distribution, the observed distribution (observed = did happen) had only a 5% chance of having that much extremeness if the null hypothesis were true, so the fact that it did have it means that you can be 95% sure that the null hypothesis is false." It's weird, I'm pretty sure a lot of laypeople jumble it that way, but now I am seeing how wrong it is. Wondering why it is common if it is wrong, I think the biggest factor involved is that many experiments only have one run (often they *can* only have one run), and people forget about the "1 in 20 runs" idea. They fixate on looking at that one run's data, and thinking—what? that a lot of meaning can be found there, when it's really not much? I don't know—already out of time to ponder. Sorry to have wasted your time on this. I do regret being remedial and I regret that only a sliver of the populace has a firm and robust grasp of statistics. Most of us, even if we try to teach ourselves by reading about it, quickly reach a point where it might as well be a mad scientist's squiggles on a chalkboard—we get lost in the math formulae. Even I, despite having a high IQ and doing fine in math in K-12, couldn't pass an algebra test anymore—too many years without studying. You were right that I shouldn't even be trying to debate the topic——but the portions where I was right made me feel the need to pursue it, to figure out what's what. Quercus solaris (talk) 22:53, 14 May 2015 (UTC)
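A quick way to make the "1 in 20 runs" point concrete is to simulate it. The sketch below is only illustrative (the group sizes, seed, and choice of a two-sample t-test are arbitrary assumptions, not taken from anything above): both groups are drawn from the same distribution, so the null hypothesis is true by construction, yet about 5% of the runs still come out "significant."

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 100_000
n_per_group = 30
significant = 0
for _ in range(n_experiments):
    a = rng.normal(0.0, 1.0, n_per_group)  # both samples come from the same N(0, 1),
    b = rng.normal(0.0, 1.0, n_per_group)  # so the null hypothesis is true by construction
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        significant += 1

print(significant / n_experiments)  # close to 0.05: about 1 run in 20 is "significant" anyway
```

The roughly 5% of false rejections is the Type I error rate under a true null; it says nothing by itself about the probability that the null hypothesis is true in any particular run.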

P value vs P-value vs p-value vs p value

I am avoiding italics here (because I can't be bothered checking how to render them). I know "P value" varies depending on style guide etc., but I think that we should at least be able to agree that, unless used as a modifier ("the p-value style"), a hyphen is unnecessary. Can we make this change? Pretty please. 36.224.218.156 (talk) 07:36, 7 May 2015 (UTC)

Just to clarify, there are two relevant topics here—one is about establishing a style point in Wikipedia's own style guide (WP:MOS), whereas the other is about this article's encyclopedic coverage about how the term is styled by many other style guides that exist. As for the latter, the coverage is already correct as-is. As for the former, no objection to your preference—the place to get started in proposing it is Wikipedia talk:Manual of Style. From there, it may be handled either at that page or at Wikipedia talk:WikiProject Statistics, depending on who gets involved and how much they care about the styling. Quercus solaris (talk) 00:00, 8 May 2015 (UTC)
Can you do it? Pretty please. 27.246.138.86 (talk) 14:35, 16 May 2015 (UTC)
If inspiration strikes. I lack time to participate in WP:MOS generally, and the people who do participate usually end up with good styling decisions, so it doesn't become necessary to me to get involved. The hyphen in this instance doesn't irritate me (although I would not choose it myself), so I may not find the time to pursue it. But if anyone chooses to bother, I would give a support vote. Quercus solaris (talk) 14:58, 16 May 2015 (UTC)

Original research claim regarding the examples

I do not agree that the examples should be removed. They are examples, so cannot be original research. And they are very useful to illustrate the concept. I am grateful to the person who added them. I do not think that every illustration of a mathematical concept has to be justified by a reference to a textbook where it appears. — Preceding unsigned comment added by 143.210.192.165 (talk) 09:55, 24 September 2015 (UTC)

I was planning to write just about what 143.210.192.165 wrote above. That is, that they are examples to make it easier to understand, but are not original research. They don't try to learn anything that wasn't known, but only make it easier for others to understand what is known. As many parts of statistics are counterintuitive, these examples can be very useful. Gah4 (talk) 00:47, 12 December 2015 (UTC)

The misunderstandings section is redundant and confusing

I don't think we need three different statements to say "the p-value is not the probability that the null hypothesis is true," "the p-value is not the probability that the alternative hypothesis is false," and "the p-value is not the probability that a finding is just a fluke." When a p-value is significant, all those statements are essentially the same. Some of the wording is also very unclear and Bayesian statistics are invoked for seemingly no reason (you don't need Bayes to explain why p-values aren't the probability that a finding is a fluke). I suggest simplifying the section if not scrapping it entirely. — Preceding unsigned comment added by 23.242.207.48 (talk) 07:24, 24 September 2015 (UTC)

P.S. Speaking of redundancy, the phrase "sure certainty" surely and for certain needs to go. For sure. — Preceding unsigned comment added by 23.242.207.48 (talk) 07:28, 24 September 2015 (UTC)

I think it should be kept. It is hard to understand, which is why the examples of misunderstanding are useful. A little redundancy helps drive home the point. The subject is confusing, which is the reason the section is there. Gah4 (talk) 00:50, 12 December 2015 (UTC)

As clear as mud!

To the non-statistician, the terminology ('disprove the null hypothesis') is unclear. The article as it stands (December 2015) is written in the language and demeanor of a statistics text. It would be improved with ... examples ... plain-English explanations... — Preceding unsigned comment added by 193.34.187.245 (talk) 13:48, 14 December 2015 (UTC)

Please correct article!

Yes, the article is not didactic (!!) for the non-statistician; it shows an absence of focus on the needs of the ordinary reader, and without better structure and simplification it is not encyclopedic.

The lead section needs a basic summary (and these three assertions can be reused throughout the article). EXAMPLE:

  • A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis; the null hypothesis is rejected, which is the good outcome.
  • A large p-value (> 0.05) indicates weak evidence against the null hypothesis; we fail to reject the null hypothesis, which is the bad outcome.
  • p-values very close to the cutoff (~ 0.05) are considered to be marginal (could go either way)...

Statistical analysis must always report the p-value, so readers can draw their own conclusions.

--Krauss (talk) 06:18, 30 January 2016 (UTC)

Possible misconceptions article

There is a discussion at Talk:P-value_fallacy#Statement_from_the_American_Statistical_Association about turning that article (P-value fallacy) into a more general article about misconceptions around p-values, which clearly impacts on the article here. Input welcome. Bondegezou (talk) 18:08, 9 March 2016 (UTC)

I just moved the article, since Bondegezou appears to be the only editor who's objecting. Please join us at Misunderstandings of p-values, because there's a lot to write about. I didn't know about the comment here yet (I came to make a notice myself), but if necessary the article can just be merged into this one. I will update the merge tags accordingly. Sunrise (talk) 00:41, 11 March 2016 (UTC)

Shortened lede

I have shortened the lede by the simple device of farming it out to an Overview and controversy section, but my edit has introduced some redundancy. I would value it if another editor were to fix this without recreating the problem I solved. — Charles Stewart (talk) 12:16, 13 April 2016 (UTC)

Laplace's use of p-values not properly cited

The citation (a paper by Stigler) does not mention Laplace at all, only Fisher. Stigler also wrote a book about the history of the measurement of uncertainty before 1900; perhaps that is the intended citation. — Preceding unsigned comment added by Jhullman (talkcontribs) 17:51, 19 April 2016 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified one external link on P-value. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—cyberbot IITalk to my owner:Online 23:51, 4 May 2016 (UTC)

Edit in need of discussion

An edit [1] introduced a change that is worth discussion, replacing, in the overview section, the text:

In frequentist inference, the p-value is widely used in statistical hypothesis testing, specifically in null hypothesis significance testing. In this method, as part of experimental design, before performing the experiment, one first chooses a model (the null hypothesis) and a threshold value for p, called the significance level of the test, traditionally 5% or 1% [1] and denoted as α. If the p-value is less than or equal to the chosen significance level (α), the test suggests that the observed data is inconsistent with the null hypothesis, so the null hypothesis must be rejected. However, that does not prove that the tested hypothesis is true. When the p-value is calculated correctly, this test guarantees that the Type I error rate is at most α. For typical analysis, using the standard α = 0.05 cutoff, a widely used interpretation is:
  • A small p-value (≤ 0.05) indicates strong evidence against the null hypothesis, so it is rejected.
  • A large p-value (> 0.05) indicates weak evidence against the null hypothesis (fail to reject).
  • p-values very close to the cutoff (~ 0.05) are considered to be marginal (need attention).
So, the analysis must always report the p-value, so readers can draw their own conclusions.

with

In frequentist inference, the p-value is widely used in statistical hypothesis testing, specifically in null hypothesis significance testing. In this method, as part of experimental design, before performing the experiment, one first chooses a model (the null hypothesis) and a threshold value for p, called the significance level of the test, traditionally 5% or 1% [1] and denoted as α. If the p-value is less than or equal to the chosen significance level (α), the test suggests that the observed data is inconsistent with the null hypothesis, so the null hypothesis must be rejected. However, that does not prove that the tested hypothesis is true. When the p-value is calculated correctly, this test guarantees that the Type I error rate is at most α. For typical analysis, using the standard α = 0.05 cutoff, the null hypothesis is rejected when p < .05 and not rejected when p > .05. The p-value does not in itself support reasoning about the probabilities of hypotheses but is only a tool for deciding whether to reject the null hypothesis.

The edit does indeed replace a fluffy unsourced discussion with something sharper, but maybe it is worth having some sort of case-by-case account of how to handle p-values. Thoughts? Can we source the interpretation? — Charles Stewart (talk) 22:08, 29 April 2016 (UTC)

Thanks for the input. I may be one of the people who wrote this "fluffy" text, which is certainly sourced (Nature| volume = 506| issue = 7487| pages = 150–152).
My concern, for this page as for many others, is that a random user seeking a quick answer will certainly be discouraged by the amount of ultra-detailed information, which from an information-theory point of view amounts to a large noise covering the useful information ;-).
This is an encyclopedia; it must be useful. This page IMO is now nearly useless. JPLeRouzic (talk) 05:33, 9 August 2016 (UTC)

Misconception on p-values in intro?

I'm just reading about many of the misconceptions on the p-values, and as far as I can see one is reproduced in the intro. The intro states:

When the p-value is calculated correctly, this test guarantees that the Type I error rate is at most α.

Whereas Steve Goodman states:

"Misconception #9: P = .05 means that if you reject the null hypothesis, the probability of a type I error is only 5%. Now we are getting into logical quicksand. This statement is equivalent to Misconception #1, although that can be hard to see immediately. A type I error is a “false positive,” a conclusion that there is a difference when no difference exists. If such a conclusion represents an error, then by definition there is no difference. So a 5% chance of a false rejection is equivalent to saying that there is a 5% chance that the null hypothesis is true, which is Misconception #1."

Goodman, Steven. "A dirty dozen: twelve p-value misconceptions." Seminars in hematology. Vol. 45. No. 3. WB Saunders, 2008. http://www.perfendo.org/docs/BayesProbability/twelvePvaluemisconceptions.pdf

Should we erase the sentence or do I misunderstand it and it tells us a slightly different thing? Hildensia (talk) 20:19, 19 September 2016 (UTC)

A bit late in the reply, but there is nothing wrong with the intro statement. Goodman is warning against conflating p-value with the cutoff rate alpha, which are different concepts. See the definition and interpretation section of the article. Manoguru (talk) 03:38, 17 April 2017 (UTC)

This is one of the most redundant wiki articles I've ever seen

We don't need to redefine the p-value in an "overview" section (that's what the lede is for) and then again in a "definition and interpretation" section and again in a "basic concepts" section. We also don't need to describe critiques of p-values in a "controversy" section and again in a "misunderstandings" section and again in a "criticisms" section.

I've begun to cut out some of the redundant (and largely unsourced) material.

We should define the p-value and note the controversy, but not ramble on redundantly in section after section about either the correct definition or the misinterpretation or both. 164.67.77.247 (talk) 23:14, 7 March 2017 (UTC)

I've done some consolidation to reduce redundancy, but I think more can be done. For those that doubt that the sections were redundant, here's an example:

  • Lede section - " the p-value is the probability that, using a given statistical model, the statistical summary (such as the sample mean difference between two compared groups) would be the same as or more extreme than the actual observed results."
  • "Overview and Controversy" section - "The p-value is defined informally as the probability of obtaining a result equal to or 'more extreme' than what was actually observed, when the null hypothesis is true."
  • "Definition and interpretation" section - "The p-value is defined as the probability, under the assumption of hypothesis H, of obtaining a result equal to or more extreme than what was actually observed."

Another example:

  • "Overview and Controversy" section - "The p-value does not, in itself, support reasoning about the probabilities of hypotheses"
  • Basic Concepts section - "p-values should not be confused with probability on hypothesis"
  • Misunderstandings section - "The p-value does not in itself allow reasoning about the probabilities of hypotheses"
  • Criticisms section - "p-values do not address the probability of the null hypothesis being true or false"
  • Definition and interpretation section (in figure textbox) - "Pr(observation|hypothesis) ≠ Pr(hypothesis|observation) ... The probability of observing a result given that some hypothesis is true is not equivalent to the probability that a hypothesis is true given that some result has been observed." 23.242.207.48 (talk) 05:46, 8 March 2017 (UTC)

There aren't any article layout guides for mathematics, but WP:MEDORDER and the chemistry MOS exist. Following WP:MEDORDER we would end up with something like:

  • Classification -> definition
  • Characteristics -> orthographic note
  • Causes -> basic concepts
  • Mechanism -> calculation
  • Diagnosis -> interpretation
  • Prevention or Screening -> distribution
  • Treatment or Management -> related quantities
  • Epidemiology -> examples
  • History -> history
  • Society and culture -> controversy

as the layout. --Mathnerd314159 (talk) 08:27, 8 March 2017 (UTC)

I disagree with the sentiment regarding redundancy. Since this article is most likely to be accessed by the lay public or by students just getting the hang of statistics, a little redundancy goes a long way in helping to learn and reinforce this concept. This article is of immense public interest, so I think it should read like a primer, with all the relevant concepts available in a single article. As such I do not agree with the recent changes, particularly the deletion of the concept of the null hypothesis test. It would be pretty unhelpful if the reader needed to click a link and read another article to understand this one. For the time being, I am re-inserting the material on the null hypothesis test. Manoguru (talk) 03:47, 17 April 2017 (UTC)

Alleged distinction between "scientific" and "statistical" hypotheses

The following passage is unsourced and unclear, and seems to only add confusion:

"It should be emphasised that a statistical hypothesis is conceptually different from a scientific hypothesis. Therefore, in order to apply the null hypothesis test, the scientific hypothesis should first be converted into a suitable statistical hypothesis. For instance, in a clinical trial, the claim may be that there is a difference between two study groups, whereas its counter-claim would be that there is no such difference. Here, the "no difference between two groups" is a scientific claim, and as such a scientific null hypothesis."

That appears to make no sense. Why is "no difference between groups" not a "statistical" hypothesis? It is in fact the null hypothesis in NHST. Note that after saying that the distinction between "scientific" and "statistical" should be emphasized, that distinction hasn't even been defined. If there is a meaningful difference between the two, it is not explained by the example. If the editor feels strongly about including this passage, the editor should find a reputable, authoritative source on the topic and paraphrase its explanation. — Preceding unsigned comment added by 2605:E000:8443:8D00:A1A2:26F6:EDF3:743A (talk) 21:54, 4 June 2017 (UTC)

The distinction is implied in the definition of the statistical hypothesis, which is given at the beginning of the paragraph. A statistical hypothesis refers to a distribution from which the data are drawn, e.g. a standard normal distribution, a Cauchy distribution, a chi-squared distribution. "No difference between two groups," "Earth is flat", "Earth revolves around the sun" are general statements, not distributions. I guess the failure to recognise this distinction is one of the reasons why people misuse NHST. As such I am reverting the passage back. If you feel like you can improve the overall statements, then please feel free to make changes. Also, as a general courtesy to other editors, please refrain from outright deleting passages without reaching a consensus (or giving some warning) on the talk page. Manoguru (talk) 06:26, 5 June 2017 (UTC)

Manoguru still hasn't provided a source for the contended distinction, so I am removing it in accordance with wiki policy. A single user believing something is true, without proper citations, is not sufficient for inclusion. The burden of consensus and "general courtesy" is on Manoguru in this case. Moreover, Manoguru's explanation for the supposed distinction between "scientific hypothesis" and "statistical hypothesis" appears contradictory. "No difference between two groups" is a null hypothesis in what the article calls "statistical hypothesis testing," yet Manoguru claims "no difference between groups" is only a "scientific hypothesis," not a "statistical hypothesis." Manoguru also claims that statistical hypotheses are about what shape (normal, chi square, etc.) the distribution of the data has. That seems to be a very unconventional definition of "statistical hypothesis." Although p-values can be used in tests of, say, departure from normality, p-values are more often used for tests of mean differences or associations. Indeed, the shape of the distribution is not typically what is being tested in NHST; rather, a particular distribution is more often an assumption required for the validity of the test. 164.67.77.247 (talk) 17:09, 5 June 2017 (UTC)

Distribution of a test statistic?

In the Calculation section, end of first paragraph:

"As such, the test statistic follows a distribution determined by the function used to define that test statistic and the distribution of the input observational data."

I am only an arm-chair math fan, but if I use my think box real hard, I can usually grok the message. Here, I was confused. Are we talking a 'probability' distribution? If so, how does this statement follow from what precedes it? I'm trying to make a picture of it in my mind, and it is going all M.C. Escher on me. OmneBonum (talk) 07:37, 8 July 2017 (UTC)

Opening definition is not general

It should be "The p-value is the upper limit (limit supremum) of the probability of obtaining a random sample with a test statistic more extreme (more contradictory to the null hypothesis) than what was observed when the null hypothesis is true."

Discussion: The given definition is ok for a "simple" null hypothesis, one that completely specifies the distribution of the test statistic. Often tests are "directional" such as H0: mu <= 0 and Ha: mu>0. Note that now H0 can be true for infinitely many values. The p-value is calculated when H0 is true "with equality" meaning calculated with mu=0. Any other value of mu gives a smaller probability of an extreme test statistic. So, the complete definition needs to indicate that the actual p-value results as the upper limit of these probabilities.

It might be nice to say in addition that the p-value is a "worst case" probability of observing a random sample having a test statistic at least as contradictory to the null hypothesis as what was observed, given that the null hypothesis is true. When the null hypothesis includes more than one possible distribution for the test statistic, then the p-value is calculated from the one that gives the largest probability of such an extreme test statistic. Joe Sullivan (talk) 21:07, 12 May 2019 (UTC) Joe Sullivan May 12, 2019
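A small numerical sketch of this "worst case over the null" idea may help. The setup is hypothetical and not taken from the comment above: a one-sided z-test of H0: mu <= 0 against Ha: mu > 0, with known sigma = 1, n = 25, and an assumed observed sample mean of 0.4.

```python
import numpy as np
from scipy import stats

n, sigma, xbar_obs = 25, 1.0, 0.4          # assumed sample size, known sigma, observed mean
se = sigma / np.sqrt(n)

def prob_as_extreme(mu):
    """P(sample mean >= observed mean) when the true mean is mu."""
    return 1 - stats.norm.cdf((xbar_obs - mu) / se)

for mu in [-1.0, -0.5, -0.1, 0.0]:          # all of these values satisfy the null mu <= 0
    print(mu, prob_as_extreme(mu))
# The probability grows as mu rises toward the boundary of the null and is largest at mu = 0,
# so the p-value (the supremum over the null) is prob_as_extreme(0) ≈ 0.0228.
```

This is why, in the one-sided case, the p-value is computed "at mu = 0": that is where the probability of a result at least as extreme is maximized over the whole composite null.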

Current opening sentence literally false

It appears to assume that one can everywhere replace "more extreme" (the correct definition) with "greater", but for many statistics the opposite is true. An example would be the Shapiro–Wilk statistic. (It used to be correct; it bothers me that correct material in the stats articles is so often replaced with incorrect material. Barely any basic stats articles seem safe for long.) Glenbarnett (talk) 01:58, 4 March 2019 (UTC)

I just edited the opening sentence. I thought it was (a) wrong and (b) ambiguous. I changed it to: "In statistical hypothesis testing, the p-value or probability value or significance is, for a given statistical model, the maximal probability that, when the null hypothesis is true, the statistical summary (such as the sample mean difference between two compared groups) would be greater than or equal to the actual observed results." However, I did not check if this corresponds to words in the reference [1]. So I have to check that, fast. Richard Gill (talk) 09:13, 14 April 2019 (UTC)

Regarding the "more extreme" versus "greater" issue: I think this is solved by realising that it makes a difference whether one thinks of the test statistic as being, for example, "T", or "|T|". So I think that it would be fine to use the words "greater than or equal to". In fact, "more extreme" is *wrong*. It should be "as extreme or even more so". [I do realise that in wikipedia, truth is not an issue. The only criterium is what reliable sources say. But as a mathematician I will always look for a compromise in which the truth is not compromised]. Richard Gill (talk) 09:13, 14 April 2019 (UTC)

I have edited the first sentence again. Now I have it reading "In statistical hypothesis testing, the p-value or probability value or significance is, for a given statistical model, the maximal probability that, when the null hypothesis is true, the statistical summary (such as the absolute value of the sample mean difference between two compared groups) would be greater than or equal to the actual observed results". Richard Gill (talk) 09:18, 14 April 2019 (UTC)

Hi Richard, it's unclear to me why you use "maximal probability" instead of "probability". Further, "for a given statistical model, ...., when the null hypothesis is true," seems to be redundant. In fact, the informal definition of the p-value in the ASA statement (https://amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108) is "a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value," which appears clearer to me. Why don't we just use that definition? Lucleon (talk) 11:49, 14 April 2019 (UTC)
Hi Lucleon, yes it *appears* more clear. But appearances can be deceptive. The problem here is that more often than not, the "specified statistical model" does not actually specify the probability distribution of the chosen statistical summary. For instance, consider having a random sample from N(mu, sigma^2), and testing mu <= 0 against mu > 0 (sigma unknown) and using the usual t-statistic. Richard Gill (talk) 10:28, 17 April 2019 (UTC)
Thanks for the reply and the clarification regarding the "maximal probability", which is relevant for composite hypotheses. The formulation "for a given statistical model, the probability that, when the null hypothesis is true," still appears redundant to me, as I think it would be sufficient to refer only to the statistical model, but I may be wrong. Lucleon (talk) 16:00, 17 April 2019 (UTC)
I guess the word “given” may be superfluous. Richard Gill (talk) 19:39, 21 April 2019 (UTC)
I agree with Richard - see my post on this topic. Seems like "upper limit" or "limit supremum" is clearer than "maximally". This refinement, as Richard noted, is only needed when the null hypothesis does not completely specify the distribution of the test statistic.

Joe Sullivan (talk) 21:14, 12 May 2019 (UTC)

Confusing

This has the most confusing, overly-verbose, appositive-laden opening sentence of any article I've ever read on wikipedia. — Preceding unsigned comment added by 2600:1700:B010:AFB0:AD53:9B1:90C1:28D6 (talk) 04:52, 12 October 2019 (UTC)

Case of a composite null hypothesis

If a null hypothesis is composite, there is not just one null-hypothesis probability that your test statistic exceeds the value actually observed, but many. For instance, consider a one-sided test that a normal mean is less than or equal to zero versus the alternative that it is strictly larger than zero. The p-value is computed "at mu = 0" since this gives the *largest* possible probability under the null hypothesis. The ASA definition is inadequate. It is correct for simple null hypotheses, but not in general. We need an authoritative published general-case definition, to show people that they cannot rely narrowly on what the ASA said. Richard Gill (talk) 13:15, 24 August 2020 (UTC)

Notice that an important characteristic of p-value is that in order to have a hypothesis test of level alpha, one rejects if and only if the p-value is less than or equal to alpha. An alternative definition of p-value is: the smallest significance level such that one could still (just) reject the null hypothesis at that level of significance. The level of significance is defined as the *largest* probability under the null hypothesis of (incorrectly) rejecting the null. Hence the p-value is the largest probability under the null of exceeding (or equalling) the observed value of the test statistic. At least, for a one-sided test based on *large* values of some statistic.
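In symbols, that alternative definition can be written as follows (my transcription of the sentence above, not a quotation from a source):

```latex
p(x) \;=\; \inf\bigl\{\, \alpha \in (0,1) \;:\; \text{the level-}\alpha\text{ test rejects } H_0 \text{ for the observed data } x \,\bigr\}
```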

I know that "truth" is not an allowed criterion for inclusion on Wikipedia. One must have reliable sources, and notability. But I do think it is important at least not to tell lies on Wikipedia if it can be avoided, even if many authorities do it all the time. Perhaps those are "lies for children". But children will hopefully grow up and they must be able to find out that they were told a simplified version of the truth. (That's just my personal opinion). Richard Gill (talk) 13:47, 24 August 2020 (UTC)

Here is a good literature reference: Chapter 17 Section 4 (p. 216) [sorry, I first had "Chapter 7"] of "All of Statistics: A Concise Course in Statistical Inference", Springer; 1st edition (2002). Larry Wasserman Richard Gill (talk) 15:15, 24 August 2020 (UTC).

And another: Section 3.3 of Testing Statistical Hypotheses (third edition), E. L. Lehmann and Joseph P. Romano (2005), Springer. Richard Gill (talk) 16:08, 24 August 2020 (UTC)


The editor's point is well taken that in the case of one-sided tests, one could argue that a nonzero effect in the uninteresting direction constitutes a "null hypothesis" scenario, and the probability distribution of the test statistic depends on the exact size of that uninteresting effect. But that appears to be an equivocal definition of the null hypothesis. The p-value is computed based on the point null hypothesis distribution, and there is only one of those. If that issue needs clarification, that clarification should be sourced, clearly explained, and placed in a note somewhere about the special case of one-sided tests--not in the lede, not in the main definition of the p-value, and not using unsourced, unexplained, potentially confusing phrases such as "best probability." Understanding what a p-value is presents enough of a challenge without adding unnecessary complications!
I have examined the two sources that Richard Gill claims support the "best/largest probability" phrasing. But on the contrary, they both support the standard phrasing. Indeed, Lehmann & Romano (2005, p. 64) define the p-value as P0(X > x), where P0 is the probability under the null hypothesis, X is a random-variable test statistic, and x is the observed value of X. And Wasserman (2004) defines the p-value as: "the probability (under H0) of observing a value of the test statistic the same as or more extreme than what was actually observed" (p. 158). The phrases "best probability" and "largest probability" do not appear.
Thus, I see three compelling reasons that it's more appropriate to simply say "probability," rather than "best probability" or "largest probability":
(1) The latter phrasings appear to be nonstandard, as authoritative sources (including the ASA statement) appear to consistently use the former.
(2) The phrases "best probability" and "largest probability" are unclear and potentially confusing.
(3) The arguments by the one editor pushing for the nonstandard phrasing are directly contradicted by that editor's own provided sources. 23.242.198.189 (talk) 09:45, 15 December 2020 (UTC)
Sorry, anonymous editor, you are referring to the places in the text where the authors define the P-value for the case of a simple null hypothesis. If the null hypothesis is composite, P0(X > x) is not even defined!!!! Please register as a user so we can discuss this further. Richard Gill (talk) 16:10, 17 February 2021 (UTC)
I checked the two books I had cited before. For Wasserman, see Chapter 17 Section 4 (p. 216) [I had formerly referred to Chapter 7 by mistake]; for Lehmann and Romano, see Lemma 3.3.1, formula (3.12), which defines the p-value for composite hypotheses. Richard Gill (talk) 14:40, 20 February 2021 (UTC)

If you think the definition should include the words "best probability" or "largest probability," your task is simple: Provide authoritative sources that use that language. You haven't done that. Instead, you've provided two sources that DON'T use that language. There's really nothing to discuss. 23.242.198.189 (talk) 19:30, 13 March 2021 (UTC)

The sources I refer to use strict, concise, precise mathematical language. If you like, I can write out their formulas in words. If necessary we can reproduce the formulas in the article. If you can’t read their formulas, then you have a problem, you will have to rely on those who can. I have a huge number of statistics text books in my office, which I haven’t visited for more than a year. I don’t fancy buying 20 eBooks right now. Richard Gill (talk) 13:07, 15 June 2021 (UTC)

In particular, Wasserman writes in a displayed formula "p-value equals the supremum over all theta in Theta_0 of the probability under theta that T(X^n) is greater than or equal to T(x^n)". Here, T(.) is the statistic you are using (a function of your data); x^n is the data you actually observed; X^n is the data thought of as a random vector with a probability distribution that depends on some parameter theta. Theta_0 is the set of parameter values that constitute the null hypothesis.
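Written out as a display formula, that statement reads (using the notation just described):

```latex
\text{p-value}(x^n) \;=\; \sup_{\theta \in \Theta_0} P_\theta\!\left( T(X^n) \ge T(x^n) \right)
```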

Lehmann and Romano, first of all, define the *size* of a test as the supremum over theta in Omega_H (the null hypothesis) of the probability that the data lies in the rejection region. Next, they suppose that we have a family of tests with nested rejection regions, each with its own size; obviously, size increases as the rejection region gets larger. They then define the p-value to be the smallest size at which the test rejects (formula 3.11). The result is summarized in their Lemma 3.3.1: the p-value is a random variable such that for *all* theta in the null hypothesis, and for any number u between zero and one, the probability that the p-value is less than or equal to u is itself less than or equal to u. I think that "best probability" or "largest probability" are excellent ways to translate "supremum" into plain English whereby I do not bother the reader that a supremum might only be approached arbitrarily closely, not necessarily achieved. In common examples the supremum is achieved, ie it is a maximum. The biggest, or indeed, the best probability...
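And the property in Lemma 3.3.1, as summarized above, can be written (with \hat{p} denoting the p-value regarded as a random variable):

```latex
P_\theta\!\left( \hat{p} \le u \right) \;\le\; u
\qquad \text{for all } \theta \in \Omega_H \text{ and all } 0 \le u \le 1 .
```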

Notice, by the way, that the ASA statement is controversial. The American Statistical Association has charged a committee with publishing a re-appraisal of p-values and significance testing, and it has just come out, https://www.e-publications.org/ims/submission/AOAS/user/submissionFile/51526?confirm=79a17040 . It is not so simplistic or dogmatic.

Here is another reference: An Introduction to Mathematical Statistics; Fetsje Bijma, Marianne Jonker, Aad van der Vaart (2017). ISBN 9789048536115. Amsterdam University Press. Definition 4.19. The p-value is equal to the supremum taken over theta in Theta_0 of the probability under theta that T (your test statistic) is greater than or equal to t (the value you observed of your chosen statistic).

Amusingly, David Cox (2006) Principles of Statistical Inference, does not explicitly define p-values for general composite hypotheses. He focusses on ways of finding methods for getting exact p-values, for instance by conditioning on ancillary statistics, or approximate p-values, for instance from asymptotic theory. Another standard text which avoids the subject is the standard text by Hogg, McKean and Craig. The nearest they get is by saying "Moreover, sometimes alpha is called the 'maximum of probabilities of committing an error of Type I' and the 'maximum of the power of the test when H0 is true'.” This is strange language: they should say that alpha is sometimes *defined* in this way. Consequently, the p-value is *defined* in the way done by the other more mathematically precise authors whom I have cited here. Hogg and Craig say that they want to warn the reader that they may come across different use of language than the language they use. Their book is entitled "Introduction to Mathematical Statistics", but in fact they are not very mathematical.Richard Gill (talk) 17:22, 20 June 2021 (UTC)

Alternating Coin Flips Example Should Be Removed

"By the second test statistic, the data yield a low p-value, suggesting that the pattern of flips observed is very, very unlikely. There is no "alternative hypothesis" (so only rejection of the null hypothesis is possible) and such data could have many causes. The data may instead be forged, or the coin may be flipped by a magician who intentionally alternated outcomes.

This example demonstrates that the p-value depends completely on the test statistic used and illustrates that p-values can only help researchers to reject a null hypothesis, not consider other hypotheses."

Why would there be "no alternative hypothesis?" Whenever there is a null hypothesis (H0), there must be an alternative hypothesis ("not H0"). In this case, the null hypothesis is that the coin-flipping is not biased toward alternation. Consequently, the alternative hypothesis is that the coin-flipping IS biased toward alternation. It seems that author of this passage did not understand what "alternative hypothesis" means. The same confusion is apparent in the claim that p-values can't help researchers "consider other hypotheses." There are other problems with the passage as well (e.g., the unencyclopedic phrase "very, very" and, as another editor noted, a highly arbitrary description). I suggest getting rid of the whole section, which is completely unsourced, is full of questionable claims, is likely to cause confusion, and serves no apparent function in the article. — Preceding unsigned comment added by 23.242.198.189 (talk) 01:50, 24 July 2019 (UTC)

Also, the very concept of coin-flipping that is biased toward alternation is quite odd and not particularly realistic outside of a fake-data scenario. The examples of trick coins that are biased towards one side or the other are much more intuitive, and thus much more useful in my opinion. 23.242.198.189 (talk) 06:55, 24 July 2019 (UTC)

What on Earth is "in my opinion" supposed to mean in an unsigned "contribution"?
FWIW, I agree with that opinion. I have neither seen nor ever heard of a coin being biased to alternate and cannot imagine how one might be made.
David Lloyd-Jones (talk) 08:19, 4 May 2020 (UTC)
I imagine that "in my opinion" means the same thing in an unsigned contribution that it means in a signed contribution. I don't see why that should be confusing or why there would be a need to put quotes around "contribution." 99.47.245.32 (talk) 20:16, 2 January 2021 (UTC)

Actually, most of the examples are problematic, completely unsourced, and should be removed. For instance, the "sample size dependence" example says: "If the coin was flipped only 5 times, the p-value would be 2/32 = 0.0625, which is not significant at the 0.05 level. But if the coin was flipped 10 times, the p-value would be 2/1024 ≈ 0.002, which is significant at the 0.05 level." Huh? How can you say what the p-value will be without knowing what the results of the coin-flips will be? And the "one roll of a pair of dice" example appears to be nonsensical; it's not even clear how the test statistic (the sum of the rolled numbers) is supposed to relate to the null hypothesis that the dice are fair, and the idea of computing a p-value from a single data point is very odd in itself. Thus, the example doesn't seem very realistic or useful for understanding how p-values work and actually risks causing confusion and misunderstanding about how p-values work. Therefore, I suggest that the article would be improved by removing all the "examples" except for the one entitled "coin flipping." 131.179.60.237 (talk) 20:42, 24 July 2019 (UTC)
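For what it's worth, the arithmetic quoted from the "sample size dependence" passage only works if one assumes, as the example apparently intends, that every one of the n flips comes up heads; under that assumption the two-sided p-value is 2/2^n. A tiny reconstruction of that calculation (an assumption about the intended scenario, not sourced):

```python
# Two-sided p-value for observing all heads in n flips of a fair coin:
# under the null of a fair coin, P(all heads) + P(all tails) = 2 / 2**n.
def all_heads_p_value(n):
    return 2 / 2**n

print(all_heads_p_value(5))   # 0.0625       (= 2/32)
print(all_heads_p_value(10))  # 0.001953125  (≈ 2/1024)
```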

This dreadful article nowhere tells us what a p-value test is, nor how one is calculated. It merely pretends to. The whole thing is just a lot of blather about some p-value tests people have reported under the pretence of telling us "what p-values do" or something of the sort.
The promiscuous and incompetent use of commas leaves two or three lists of supposed distinctions muddy and ambiguous.
Given the somewhat flamboyant and demonstrative use of X's and Greek letters, my impression is that this was written by a statistician of only moderate competence who regards himself, almost certainly a 'he', as so far above us all that he need not actually focus on the questions at hand.
David Lloyd-Jones (talk) 08:19, 4 May 2020 (UTC)
Indeed, the article is hopeless. I made some changes a year ago (see "Talk Archive 2") and explained on the talk pages why and what I had done, but that work has been undone by editors who did not understand the difficulties I had referred to. I think the article should start by describing the concept of statistical model: namely a family of possible probability distributions of some data. Then one should talk about a hypothesis: that's a subset of possible probability distributions. Then a test statistic. Finally one can give the correct definition of p-value as the largest probability which any null hypothesis model gives to the value of the statistic actually observed, or larger. I know it is a complex and convoluted definition. But one can give lots of examples of varying level of complexity. Finally one can write statements about p-values which are actually true, such as for instance the fact that *if* the null hypothesis *fixes* the probability distribution of your statistic, and if that statistic is continuously distributed, *then* your p-value is uniformly distributed between 0 and 1 if the null hypothesis is true. I know that "truth" is not a criterion which Wikipedia editors may use. But hopefully, enough reliable sources exist to support my claims. What is presently written in the article on this subject is nonsense. Richard Gill (talk) 14:52, 22 June 2020 (UTC)
I have made a whole lot of changes. Richard Gill (talk) 16:48, 22 June 2020 (UTC)

The article is moving in a good direction, thanks Richard Gill. A point about reader expectations with regard to the article: talk of p-values almost always occurs in the context of NHST; the 'Basic concepts' section is essentially an outline of NHST, but the article nowhere names NHST and Null hypothesis significance testing is a redirect to Statistical inference, an article that is probably not the best introduction to the topic (we also have a redirect from the mishyphenated Null-hypothesis significance-testing to Statistical hypothesis testing). I suggest tweaking the 'Basic concepts' section so that NHST is defined there and have NHST redirect to this article. — Charles Stewart (talk) 19:51, 22 June 2020 (UTC)

Thanks, Chalst; I have made some more changes in the same direction, namely to distinguish between the original data X and a statistic T. This also led to further adjustments to the material on one-sided versus two-sided tests and then to the example of 20 coin tosses. I'm glad more people are looking at this article! It's very central in statistics. The topic is difficult, no doubt about it. Richard Gill (talk) 12:28, 30 June 2020 (UTC)

I reconfigured the Basic Concepts section, building on Gill110951's work and Chalst's comments. I tried to clarify what null hypothesis testing is, what we do in it, and the importance of p-values to it. I focused on stating p-values as rejecting the null hypothesis, and tried to explain the importance of also looking at real-world relevance. (I'm not sure if I should put this here or in a separate section, but it seemed a continuation of what Gill110951 did) TryingToUnderstand11 (talk) 09:55, 20 August 2021 (UTC)

Misleading examples

The examples given are rather misleading. For example, in the section about the rolling of two dice the article says: "In this case, a single roll provides a very weak basis (that is, insufficient data) to draw a meaningful conclusion about the dice."

However it makes no attempt to explain why this is so - and a slight alteration of the conditions of the experiment renders this statement false.

Consider a hustler/gambler who has two sets of apparently identical dice - one of which is loaded and the other fair. If he forgets which is which - and then rolls one set and gets two sixes immediately then it is quite clear that he has identified the loaded set.

The example relies upon the underlying assumption that dice are almost always fair - and therefore it would take more than a single roll to convince you that they are not. However this assumption is never clarified - which might mislead people into supposing that a 0.05 p value would never be sufficient to establish statistical significance. Richard Cant — Preceding unsigned comment added by 152.71.70.77 (talk)

That cheating gambler would be wrong in his conclusion 1 out of 36 times though Yinwang888 (talk) 16:31, 24 November 2021 (UTC)
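A minimal Bayes sketch of why the prior assumption matters in the hustler scenario discussed above; every number here is an assumption chosen for illustration (in particular the chance that the loaded pair shows a double six), not something from the article:

```python
def posterior_loaded(prior_loaded, p_sixes_if_loaded=0.5, p_sixes_if_fair=1/36):
    """P(the rolled set is the loaded one | a double six was observed), by Bayes' theorem."""
    num = prior_loaded * p_sixes_if_loaded
    den = num + (1 - prior_loaded) * p_sixes_if_fair
    return num / den

print(posterior_loaded(0.5))    # ~0.95: with a 50/50 prior, one double six is nearly conclusive
print(posterior_loaded(0.001))  # ~0.02: if loaded dice are assumed rare, the same roll proves little
```

Both readings are consistent with the 1-in-36 false-alarm rate mentioned just above; what changes the conclusion is the unstated prior about how common loaded dice are.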

Recent edits

It is rather traditional that values of 5% and 1% are chosen as the significance level. In fact the value of p itself is an indication of the strength of the observed result. Whether or not the null hypothesis may be rejected is also a matter of 'taste'. But in any case, a small p-value does suggest that the observed data are sufficiently inconsistent with the null hypothesis. Madyno (talk) 09:47, 8 December 2021 (UTC)

The .05 level is by far the most conventional level. The .01 level is sometimes used but much more rarely. But in any case, the "Usage" section was mainly just a repetition of what had already been said in the "Basic Concepts" section and the "Definition and Interpretation" section, so I've trimmed it down considerably. A section that's just restating what's already been said doesn't need to give so much detail (if it needs to exist at all). 23.242.195.76 (talk) 02:28, 15 December 2021 (UTC)

Does the hyphenization indeed vary?

"As far as I'm aware, APA guidelines say you have to italicize every statistic, period. Saying "p value" is no different than saying "DP value". I mean, it's not a symptom of dropping the hyphen, but merely a situation where the topic was the value of p, rather than the p-value. Whether that makes sense, i.e., that there really exists a difference between these situations which justifies the different styling, I do not know. But I'm under the impression that that's how people use it. It's the rationalization that I have been able to do, since I have seen many articles formatted under APA style that use "p-value" at some point. ~victorsouza (talk) 16:57, 17 March 2022 (UTC)

In short, yes, hyphenization does indeed vary. I've seen "p value" with and without a hyphen in APA journals. In AMA journals (such as JAMA), I've typically seen "p value" or "P value" unhyphenated. But in American Statistical Association sources, I nearly always see "p-value" hyphenated. Regarding your claim that "APA guidelines say you have to italicize every statistic, period," there's no such guideline. In fact, the official APA style blog (https://blog.apastyle.org/apastyle/hyphenation/) explicitly recommends "t test" not be hyphenated unless used as an adjective (e.g., "t-test results"). 172.91.120.102 (talk) 05:01, 25 April 2022 (UTC)

continuous variables

Note that in the statistics of continuous variables, the probability that a variable will have any specific value is zero. (Unless it comes from a delta function.) In a statistical sense < and <= are the same. In numerical approximations, one might have to be more careful, but then that comes from the process of doing the approximation, not from the statistics itself. Gah4 (talk) 20:56, 25 April 2022 (UTC)

True. And in nearly all real-world circumstances, saying something like p ≤ .05 is indeed equivalent to saying something like p < .05. But not all variables are continuous. For some situations involving count data, you can end up with p-values that are rational numbers, and in theory the p-value could even be exactly .05. So I see the purpose of the "less than or equal to" language for the sake of more universal technical correctness, even if only to accommodate unlikely theoretical cases. 172.91.120.102 (talk) 06:06, 26 April 2022 (UTC)
Hmm, OK. In most problems that I know, either the p variable is continuous, or close enough to continuous that assuming it is, is close enough. In the cases where it isn't, I am not so sure it makes sense either way. That is, if you have a problem where the difference between < and <= seems important, there is probably something else to worry about more. Gah4 (talk) 05:04, 27 April 2022 (UTC)
I agree that there is likely no practical example where there is a consequential distinction between p < .05 and p ≤ .05. Even when the p-value is from a discrete distribution and is a rational number, I don't think it's plausible for it to be exactly .05 except in a highly contrived theoretical scenario. That said, I don't really see a drawback to using the ≤ symbol rather than the < symbol if that placates some theoretical quibble. 134.69.229.134 (talk) 19:24, 27 April 2022 (UTC)
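To make the count-data point concrete, here is a tiny sketch (a hypothetical sign-test-style example, not taken from the article) showing that an exact binomial p-value is a rational number, so in principle it could land exactly on a cutoff:

```python
from fractions import Fraction
from math import comb

def binom_p_value(k, n):
    """Exact one-sided p-value for observing k or more heads in n tosses of a fair coin."""
    return sum(Fraction(comb(n, i), 2**n) for i in range(k, n + 1))

print(binom_p_value(9, 10))         # 11/1024, an exact rational number
print(float(binom_p_value(9, 10)))  # ~0.0107
```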
  1. ^ a b Nuzzo, R. (2014). "Scientific method: Statistical errors". Nature. 506 (7487): 150–152. doi:10.1038/506150a.