Tukey's B method

Tukey's B method, also known as the Tukey-Kramer B procedure or Tukey's Wholly Significant Difference (WSD), is a post-hoc multiple comparison test used to identify which group means differ significantly from one another after an analysis of variance (ANOVA) has produced a statistically significant result.^[1] It is considered a compromise between two other popular multiple comparison procedures: Tukey's range test, also called the honestly significant difference (HSD) test, and the Newman-Keuls method.^[2]

The primary purpose of post-hoc tests like Tukey's B is to control the family-wise error rate (FWER) when performing multiple comparisons. Without such control, the probability of making at least one Type I error increases with the number of comparisons made.^[3]
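The growth of the error rate can be illustrated with a short Python sketch. It assumes, purely for illustration, that the comparisons are independent, in which case the probability of at least one Type I error across $m$ comparisons is $1-(1-\alpha )^{m}$. Pairwise comparisons of group means are not actually independent, so the figure is only an approximation. The function names are hypothetical.

```python
def n_pairwise(k: int) -> int:
    """Number of distinct pairs among k group means."""
    return k * (k - 1) // 2


def fwer_independent(alpha: float, m: int) -> float:
    """Chance of at least one Type I error in m tests,
    ASSUMING the tests are independent (an idealisation)."""
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    k = 5                      # illustrative number of groups
    m = n_pairwise(k)          # 10 pairwise comparisons
    print(m, round(fwer_independent(0.05, m), 4))
```

With five groups there are ten pairwise comparisons, and under the independence assumption an uncorrected $\alpha =0.05$ per comparison gives a family-wise error rate of about 0.40.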

History and context

The development of multiple comparison procedures stems from the work of Ronald Fisher, John Tukey and others in the mid-20th century. Tukey's HSD test is a conservative method that guarantees the FWER does not exceed the chosen significance level (e.g., $\alpha =0.05$ ). The Newman-Keuls (NK) method, by contrast, provides higher statistical power but is known to be anti-conservative: it does not strictly control the FWER as the number of groups increases.^[2]

Tukey's B method was introduced to provide an intermediate level of conservatism. It seeks to balance the strict error control of HSD with the greater sensitivity to differences offered by Newman-Keuls.^[1]

Methodology

Tukey's B method operates by comparing all possible pairs of means. For each pair, it calculates a critical value based on the studentized range distribution.

The three procedures differ in how the critical value is chosen. Tukey's HSD uses a single critical value $q_{\text{HSD}}$ derived from the total number of groups ( $k$ ). Newman-Keuls uses critical values $q_{\text{NK}}$ that vary with the number of steps between the ordered means being compared ( $r$ ). Tukey's B takes its critical value ( $q_{\text{B}}$ ) to be the simple arithmetic mean of the critical values obtained from those two procedures:^[1]

$$q_{\text{B}}={\frac {q_{\text{HSD}}+q_{\text{NK}}}{2}}$$

The absolute difference between two means, $\vert {\bar {X}}_{i}-{\bar {X}}_{j}\vert$ , is then compared against a critical difference value:

$${\text{CD}}_{\text{B}}=q_{\text{B}}{\sqrt {{\frac {{\text{MS}}_{\text{error}}}{2}}\left({\frac {1}{n_{i}}}+{\frac {1}{n_{j}}}\right)}}$$

where:

${\text{MS}}_{\text{error}}$ is the mean squared error from the ANOVA, and
$n_{i}$ and $n_{j}$ are the sample sizes of the groups being compared.

If $\vert {\bar {X}}_{i}-{\bar {X}}_{j}\vert >{\text{CD}}_{\text{B}}$ , the difference is declared statistically significant.
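The decision rule above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, and the two studentized range critical values are taken as inputs (for example, read from a table) instead of being computed. The numbers in the example are illustrative only: four groups of four observations (12 error degrees of freedom), an assumed ${\text{MS}}_{\text{error}}$ of 8.0, and critical values of roughly 4.20 and 3.08, which are approximately the tabulated 5% points for 4 and 2 means at 12 degrees of freedom.

```python
import math


def tukey_b_q(q_hsd: float, q_nk: float) -> float:
    """Tukey's B critical value: the arithmetic mean of the HSD
    and Newman-Keuls critical values."""
    return (q_hsd + q_nk) / 2.0


def critical_difference(q: float, ms_error: float, n_i: int, n_j: int) -> float:
    """Critical difference for one pair of group means."""
    return q * math.sqrt((ms_error / 2.0) * (1.0 / n_i + 1.0 / n_j))


def is_significant(mean_i: float, mean_j: float, q_hsd: float, q_nk: float,
                   ms_error: float, n_i: int, n_j: int) -> bool:
    """True when the absolute mean difference exceeds CD_B."""
    cd_b = critical_difference(tukey_b_q(q_hsd, q_nk), ms_error, n_i, n_j)
    return abs(mean_i - mean_j) > cd_b


if __name__ == "__main__":
    # Illustrative inputs only (assumed, not taken from any data set).
    q_hsd, q_nk = 4.20, 3.08   # approx. 5% points for k = 4 and r = 2, df = 12
    ms_error, n = 8.0, 4
    q_b = tukey_b_q(q_hsd, q_nk)
    print(round(q_b, 2), round(critical_difference(q_b, ms_error, n, n), 3))
    print(is_significant(21.0, 15.0, q_hsd, q_nk, ms_error, n, n))
```

Taking the critical values as arguments keeps the sketch free of third-party dependencies. Where SciPy 1.7 or later is available, they could instead be obtained from `scipy.stats.studentized_range.ppf(1 - alpha, k, df)`, with `k` set to the total number of groups for $q_{\text{HSD}}$ and to the number of steps $r$ for $q_{\text{NK}}$.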

Characteristics and comparison with other methods

Tukey's B method is a standard post-hoc option in statistical packages such as SPSS,^[1] and provides a middle ground for researchers:

Error rate control: it offers better control over the family-wise error rate than the Newman-Keuls method, but is less conservative than Tukey's HSD.^[2]
Statistical power: it generally has greater statistical power than Tukey's HSD, making it more likely to detect true differences.^[1]

Statistical criticism

In contemporary statistical practice, the procedure has largely fallen out of favor due to several factors:

Theoretical grounding: unlike the Tukey HSD, which is rooted in the distribution of the studentized range, Tukey's B lacks a rigorous mathematical justification for its averaging approach.
Error rate control: because it is a hybrid, it does not guarantee the same level of family-wise error rate protection as more modern, stepwise procedures.
Availability of alternatives: more powerful and theoretically sound procedures, such as the Ryan-Einot-Gabriel-Welsch (REGW) procedures and the Fisher-Hayter test, have rendered Tukey's B largely obsolete in most modern statistical software packages.^[4]^[5]

References

1. Larson-Hall, J. (2009). A guide to doing statistics in second language research using SPSS. Routledge. doi:10.4324/9780203875964. ISBN 978-1-135-59474-9.
2. McHugh, M. L. (2011). "Multiple comparison analysis testing in ANOVA". Biochemia Medica. 21 (3): 203–209. doi:10.11613/bm.2011.029. PMID 22420233.
3. Mishra, Prabhaker; Singh, Uttam; Pandey, Chandra M; Mishra, Priyadarshni; Pandey, Gaurav (2019). "Application of student's t-test, analysis of variance, and covariance". Annals of Cardiac Anaesthesia. 22 (4): 407–411. doi:10.4103/aca.aca_94_19. PMC 6813708. PMID 31621677.
4. Field, Andy (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.). SAGE Publications. p. 459. ISBN 978-1446249185. "Tukey's B is an ad hoc compromise... Generally, if you want a stepwise test, the REGWQ is the best choice."
5. De Muth, James E. (2014). Basic Statistics and Pharmaceutical Statistical Applications. Pharmacy Education Series (3rd ed.). CRC Press. pp. 250–251. ISBN 9781466596740. "Tukey's-b is a compromise between the HSD and SNK tests... it is generally considered less robust than modern stepwise procedures like REGWQ."
