class: center, middle, inverse, title-slide

# Workshop: Limits of Agreement
## Bland-Altman methods for assessing agreement of clinical measurements
### Sam Gardiner
### Cell & Molecular Therapies, Royal Prince Alfred Hospital
### 14 April 2021

---

# The _Limits of Agreement_ method

Bland, J. M. and D. G. Altman (1986). "Statistical methods for assessing agreement between two methods of clinical measurement". In: _The Lancet_ 327.8476, pp. 307-310. DOI: [10.1016/s0140-6736(86)90837-8](https://doi.org/10.1016%2Fs0140-6736%2886%2990837-8).

- The 29th most-cited paper of all time! (Noorden, Maher, and Nuzzo, 2014)
- Still the gold standard for measuring agreement between continuous clinical measurements.
- Simple enough to do by hand (in Excel) if needed, but also available in almost all statistical software: R, GraphPad Prism, SAS etc.

---
class: middle, center, inverse

# Agreement

---

# Agreement

- It is often useful to compare two methods of measuring some clinical parameter. For example:
  - One-stage vs. chromogenic FIX activity
  - Axillary vs. tympanic temperature
  - NucleoCounter vs. CELL-DYN cell counts
- If the two methods "agree" (within clinically meaningful limits), you might be able to retire the more expensive, more laborious or otherwise less convenient method.

---

# Not agreement

## Correlation

- What about `\(r\)`, the standard (Pearson product-moment) correlation coefficient?

--

- `\(r\)` measures linear correlation between two variables, not agreement.
- Two measurement methods can be perfectly linearly correlated, but not agree.
- Being correlated just means that two variables tend to go up or down together.
- `\(r\)` depends on the spread of the data: two variables measured over a wide range will tend to have a larger `\(r\)` than similar variables measured over a narrow range, even if the degree of agreement is the same.
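
---

# Not agreement

## Correlation: a small numerical check

The point that perfect correlation does not imply agreement can be checked with a few lines of code. This is a minimal sketch in Python using only the standard library; the readings in `method_a` and `method_b` are made-up numbers chosen for illustration, and `pearson_r` is a helper written here, not part of any package.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical readings: method B always reports 2 * A + 5.
method_a = [10.0, 20.0, 30.0, 40.0, 50.0]
method_b = [2 * a + 5 for a in method_a]

r = pearson_r(method_a, method_b)
mean_difference = mean(b - a for a, b in zip(method_a, method_b))

print(f"r = {r:.3f}")
print(f"mean difference (B - A) = {mean_difference:.1f}")
```

The two hypothetical methods lie exactly on a straight line, so `r` is 1, yet method B reads 35 units higher than method A on average and the gap grows with the size of the measurement.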
---
class: middle

# Not agreement

## Perfectly correlated, but not in agreement

<img src="index_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

---

# Not agreement

## Perfectly correlated, but not in agreement

<img src="index_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

---

# Not agreement

## Even worse:

<img src="index_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />

Data: Anscombe (1973)

---

# Not agreement

## Calibration

- Is measuring agreement the same as calibration?

--

- Generally, **no**.
- Calibration compares a single method against a ground truth.
- Agreement compares two imperfect methods (which are assumed to have measurement error) with each other.
- If the "ground truth" isn't particularly precise, agreement and calibration may be the same concept.

---

# Not quite agreement

## Repeatability

- Repeatability is a closely related concept: if a measurement method agrees with itself over repeated measurements, it is _repeatable_.
- The Bland-Altman _Limits of Agreement_ methods also work well for assessing repeatability.

---
class: middle, center, inverse

# Assessing agreement

---

# Eyeball the data

.pull-left[

- Plot:
  - each method against the other
  - the line of equality (the line with slope 1, passing through the origin)
- Do the observations lie approximately along the line of equality?
- Are there any obvious systematic differences?

]

.pull-right[

<img src="index_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />

PEFR: Peak expiratory flow rate, a measure of lung function.

]

---

# The _Limits of Agreement_ method

0. Decide on a clinically acceptable threshold of agreement. Use clinical reasoning or published evidence. For example, you might consider methods in agreement if they are within
   - 5 mmHg for blood pressure
   - 0.1 pH units for blood pH
   - 5% clotting activity for a FIX assay

0. 
Visualise the difference between the two methods against the magnitude of the measurements.
   - magnitude: estimate with the mean of the two methods
   - difference: subtract one method from the other

0. Find the bias and the standard deviation of the differences. The bias is the mean difference between the methods.

0. Find the limits of agreement:
   - `\(\text{Limits} = \text{Bias} \pm 1.96 \times \text{SD(differences)}\)`

0. Critically appraise:
   - are there systematic differences between the methods?
   - is the scale of the difference the same over the range of the measurements?
   - are the 95% limits of agreement within the predefined clinically meaningful threshold?

---

# Why 1.96?

<img src="index_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />

---

# Anatomy of a Bland-Altman plot

.pull-left[

| Subject| Large meter| Mini meter|
|-------:|-----------:|----------:|
|       1|         494|        512|
|       2|         395|        430|
|       3|         516|        520|
|       4|         434|        428|
|       5|         476|        500|
|       6|         557|        600|
|       7|         413|        364|
|       8|         442|        380|
|       9|         650|        658|
|      10|         433|        445|

]

.pull-right[

## Example dataset:

Comparison of **p**eak **e**xpiratory **f**low **r**ate (PEFR in L/minute) by a large Wright peak flow meter and a mini Wright meter, measured in the same subjects. Bland and Altman (1986). 
]

---

# Anatomy of a Bland-Altman plot

<img src="index_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" />

---

# Anatomy of a Bland-Altman plot

<img src="index_files/figure-html/unnamed-chunk-9-1.svg" style="display: block; margin: auto;" />

---

# Anatomy of a Bland-Altman plot

<img src="index_files/figure-html/unnamed-chunk-10-1.svg" style="display: block; margin: auto;" />

---

# Anatomy of a Bland-Altman plot

<img src="index_files/figure-html/unnamed-chunk-11-1.svg" style="display: block; margin: auto;" />

---

# Anatomy of a Bland-Altman plot

<img src="index_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" />

---

# Anatomy of a Bland-Altman plot

<img src="index_files/figure-html/unnamed-chunk-13-1.svg" style="display: block; margin: auto;" />

---

# Assessing agreement

## Assumptions

The PEFR example relies on some assumptions about the data:

- That there is no systematic change to the degree of agreement over the range of the measurements.
- That the measurement error (the difference between the two measurements) is normally distributed.
  - The limits of agreement and confidence intervals rely on this assumption to be accurate, but should be OK with other distributions as long as the sample size isn't tiny.

---

# Assessing agreement

## What if there _is_ a systematic difference?

Bland and Altman suggest two remedies:

- Working with percentage difference instead of absolute difference
- Log-transforming your data.

---

# Systematic difference

.pull-left[

- Paired plasma samples from the SPK-9001-101 participants were measured by a chromogenic FIX activity assay at the trial central laboratory, and by a one-stage assay at each site's local laboratory.
- `\(n=15\)` participants with `\(147\)` measurements.

]

.pull-right[

<img src="index_files/figure-html/unnamed-chunk-14-1.svg" style="display: block; margin: auto;" />

.center[Robinson, George, Carr, et al. 
(2021)]]

---

# Systematic difference

<img src="index_files/figure-html/unnamed-chunk-15-1.svg" style="display: block; margin: auto;" />

---

# Systematic difference

## Plotting the absolute difference would be a mistake

<img src="index_files/figure-html/unnamed-chunk-16-1.svg" style="display: block; margin: auto;" />

---

# Systematic difference

## Plotting the absolute difference would be a mistake

<img src="index_files/figure-html/unnamed-chunk-17-1.svg" style="display: block; margin: auto;" />

---

# Systematic difference

## Plot the percentage or ratio difference, instead

<img src="index_files/figure-html/unnamed-chunk-18-1.svg" style="display: block; margin: auto;" />

---

# More to learn...

- Repeated measures versions where each method is used to measure a sample or individual multiple times.
  - Paired or unpaired?
  - Constant underlying true value, or time-dependent? (Bland and Altman, 2007)
- Applications to transcriptomics (gene expression) data: the MA plot (Dudoit, Yang, Callow, et al., 2002).

---

# References and further reading

Anscombe, F. J. (1973). "Graphs in Statistical Analysis". In: _The American Statistician_ 27.1, pp. 17-21. DOI: [10.1080/00031305.1973.10478966](https://doi.org/10.1080%2F00031305.1973.10478966).

Bland, J. M. and D. G. Altman (1986). "Statistical methods for assessing agreement between two methods of clinical measurement". In: _The Lancet_ 327.8476, pp. 307-310. DOI: [10.1016/s0140-6736(86)90837-8](https://doi.org/10.1016%2Fs0140-6736%2886%2990837-8).

Bland, J. M. and D. G. Altman (2007). "Agreement Between Methods of Measurement with Multiple Observations Per Individual". In: _Journal of Biopharmaceutical Statistics_ 17.4, pp. 571-582. DOI: [10.1080/10543400701329422](https://doi.org/10.1080%2F10543400701329422).

Dudoit, S., Y. H. Yang, M. J. Callow, et al. (2002). "Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments". In: _Statistica Sinica_ 12, pp. 111-139.

Noorden, R. 
V., B. Maher, and R. Nuzzo (2014). "The top 100 papers". In: _Nature_ 514.7524, pp. 550-553. DOI: [10.1038/514550a](https://doi.org/10.1038%2F514550a).

---

# References and further reading

Robinson, M. M., L. A. George, M. E. Carr, et al. (2021). "Factor IX assay discrepancies in the setting of liver gene therapy using a hyperfunctional variant factor IX-Padua". In: _Journal of Thrombosis and Haemostasis_. DOI: [10.1111/jth.15281](https://doi.org/10.1111%2Fjth.15281).
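
---

# Worked example: limits of agreement in code

The steps of the _Limits of Agreement_ method can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not the code used to draw the plots in these slides. It uses only the ten PEFR subjects listed in the example dataset table, and it assumes the difference is taken as large meter minus mini meter; the variable names are chosen here for illustration.

```python
from statistics import mean, stdev

# PEFR (L/minute) for the ten subjects shown in the example dataset table.
large_meter = [494, 395, 516, 434, 476, 557, 413, 442, 650, 433]
mini_meter = [512, 430, 520, 428, 500, 600, 364, 380, 658, 445]

# x-axis of the Bland-Altman plot: mean of the two methods per subject.
magnitudes = [(a + b) / 2 for a, b in zip(large_meter, mini_meter)]

# y-axis: difference between the methods (assumed direction: large - mini).
differences = [a - b for a, b in zip(large_meter, mini_meter)]

# Bias is the mean difference; the limits use the sample SD of the differences.
bias = mean(differences)
sd_differences = stdev(differences)

lower_limit = bias - 1.96 * sd_differences
upper_limit = bias + 1.96 * sd_differences

print(f"bias = {bias:.1f} L/minute")
print(f"SD of differences = {sd_differences:.1f} L/minute")
print(f"95% limits of agreement: {lower_limit:.1f} to {upper_limit:.1f} L/minute")
```

Plotting `differences` against `magnitudes`, with horizontal lines at `bias`, `lower_limit` and `upper_limit`, gives the Bland-Altman plot. The final step is still a clinical judgement: compare the limits with the threshold of agreement decided in advance.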