Pre-treatment of soil X-ray powder diffraction data for cluster analysis

X-ray powder diffraction (XRPD) is widely applied for the qualitative and quantitative analysis of soil mineralogy. In recent years, high-throughput XRPD has produced soil XRPD datasets containing thousands of samples. The effort required by conventional approaches to soil XRPD data analysis is currently prohibitive for such large datasets, creating a need for computational methods that can aid in defining soil property – soil mineralogy relationships. Cluster analysis of soil XRPD data is a rapid method for grouping data into discrete classes based on mineralogical similarities, and thus allows sets of mineralogically distinct soils to be defined and investigated in greater detail. Effective cluster analysis requires minimisation of sample-independent variation and maximisation of sample-dependent variation, which entails pre-treatment of XRPD data to correct for common aberrations associated with data collection. A 2⁴ factorial design was used to investigate the most effective data pre-treatment protocol for the cluster analysis of XRPD data from 12 African soils, each analysed once by each of five different personnel. The sample-independent effects of displacement error, noise and signal intensity variation were addressed using peak alignment, binning and scaling, respectively. The sample-dependent effect of strongly diffracting minerals overwhelming the signal of weakly diffracting minerals was addressed using a square-root transformation. Without pre-treatment, the 60 XRPD measurements failed to provide informative clusters. Pre-treatment via peak alignment, square-root transformation and scaling each resulted in significantly improved partitioning of the groups (p < 0.05). Pre-treatment via binning reduced the computational demands of cluster analysis, but did not significantly affect the partitioning (p > 0.1).
Applying all four pre-treatments proved to be the most suitable protocol for both non-hierarchical and hierarchical cluster analysis. Deducing such a protocol is considered a prerequisite to the wider application of cluster analysis in exploring soil property – soil mineralogy relationships in larger datasets.
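
To make the design concrete, the following Python sketch enumerates the 16 on/off combinations of the four pre-treatments and applies a chosen combination to a single pattern. It is an illustration only, not the implementation used in the study: the function names, the integer-shift alignment against a reference pattern, the bin width of 5 points, the zero-mean/unit-variance scaling and the order in which the steps are applied are all assumptions made for this example.

```python
from itertools import product

import numpy as np

# The four pre-treatments named in the abstract; the order is an assumption.
STEPS = ("align", "sqrt", "bin", "scale")


def align(pattern, reference, max_shift=20):
    """Shift `pattern` by the integer lag that best matches `reference`.

    Simplified stand-in for peak alignment. np.roll wraps around the ends,
    so this only suits small shifts on patterns with little signal at the edges.
    """
    shifts = range(-max_shift, max_shift + 1)
    best = max(shifts, key=lambda s: float(np.dot(np.roll(pattern, s), reference)))
    return np.roll(pattern, best)


def sqrt_transform(pattern):
    """Compress strong peaks relative to weak ones (negative counts clipped to 0)."""
    return np.sqrt(np.clip(pattern, 0.0, None))


def bin_pattern(pattern, width=5):
    """Average adjacent points in groups of `width`, dropping any remainder."""
    n = (len(pattern) // width) * width
    return pattern[:n].reshape(-1, width).mean(axis=1)


def scale(pattern):
    """Rescale to zero mean and unit standard deviation (assumed scaling choice)."""
    centred = pattern - pattern.mean()
    std = pattern.std()
    return centred / std if std > 0 else centred


def pretreat(pattern, reference, use):
    """Apply the pre-treatments flagged True in `use`, a dict keyed by STEPS."""
    out = np.asarray(pattern, dtype=float)
    ref = np.asarray(reference, dtype=float)
    if use["align"]:
        out = align(out, ref)
    if use["sqrt"]:
        out = sqrt_transform(out)
    if use["bin"]:
        out = bin_pattern(out)
    if use["scale"]:
        out = scale(out)
    return out


def factorial_design():
    """Return all 2**4 = 16 on/off combinations of the four pre-treatments."""
    return [dict(zip(STEPS, flags)) for flags in product((False, True), repeat=len(STEPS))]
```

Each dictionary returned by `factorial_design()` can be passed as `use` to `pretreat`, so that the same clustering procedure can be run on all 16 versions of the dataset and the resulting partitions compared.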