Component exposures in population and subgroups
The matrix \(V\) of the SNMU solution is used to group individuals with similar mixture exposure profiles. Figure 84 illustrates the idea of clustering.
Crépet et al. (2022) propose to identify components by coupling statistical criteria with the relevance of combined exposure profiles and component composition. First, the optimal value of \(k\), the number of components, is determined as a trade-off between the decrease in the residual sum of squares and the number of components. Then, hierarchical clustering is applied to the matrix of individual scores \(V\) to group individual(day)s with similar exposure profiles to the \(k\) components. This identification of components is repeated for different values of \(k\). A mixture is rejected when inspection shows that its component is not relevant to characterize any cluster, or concerns only a small part of the population.
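The trade-off between the decrease in the residual sum of squares and the number of components can be illustrated with a simple elbow rule. The following is a minimal sketch, not the criterion used by Crépet et al. (2022) or by MCRA: the function name `choose_k`, the threshold `min_relative_gain` and the RSS values are all hypothetical and serve only to show the idea.

```python
def choose_k(rss, min_relative_gain=0.10):
    """Return the number of components after which adding one more pays off too little.

    rss[i] is the residual sum of squares of the fit with k = i + 1 components.
    min_relative_gain is a hypothetical threshold, not an MCRA setting.
    """
    for i in range(1, len(rss)):
        if rss[i - 1] <= 0:
            return i  # the fit is already exact with k = i components
        # Relative drop in RSS obtained by going from k = i to k = i + 1.
        gain = (rss[i - 1] - rss[i]) / rss[i - 1]
        if gain < min_relative_gain:
            return i  # keep k = i; the extra component adds too little
    return len(rss)


# Made-up RSS values for k = 1..6, for illustration only.
rss_by_k = [120.0, 70.0, 40.0, 37.0, 35.5, 35.0]
best_k = choose_k(rss_by_k)
print(best_k)
```

With these made-up values the RSS drops sharply up to three components and flattens afterwards, so the rule stops at \(k = 3\). In practice the choice is also checked against the relevance of the resulting components and clusters, as described above.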
In MCRA, two clustering methods are available. The first, hierarchical clustering, is implemented as described in Crépet et al. (2022): Ward’s clustering criterion is applied to Euclidean distances (Ward.D2; Murtagh and Legendre, 2014). The number of clusters does not need to be specified in advance. The results of the clustering are displayed in a dendrogram (Figure 85). The second method, K-means, requires the number of clusters to be specified. Its results are shown in a scatter plot of the principal components, with convex envelopes identifying the clusters (Figure 87).
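To make Ward’s criterion concrete, the sketch below performs a naive agglomerative clustering on a toy score matrix. It is an illustration under stated assumptions, not the MCRA implementation: the function name `ward_d2` and the data are hypothetical, the search over pairs is deliberately simple rather than efficient, and the merge heights are put on the Euclidean distance scale (two single points merge at their mutual distance), which is assumed here to match the Ward.D2 convention.

```python
import math


def ward_d2(points):
    """Naive agglomerative clustering with Ward's criterion on Euclidean distances.

    Returns the merges, in order, as (members_a, members_b, height) tuples.
    """
    # Each active cluster is stored as (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = len(ma) * len(mb) / (len(ma) + len(mb)) * sq
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        n = len(ma) + len(mb)
        centroid = [(len(ma) * x + len(mb) * y) / n for x, y in zip(ca, cb)]
        # Height on the Euclidean scale: two single points merge at their distance.
        merges.append((sorted(ma), sorted(mb), math.sqrt(2.0 * cost)))
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((ma + mb, centroid))
    return merges


# Toy score matrix V: 5 individuals (rows) by k = 2 components (columns).
V = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]]
merges = ward_d2(V)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

The ordered list of merges and heights is the information a dendrogram displays. Cutting it at a chosen height yields the clusters, which is why the number of clusters does not have to be fixed beforehand.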
Figure 86 shows the relative exposure to the components in the total population. These plots are also available for the subgroups resulting from the clustering.
The advantages of K-means clustering are that it is simple and fast and can handle large datasets. Its results are also straightforward to visualise for large datasets, whereas the dendrogram of a hierarchical clustering may become very dense. The disadvantage of K-means is that the number of clusters must be set beforehand. For large datasets, hierarchical clustering may be slow, \(O(n^2)\), but for small datasets the dendrogram helps in interpreting the results and in selecting the optimal number of clusters.
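The requirement to fix the number of clusters beforehand follows directly from how K-means works, as the minimal sketch of Lloyd's algorithm below shows. It is not the MCRA implementation: the function name `kmeans`, the toy data and the choice to start from the first \(k\) rows are assumptions made for illustration. Practical implementations usually use random or more careful starting centres.

```python
def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; the first k rows are used as starting centres."""
    centres = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each individual goes to its nearest centre.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centres[c])),
            )
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
    return labels, centres


# Toy score matrix V: 6 individuals (rows) by 2 components (columns).
V = [[0.0, 0.0], [5.0, 5.0], [0.0, 1.0], [5.0, 6.0], [1.0, 0.0], [6.0, 5.0]]
labels, centres = kmeans(V, k=2)
print(labels)
print(centres)
```

Each pass only compares every individual with the \(k\) centres, so no full matrix of pairwise distances is needed. This is what keeps the method fast on large datasets.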