Cluster analysis

Mixtures are identified by applying sparse nonnegative matrix underapproximation (SNMU) (Gillis and Plemmons, 2013) to the exposure (or hbm-concentration) data matrix. The SNMU solution approximates the exposure matrix \(M\) \((m \times n)\) by two nonnegative matrices \(U\) and \(V\) of lower dimensions, \((m \times k)\) and \((k \times n)\), with \(m\) the number of substances, \(n\) the number of individual(day)s and \(k\) the number of mixtures. Matrix \(U\) represents the weights of the substances per mixture and matrix \(V\) contains the coefficients representing the presence of each mixture per individual(day). The nonzero entries in each column of \(V\) indicate the mixtures to which the individual was exposed. The higher the value, the higher the contribution of the mixture to the total exposure.
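The factorization \(M \approx UV\) can be illustrated with a minimal sketch in Python. SNMU itself is not available in scikit-learn, so standard nonnegative matrix factorization (NMF) is used here as a stand-in; the exposure matrix and its dimensions are synthetic, chosen only to mirror the shapes described above.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
m, n, k = 10, 50, 3              # substances, individual(day)s, mixtures
M = rng.random((m, n))           # synthetic nonnegative exposure matrix

# NMF as a stand-in for SNMU: M (m x n) is approximated by U (m x k) times V (k x n)
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
U = model.fit_transform(M)       # substance weights per mixture
V = model.components_            # mixture coefficients per individual(day)
```

A large entry in column \(j\) of `V` indicates that mixture's strong contribution to the exposure of individual(day) \(j\), as described above.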

Crépet et al. (2022) propose to identify mixtures by coupling statistical criteria with the relevance of combined exposure profiles and mixture composition. First, the optimal choice for \(k\), the number of mixtures, is determined using a trade-off between the decrease of the residual sum of squares and the number of mixtures. Then, hierarchical clustering is applied to the matrix of coefficients \(V\) to group individual(day)s with similar exposure profiles to the \(k\) mixtures. This identification of mixtures is repeated for different values of \(k\), where mixtures that are not relevant for characterising a cluster, or that concern only a small part of the population, are rejected upon inspection.
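The second step, grouping individual(day)s by their mixture coefficients, can be sketched with SciPy's hierarchical clustering. The coefficient matrix `V` and the choice of six clusters are illustrative assumptions, not part of the method's specification.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
k, n = 3, 40
V = rng.random((k, n))           # mixture coefficients, as from the factorization

# Cluster the individual(day)s, i.e. the columns of V, with Ward linkage
Z = linkage(V.T, method="ward")

# Cut the tree into (at most) 6 clusters; labels run from 1 to 6
labels = fcluster(Z, t=6, criterion="maxclust")
```

The linkage matrix `Z` is also what a dendrogram plot would be drawn from (e.g. with `scipy.cluster.hierarchy.dendrogram`).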

In MCRA, two clustering methods are available. The first, hierarchical clustering, is implemented as described in Crépet et al. (2022). Specification of the optimal number of clusters is not needed. Results of the clustering are displayed in a dendrogram. The second one, based on K-Means, requires the number of clusters. The results of the clustering are represented in a scatter plot using principal components and convex envelopes to identify the clusters. Advantages of K-Means clustering are that it is simple and fast and can handle large datasets. Visualisation for large datasets is straightforward, whereas for hierarchical clustering dendrograms may become very dense. A disadvantage of K-Means is that it requires the number of clusters to be set in advance. For large datasets, hierarchical clustering may be slow, \(O(n^2)\), but for small datasets the dendrogram helps in interpreting the results and in selecting the optimal number of clusters.
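The K-Means alternative, including the principal-component coordinates used for the scatter plot, can be sketched as follows. This is an illustration with synthetic coefficients, not MCRA's implementation; the number of clusters (6) must be supplied up front, which is the disadvantage noted above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
k, n = 3, 40
V = rng.random((k, n))           # mixture coefficients per individual(day)

# K-Means on the columns of V; the number of clusters must be set in advance
km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(V.T)
labels = km.labels_

# Project onto the first two principal components for the scatter plot
coords = PCA(n_components=2).fit_transform(V.T)
```

Plotting `coords` coloured by `labels` (e.g. with matplotlib) gives the kind of scatter plot shown in the K-Means figure below; convex envelopes around each cluster can be drawn with `scipy.spatial.ConvexHull`.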

../../../_images/hierarchical.svg

Figure 69 Hierarchical clustering of human monitoring data, 6 clusters, largest and smallest cluster contain 275 and 81 individuals, respectively

../../../_images/kmeans.svg

Figure 70 K-means clustering of human monitoring data, 6 clusters, largest and smallest cluster contain 400 and 19 individuals, respectively