## Abstract

A background to the use of empirical indices (i.e. ratios and sums) in exploration geochemistry is presented, together with commentary on more sophisticated statistical multivariate procedures. Multivariate statistical procedures can assist in developing geochemical models upon which further investigations can be based, and in identifying geochemically anomalous samples. A case is made for a simple method, weighted sums, that is based on prior knowledge concerning the mineralogy and geochemistry of sought-after mineral resources. This procedure avoids many of the complications and pitfalls of more sophisticated multivariate statistical methods. Although weighted sums were introduced to exploration geochemistry over 20 years ago, they do not appear to have been used extensively. The objective of this paper is to reintroduce them, with a simple but clear exposition, as a tool worthy of consideration in the knowledge-based 21st century.

## INTRODUCTION

The usefulness of multivariate data analysis methods applied to geochemical data has been well documented (Howarth & Sinding-Larsen 1983). Methods such as principal components analysis (PCA), cluster analysis, and projection pursuit regression provide numerical and graphical means with which the relationships of the elements and samples can be investigated. Many of these techniques are dimension reducing, that is, they reduce the number of variables required to describe the variation in the data. An interpretation of the systematic relationships of 30 elements is almost impossible without applying some form of dimension reducing technique. The outcome of these techniques yields significantly fewer, ‘new’, variables that represent the data, and it is anticipated that these can be related to specific geochemical/geological processes. The use of robust estimates for means, standard deviations, and correlation coefficient or covariance matrices assists significantly in assessing multi-element relationships. These robust estimators reduce the impact of outliers that can distort the linear relationships that are obtained in methods such as PCA. Several reviews discuss the basics of multivariate data analysis techniques (Jöreskog *et al*. 1976; Howarth & Sinding-Larsen 1983; Davis 1986). Other methods include non-linear mapping (Sammon 1969), and multi-dimensional scaling (Kruskal 1964). Mellinger (1987) provides a systematic approach to the application of multivariate methods in geological studies.

In many modern geochemical surveys, large numbers of samples are analysed for as many as 60 elements, and multivariate methods can be valuable in examining relationships between samples and elements. The evaluation and interpretation of these data typically involve the recognition of geochemical processes that represent underlying lithological variation, surficial processes and mineralization.

At a workshop of the 10th International Geochemical Exploration Symposium (IGES), Espoo, Finland, 1983, Mellinger *et al.* (1984) provided a scheme for geochemical anomaly interpretation based on the integration of geochemical, geographical and geophysical data. Three critical steps were defined: preliminary data analysis, descriptive multivariate analysis, and specific multivariate analysis. In the context of today's data analytical tools, these translate to the following sequence for multivariate geochemical data (Grunsky 2000):

- visual data inspection using histograms, box & whisker, probability (Q-Q), and scatter plots;

- investigation of outliers for each element (analytical error or atypical abundance?);

- summary statistical tables;

- depending on a prior visual examination of probability plots, etc., investigation of the transformation of each element using logarithmic, square root or Box-Cox power transforms (Howarth & Earle 1979) using samples below the 95th–98th percentile;

- the use of methods to elucidate structure in the data, such as cluster analysis, multidimensional scaling, and projection pursuit, which may isolate groups of samples with similar characteristics. Atypical samples stand out as single outliers. Target groups can often be isolated using these methods. Maps of the locations of the groups can help to identify mineralized areas, or areas where particular geochemical processes dominate;

- at this point it may be appropriate to subdivide the data into separate subsets if the data set is clearly poly-populational, with different data structures (means, standard deviations and correlation matrices) in each subset;

- the use of χ^{2} plots of Mahalanobis distances, possibly using transformed data, to identify outliers based on all of the elements of interest. This procedure assists in the elimination of outliers for the creation of ‘clean’ background and target groups that can be subsequently used in canonical variate analysis and for allocation/typicality procedures. Maps of large Mahalanobis distances (>95th percentile) will identify atypical sites or areas worthy of investigation;

- application of dimension reducing techniques, such as PCA, possibly using transformed data, to identify systematic linear relationships between the variables and the samples. The use of robust methods will likely assist in detecting outliers. Dimension reducing techniques also indicate which pathfinder elements are associated with commodity elements. Maps of component scores can assist in outlining regional lithological variation and areas that are anomalous; and

- the calculation of empirical indices which are specifically tailored to situations where multi-element associations are well understood. The indices are based on linear combinations of pathfinder elements with coefficients that are selected for each area and commodity being sought. Maps can be prepared and samples with high indices can be investigated for mineralization potential.

Empirical ratios have had a long history of successful use in exploration geochemistry, and this paper was stimulated by a presentation demonstrating an effective use of ratios by Rebagliati (1999) at the 19th IGES in Vancouver.

Hawkes & Webb (1962) report on the use of ratios as early as the 1920s and 1930s in bedrock geochemistry studies, and Beus & Grigorian (1977) describe the extensive use of ratios by Russian geochemists. Most of the early work used simple ratios of one element against another; however, these were extended by Russian geochemists (see Beus & Grigorian 1977) to multi-element ratios and additive indices (sums). The use of more complex additive indices was investigated by Smith and co-workers in Australia (e.g. Smith & Perdrix 1983; Smith *et al.* 1987, 1989). These studies included the use of indices derived as scores for individual samples based on the presence of pathfinder elements in excess of threshold levels (Smith *et al.* 1987). This scoring procedure was similar to methods also developed in Canada and the United States (Garrett *et al.* 1980; Chaffee 1983) for use in regional mineral assessment programs.

For those not averse to using more statistically sophisticated procedures, principal component and factor analysis have held an allure for many exploration geochemists as tools for reducing the number of maps to be prepared and inspected. The expectation is that these derivative maps will relate more closely to the geochemical processes active in a survey area than single element maps, and thus be easier to interpret and provide better guides to undiscovered mineral resources. There are a variety of approaches to the problem of undertaking these statistical analyses, and the review and exposition by Howarth & Sinding-Larsen (1983) still stands today. In regional surveys where the mineralization-related features sought are under-represented in the dataset (i.e. are swamped by an abundance of background data), these features, when extracted by principal component/factor analysis, may be represented as components with little or no significance and are often overlooked (Grunsky 2000). As a result, these important, but under-represented, processes are only reflected unambiguously, if at all, in the minor components of the model, which also tend to include the ‘noise’, i.e. imprecision and chance relationships due to variability, in the dataset. For example, in an early application of factor analysis to geochemical exploration data (Garrett & Nichol 1969), the reflection of gold-bearing banded ironstones was in the 9th of 13 components, which accounted for only 2.1% of the data set variability. One attempt to circumvent this problem was the development of ‘characteristic analysis’ by Botbol *et al.* (1978). Characteristic analysis undertakes a PCA of data for a feature of interest (e.g. anomalous geochemical samples related to the mineral deposit type being sought) and estimates weights for a computed sum from the first principal component through the data. These weights are then applied to compute scores for the larger survey data set under investigation. However, suitably large ‘training’ data sets, remembering that there should be upwards of nine times as many individual samples as there are variables (Garrett 1993), are often not available, thus reducing the attractiveness of this procedure.

An alternative to using a data set from the feature of interest, as in characteristic analysis, is to use knowledge concerning that feature of interest. In weighted sums, *a priori* knowledge is used to choose the weights, the relative importances of each variable, according to some geochemical, mineralogical and/or metallogenetic model that is considered significant to an investigation. This model could be for a mineral deposit type as yet unknown in the region under study, but considered to have economic potential if one could be found. Weighted sums are new variables derived from multivariate (multi-element geochemical) data. Like principal component or factor scores, they have a mean of zero and a standard deviation of one, and as such may be treated and interpreted as standardized or normal scores. The major difference compared to traditional multivariate statistical procedures is that with a weighted sum the user is in firm control. This is distinct from principal component or factor analysis, where the weights for score calculations are based on the covariance/correlation matrix for the dataset, which is frequently dominated by background geochemical relationships. In addition, an analysis based on the covariance/correlation matrix may be influenced by statistical outliers (see, for example, Beckman & Cook 1983), i.e. geochemical anomalies, so that the resulting components or factors are difficult to interpret. What are known as statistically robust procedures can be used to reduce these problems, and have been used by geochemists to aid in anomaly recognition (e.g. Garrett 1983, 1989, 1990; Howarth & Sinding-Larsen 1983; Zhou 1985; Chork 1990; Chork & Rousseeuw 1992). However, these procedures can become complex, and have problems of their own.

In all these cases the simple procedure of weighted sums may be usefully applied with a degree of confidence, and with user knowledge-based control. The weighted sums may even include well-established ratios that make sense in terms of some model, such as U:Th or Cu:Ni ratios. Weighted sums do not seem to have been extensively used by exploration geochemists since their introduction to the field over 20 years ago (Kane 1977; Garrett *et al.* 1980); but those who have used them have often found them to be very effective.

In this paper, the Howarth & Sinding-Larsen (1983) Norwegian stream sediment test data (see Appendix 1) are used to demonstrate the application of weighted sums, and the results are then compared with those of PCA and multivariate χ^{2} approaches. These data have been used extensively in comparisons of statistical and numerical procedures of interest to exploration geochemists by Howarth & Sinding-Larsen and Garrett (1983, 1989, 1990).

## WEIGHTED SUMS METHODOLOGY

The relative importances, $r_j$ (hereafter termed ‘importances’), of the $m$ variables, $j = 1, \ldots, m$, to be included in the weighted sum are defined in ratio form. Thus, as a four-variable example, 2:1:3:1 indicates that variable 3 is three times more important (i.e. indicative of the sought-after process or feature) than variables 2 and 4, which have equal importance. If high (above mean) values are important, the sign of the importance is positive; conversely, if low (below mean) values are important the sign is negative. The individual, $j$th of $m$, importances are converted to weights, $w_j$, that sum to one by dividing each importance by the sum of the absolute values of the importances, i.e. the sum ignoring the signs of the importances:

$$w_j = \frac{r_j}{\sum_{k=1}^{m} |r_k|} \quad (1)$$

which, for the four-variable example, yields weights of 0.286, 0.143, 0.428 and 0.143. A requirement of the weighted sums procedure is that the sum of the squares of the final weights, the coefficients $a_j$, also equals one. This is achieved by the following transformation, which divides each weight by the square root of the sum of the squares of the weights:

$$a_j = \frac{w_j}{\sqrt{\sum_{k=1}^{m} w_k^2}} \quad (2)$$

which for the example yields coefficients of 0.517, 0.259, 0.774 and 0.259.
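As an illustration, the two conversions just described can be sketched in a few lines of Python (the function name is my own, not from the paper):

```python
# Importances -> weights (absolute values sum to one) -> coefficients
# (squares sum to one), as described in the text above.
import math

def coefficients(importances):
    total = sum(abs(r) for r in importances)   # sum of the |r_j|, ignoring signs
    w = [r / total for r in importances]       # weights w_j
    norm = math.sqrt(sum(x * x for x in w))    # square root of the sum of w_j^2
    return [x / norm for x in w]               # coefficients a_j

print([round(x, 3) for x in coefficients([2, 1, 3, 1])])
# [0.516, 0.258, 0.775, 0.258]
```

Computing from unrounded weights gives 0.516, 0.258, 0.775 and 0.258; the slightly different values quoted above (0.517, 0.259, 0.774, 0.259) arise from rounding the weights to three decimals before the second step.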

The next step is to compute the normal scores of the $m$ variables used in the weighted sum for each of the $i$, $i = 1, \ldots, n$, individual geochemical samples whose weighted sum is required. To compute the normal scores, estimates of the mean, $\bar{x}_j$, and standard deviation, $s_j$, are required for each of the $m$ variables to be used. These summary statistics, which are quantitative expressions of geochemical level and relief, should represent the background properties of the variables, geochemical analyses, in the study area. To achieve this, statistically robust estimates of the mean and standard deviation are preferable. Such estimates are relatively uninfluenced by high and low values in the data set that may reflect features or processes other than the geochemical background, which is frequently related to bedrock lithology and secondary environmental processes (e.g. stream sediment and soil formation). The advantage of using a robust estimate that focuses on the background population(s) is that it tends to accentuate the presence of non-background individuals, which are of the greatest interest in mineral exploration.

A wide range of methods are available in statistics packages to make robust estimates. However, the simplest procedure, and one that does not require a sophisticated statistics package, is to use the median, $M_j$, as an estimator of the robust mean, $\bar{x}_j^*$, and to make a robust estimate of the standard deviation, $s_j^*$, from the Interquartile Range, IQR, where the IQR is the difference between the 3rd and 1st quartiles (75th and 25th percentiles) of the data. The IQR is also approximately equal to the H-spread or the Midrange displayed by some statistics packages, and these estimates may be used instead of the IQR. Thus:

$$\bar{x}_j^* = M_j \quad (3)$$

and

$$s_j^* = 0.7413 \times \mathrm{IQR}_j \quad (4)$$

For the large data sets often encountered in geochemical exploration programs, the high resistance of the median and IQR to outliers usually makes initial trimming of obvious outliers and fliers unnecessary. However, it is always prudent to check to see if better background estimates can be obtained by some judicious trimming (removing) of extreme values prior to computation of the median and IQR.
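The median/IQR background estimates can be sketched with the Python standard library alone (the function name is illustrative, not from the paper):

```python
# Robust background estimates: median for the mean, 0.7413 * IQR for the
# standard deviation. Function name is an illustration only.
from statistics import NormalDist, quantiles

def robust_background(values):
    """Return (robust mean, robust standard deviation) for one element."""
    q1, median, q3 = quantiles(values, n=4)    # 25th, 50th and 75th percentiles
    return median, 0.7413 * (q3 - q1)

# The constant is the reciprocal of 1.349: for a standard normal distribution
# the quartiles sit 0.67449 standard deviations either side of the mean
# (see the Footnotes).
assert round(1 / (2 * NormalDist().inv_cdf(0.75)), 4) == 0.7413
```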

Finally, the normal scores for the individual samples, and then the weighted sums, may be computed as follows. Firstly, the normal score, $z_{ij}$, for the $j$th variable of the $i$th case in the data set is computed from the geochemical analyses, $x_{ij}$, using the robust estimates of mean and standard deviation:

$$z_{ij} = \frac{x_{ij} - \bar{x}_j^*}{s_j^*} \quad (5)$$

and the weighted sum for the $i$th case in the data set, $WS_i$, may be computed from the normal scores, $z_{ij}$, and the weights, $a_j$, by:

$$WS_i = \sum_{j=1}^{m} a_j z_{ij} \quad (6)$$

### Weighted sums example

The problem addressed in the Howarth & Sinding-Larsen (1983) test data set (Appendix 1) is how to differentiate anomalies associated with a Zn–Cd–Cu mineral occurrence from false anomalies, particularly involving Zn, associated with sections of stream drainage characterized by Fe– and Mn–oxy-hydroxide precipitation. It can be argued that the Cu data provide the key, and the problem can be resolved simply by using those data. However, in order to demonstrate the value of weighted sums, it is assumed that no Cu data are available. Thus, with this metallogenetic and geochemical model in mind, importances of 2, 1, −1 and −1 were ascribed to Zn, Cd, Fe and Mn, respectively; in the final weighting the signs for Fe and Mn are negative to reflect that high Fe and Mn levels are considered an indication of oxy-hydroxide precipitation and false anomalies. Then, using the steps outlined previously, the robust estimates of mean and standard deviation are derived together with the weights (Table 1), and the weighted sums computed (Table 2).
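For concreteness, the coefficients implied by these importances follow arithmetically from the weighting steps described in the Methodology section (a sketch of the arithmetic only):

```python
# Coefficients implied by the importances 2, 1, -1, -1 chosen for
# Zn, Cd, Fe and Mn: weights from the sum of absolute importances,
# then rescaled so the squares of the coefficients sum to one.
import math

r = [2, 1, -1, -1]                                   # Zn, Cd, Fe, Mn importances
w = [x / sum(abs(v) for v in r) for x in r]          # weights: 0.4, 0.2, -0.2, -0.2
a = [x / math.sqrt(sum(y * y for y in w)) for x in w]
print([round(x, 3) for x in a])                      # [0.756, 0.378, -0.378, -0.378]
```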

A cumulative probability plot of the weighted sums (Fig. 1) demonstrates that the three stream sediment samples (13, 15 & 16) collected below the known Zn–Cd–Cu showing exhibit unique values that discriminate them from the remaining samples. Of the three individual samples plotting at the upper tail of the background population, the two highest (12 & 9) reflect downstream dispersion from the showing below the confluence of the stream draining the showing with a larger stream. The high-Zn falsely anomalous samples (22 & 23), due to the presence of Fe– and Mn–oxy-hydroxides, have been relegated to the tail of the background population. Thus, the weighted sums procedure has differentiated the five samples directly reflecting the presence of the showing from the remaining data, which include a number of false anomalies resulting from the scavenging of Zn, and other metals, by Fe– and Mn–oxy-hydroxide precipitates in the stream bed.

This application may seem trivial, but its purpose has been to demonstrate the procedure. Garrett & Geddes (1991) have shown how weighted sums may be used to screen much larger sets of stream sediment data, acquired in a regional stream sediment survey of Jamaica, in order to identify areas favourable for gold mineralization, and to search for evidence of low-temperature hydrothermal mineral assemblages at surface that would indicate the presence of Cu-porphyry systems buried at depth.

## APPLICATION OF FORMAL STATISTICAL PROCEDURES: PCA AND χ^{2} PLOTS

RQ-mode PCA (Zhou *et al.* 1983; Grunsky 2001) was undertaken using the correlation matrix for the Zn, Cd, Fe and Mn subset as the measure of association between the elements. The resulting principal component loadings and percentages of the variation associated with each component are presented in Table 3. The first two components account for 93.24% of the variability in the four elements. An inspection of Table 3 and Fig. 2 reveals that the first principal component (PC1) is an indicator of mineralization, with strong positive loadings for all elements, and that the second principal component (PC2) is an indicator of Fe and Mn abundance, such that high levels of these elements lead to positive loadings. Figure 3 and Table 2 provide a comparison of the weighted sums and the scores on the second principal component (PC2). The relative rankings are similar, and the extreme positive scores on PC2 correspond to the extreme negative weighted sums. Both the RQ-mode PC2 and weighted sums procedures distinguish two samples (22 & 23) exclusively associated with Fe and Mn, and three samples (13, 15 & 16) associated with Zn and Cd enrichment independent of Fe– and Mn–oxy-hydroxide interference.
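For readers wishing to experiment, a generic correlation-matrix PCA can be sketched with NumPy. This is a plain eigendecomposition illustration with names of my own choosing, not the RQ-mode algorithm of Zhou *et al.* (1983) itself:

```python
# PCA of a samples-by-elements matrix via the correlation matrix:
# eigendecomposition, components ordered by decreasing eigenvalue.
import numpy as np

def pca_correlation(X):
    """Return loadings (columns), % of variance per component, and scores
    for X with samples as rows and elements as columns."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each element
    R = np.corrcoef(X, rowvar=False)                   # m x m correlation matrix
    eigval, eigvec = np.linalg.eigh(R)                 # eigh returns ascending order
    order = np.argsort(eigval)[::-1]                   # largest eigenvalue first
    eigval, eigvec = eigval[order], eigvec[:, order]
    pct = 100.0 * eigval / eigval.sum()                # % of total variability
    return eigvec, pct, Z @ eigvec                     # loadings, %, scores
```

With the Zn, Cd, Fe and Mn data, the loadings and percentage figures play the roles of the Table 3 quantities discussed above.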

As mentioned in the Introduction, the presence of outliers can significantly affect estimates of mean and covariance/correlation. Exploration geochemists are familiar with the fact that a few high values will significantly increase the estimate of a data set mean. By applying ‘robust’ estimation methods the effects of outliers can be diminished. By way of an example, the method of the minimum volume ellipsoid (Rousseeuw 1985) was employed to compute robust estimates of the means and correlation matrix for the Zn, Cd, Fe and Mn subset of the Howarth & Sinding-Larsen data. These estimates were then used to re-compute the PCA; the resulting robust PC2 scores (RPC2) are presented in Table 2. While the relative ordering of the samples remains largely unchanged, the absolute scores for the extreme samples have increased. The two falsely anomalous samples (22 & 23) have had their negative scores decreased by factors of 3 to 7, and the scores of the three truly anomalous samples (13, 15 & 16) have been increased by factors of 1.3 to 2.4, making both extremes more noticeably different from the rest of the data (i.e. core background).

Multivariate probability (χ^{2}) plots may also be used to identify geochemically anomalous samples (Garrett 1989). The method is based on estimating the Mahalanobis distance, the multivariate extension of the normal score (see equation 5), of a sample from the data centroid (the multivariate mean). Again, this procedure uses robust estimators for the mean and covariance matrix to obtain the best possible estimates of the core background data relationship. Applications of this technique have been employed for assessing geochemical data in lateritic terrains (Grunsky 2000). Table 2 presents the Mahalanobis distances for the test data set estimated using the same minimum volume ellipsoid procedure as used in the previously presented robust PCA. Figure 4 presents a visual comparison. The Mahalanobis distances clearly identify the atypical individuals as the extreme members of the data set farthest from the centroid (15, 16, 13, 23, 12, 22 & 9). These individuals include both the truly anomalous (15, 16, 13, 12 & 9) and the falsely anomalous (22 & 23) samples. An additional sample (10) with a high Mahalanobis distance is a weak false anomaly. Viewed overall, the extreme and outlying Mahalanobis distances form a ‘V’ when plotted against the weighted sums (Fig. 4). The left arm includes samples influenced by high Fe and Mn levels, the false anomalies, and the right arm includes the Zn- and Cd-rich samples indicative of the mineral showing. Interestingly, the remaining samples can be divided into two groups (Fig. 4), as would also be revealed in a probability (χ^{2}) plot of the Mahalanobis distances. The upper group are weakly anomalous samples due to both the mineral occurrence (21) and Fe–Mn accumulation (24 & 25), and the lower group, which itself appears to have a further three subsets, represents the true background samples closest to the data centroid. Rephrased geochemically, the Mahalanobis distances identify the anomalous individuals as the extreme members of the data set most different from the average background.
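The Mahalanobis distance calculation itself is straightforward to sketch with NumPy; in practice the robust minimum volume ellipsoid estimates would be supplied as `mean` and `cov` rather than the classical estimates used by default here (names are illustrative):

```python
# Squared Mahalanobis distance of each sample from the data centroid.
import numpy as np

def mahalanobis_sq(X, mean=None, cov=None):
    """d_i^2 = (x_i - mean)' C^-1 (x_i - mean) for every row of X at once;
    defaults to the classical (non-robust) mean and covariance of X."""
    mean = X.mean(axis=0) if mean is None else mean
    cov = np.cov(X, rowvar=False) if cov is None else cov
    diff = X - mean
    return np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)

# With an identity covariance the distance reduces to the squared Euclidean
# distance, e.g. the point (3, 4) lies at d^2 = 25 from the origin.
```

The resulting d² values can then be compared against χ² quantiles with degrees of freedom equal to the number of elements, as in the χ² plots described above.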

## DISCUSSION

A comparison between the weighted sums, regular and robust principal component scores, and Mahalanobis distances demonstrates that the maximum contrast (i.e. ratio of anomalous level to average background; Rose *et al.* 1979) is achieved with robust Mahalanobis distances (*c.* 10 000) or weighted sums (*c.* 10). Although the former provides the maximum contrast, the computed values give no indication of the direction (high or low) from the centroid (average background) in which the individuals are extreme. By appropriate choice of sign when selecting weights, or relative importances, the weighted sum provides such additional information; in the present example (Fig. 4), as to whether the anomaly may be true and related to mineralization (positive), or false and related to Fe– or Mn–oxy-hydroxide accumulations (negative). An additional practical point is that, whilst sophisticated computer software is required to compute robust Mahalanobis distances, weighted sums can be generated with a spreadsheet.

Principal component approaches provide decreased contrast, with a robust procedure providing the greatest, as would be expected when outliers (anomalous individuals) are down-weighted at the initial stages of computation. However, the advantage of principal components is that they provide a simultaneous visual assessment of both samples and variables during exploratory investigation of a data set (Fig. 2). This is useful in its own right in gaining an understanding of the structure of the data set (its internal trends and relationships) and identification of outliers. In the current context, PCA can provide useful guidance in determining the weights and signs for a weighted sums scheme when there is inadequate *a priori* geochemical and mineralogical knowledge available on which to base a scheme.

It should be noted that no transformation was applied to the data prior to estimation of means and covariances/correlations. The need for a transformation (e.g. log_{10}) can be argued; it would likely have had the effect of further reducing the contrast between the extremes of the regular PCA scores. The reason for employing a transform would be to maintain homogeneity of variance, as required in least-squares based procedures. However, the use of the robust minimum volume ellipsoid procedure has the same effect, down-weighting the extreme members that can cause heterogeneity of variance problems. As there is no reason to expect that the background data in this small data set are drawn from a lognormal distribution, it was appropriate in this case not to transform the data. However, it must be noted that this may not be the case with larger regional data sets, where the range of the background data may extend beyond an order of magnitude due to the presence of geochemically contrasting bedrock lithologies. In such cases, the application of a logarithmic or Box-Cox transform to the data may be appropriate prior to PCA or Mahalanobis distance calculations.

## CONCLUSION

The weighted sums procedure is a practical and simple way of combining multi-element geochemical data to focus on particular metallogenetic and/or geochemical associations. It is designed to highlight relatively rare occurrences of geochemical patterns in large data sets, and the weights are based on prior knowledge concerning the subject of the search. If this reliance on prior knowledge is seen as a deficiency, principal component or factor analysis will provide an initial assessment of the relationships in the data that can assist in determining what kind of weighting scheme might be employed. Conversely, it could be argued that, when inspecting Table 2, the similarity of the component score results to the weighted sums facilitates the interpretation of the component because of the geochemical knowledge on which the weighted sum is founded.

The simplicity of the weighted sums procedure, its direct connection to relevant available geochemical knowledge, and the ease with which weighted sums can be calculated with current statistical software, or a spreadsheet, and then spatially displayed with GIS packages, merit its greater use as a screening tool to highlight data subsets exhibiting patterns of relevance to mineral exploration and other studies.

## Acknowledgments

We wish to thank our colleague Graeme Bonham-Carter for his review of the typescript and his useful suggestions for improvements.

## Footnotes

For a normal distribution, the Interquartile Range, IQR, spans the central 50% of the data, extending 0.67449 standard deviation units above and below the mean. Thus the IQR covers a range of 1.3490 standard deviation units in total, and the standard deviation, s, is equal to 0.7413 × IQR for a normal distribution, where 0.7413 is the reciprocal of 1.3490.

- © 2001 AAG/The Geological Society of London