Abstract
Geochemical data are complex, and becoming more so as analytical techniques advance. This complexity gives rise to many data management issues that have no simple solutions. The Geological Survey of Canada (GSC) has a long history of collecting and managing geochemical data. The challenges of managing the data are continuously evolving. The emergence of the WWW is causing a revolution in how scientific data are delivered. This paper documents how one group within the GSC is responding to this rapidly changing environment.
The first geochemical surveys were carried out in the Soviet Union in the 1930s (Garrett et al. 2008). Modern surveys are conceptually very similar to earlier surveys, but are considerably more complex in their details. The task of managing geochemical data has grown as the volume of data has grown. Several papers have been published that describe ‘best practices’ for managing these data (e.g. Darnley et al. 2005; Jarva et al. 2010; Johnson et al. 2005; Lehnert et al. 2000; Smith et al. 2003). The EarthChem initiative, http://www.earthchem.org, (Lehnert 2008; Lehnert & Brantley 2008) is particularly ambitious in its goal of bringing together all aspects of lithogeochemical data. The work described in this paper is different in that it attempts to bring together geochemical data from a wide variety of sources into a standardized structure.
The complexity in modern surveys arises from several sources:
the number of samples collected in a single survey can sometimes reach several thousand, and national or continental-scale projects may involve numerous surveys carried out over several years, to the same (or similar) specifications;
several different sample types may be collected at each sample site (glacial sediment, multiple soil horizons, water, vegetation, etc.);
a sample may be processed in various ways, before being subjected to chemical analysis (crushing, sieving, heavy and/or magnetic mineral separation, ashing, etc.);
the sample may be analysed by a variety of different methods, by different laboratories, over a time period of years or even decades.
Figure 1 is a simple flow chart that summarizes the ‘business process’ of conducting a geochemical survey. The challenge in managing geochemical data lies not in the volume, but in the attributes of the data, generally referred to as metadata. In order to interpret the results of a geochemical survey, it is essential to know exactly what was collected, how it was prepared and how it was analysed.
Figure 1. A simple ‘business process model’ for conducting a geochemical survey.
Traditionally, data from a geochemical survey have been published as a series of spreadsheets (or tables, in a printed document), accompanied by a text report, describing the collection, preparation and analysis of the samples. The quality of these publications has varied enormously. Many reports are missing critical pieces of information, and there is no standard approach to how the data are presented.
The data management approach presented in this paper provides a solid framework to standardize the management of geochemical data. However, it remains the responsibility of individual scientists to ensure that the metadata are accurately and completely recorded. Encouragement from organizational management can be helpful.
To describe the geochemical database, it is necessary to use computer jargon and acronyms. These are briefly explained in Appendix I and further information can be found by searching online.
Geochemical Data Management History at the GSC
Garrett (Appendix II) describes the history of computer usage in applied geochemistry at the Geological Survey of Canada (GSC). The following paragraphs give a very condensed history, documenting those aspects which have led to the database development activity that is the focus of this paper.
The GSC has been carrying out geochemical surveys since the 1950s (Brummer et al. 1987). In the 1970s, a Canada-wide programme to collect stream and lake sediment and water samples, following well-defined procedures, was begun. This evolved into the National Geochemical Reconnaissance (NGR) Programme (Friske & Hornbrook 1991). NGR protocols were consistent from year to year so that it is valid to combine data across NGR surveys to produce maps showing regional patterns (Painter et al. 1994).
Because the NGR data were so extensive and potentially useful, it was highly desirable to store all of the data in a single database, to simplify data management tasks. Work towards this goal began in the late 1980s. Oracle software, installed initially on a VAX minicomputer, and later on OS/2 and Solaris computers, was chosen as the relational database management system (RDBMS). Limitations of computing hardware and software, combined with organizational restructuring, caused the project to stall in the early 1990s.
Simultaneous with the NGR programme, many scientists across the GSC were carrying out geochemical surveys, collecting a wide variety of sample media, and sending the samples to many different laboratories for analysis. There was no standardized GSC-wide approach to how these surveys were conducted and published, or to how the data were archived, and this continues to be the case. The result was that, by the year 2000, most of the GSC’s geochemical data were stored in many places and formats, and were difficult for non-GSC researchers to discover.
Efforts to standardize the data holdings began in 2003 (Spirito et al. 2004). A simple database was created to catalogue surficial geochemical surveys carried out by the GSC. The catalogue provided only high-level metadata for each survey – information such as when and where the survey was carried out, and where the data were published. Initially there was no attempt to store any analytical data in the underlying database because of limited resources. The catalogue made it much easier for people to discover that the GSC’s geochemical data existed, but did nothing to make it easier to work with the analytical data. A significant amount of the data existed only in paper reports or in a confusing variety of digital formats. There was clearly a need to standardize the delivery of raw data to clients. This implied a need to store the data in a standardized format in a corporate database.
Beginning in the late 1990s, a comprehensive geochemical database model was designed and tested, as part of the Canadian Geoscience Knowledge Network (CGKN) initiative (Grunsky & Broome 2001). There has been a continuous effort since that time to improve the model, and populate the physical implementations of it. The remainder of this paper documents the challenges that have been encountered in this activity.
What Constitutes a Geochemical Survey?
The basic ‘business process’ (Figure 1) of carrying out a geochemical survey has not changed greatly over the years. Samples are collected in the field. They are brought to a laboratory facility, where they are subjected to ‘preparation’ prior to analysis (for example, crushing, sieving, drying, filtering). The prepared samples are then sent to different laboratories for chemical analysis. When the data are returned, they are verified, compiled and published. Although this basic process is recognizable by any geochemist, there are countless deviations that may occur in individual surveys. Some of the variations that may be encountered are listed below.
Multiple samples may be collected from a single site, and they may span multiple sample types (e.g. glacial sediment, soil, water, vegetation).
Analytical measurements may be made directly in the field. Measurements of pH are a common example. Sometimes the measurement may not involve a physical sample (for example, measuring radioactivity, or measuring mercury fluxes).
Sample preparation may happen in the field. For example, water samples may be filtered and acidified; sediment samples may be sieved and dried.
Sample preparation may involve splitting a single sample into numerous parts. Sieving of till and soil into different size fractions is a common example. An extreme example is the extraction of indicator minerals from a sample, and subsequent electron microprobe analysis of each crystal or grain.
Surveys may cover extremely irregular areas, and be spatially discontinuous. Sampling density may be very uneven.
Surveys may span multiple field seasons. Samples from different field seasons may be analysed by different laboratories.
Samples may be archived, and then re-analysed at a later date. The GSC maintains archives of samples going back over 50 years, and re-analyses of these old samples are continually being carried out.
Typically when geochemical data are published, they are reported in a spreadsheet format, with a set of columns which can be generalized as shown in Table 1. Depending on the complexity of the data, the published report may contain one or more spreadsheets. Ideally the report contains text that explains exactly what each column in the spreadsheet represents (i.e. metadata). However, the depth of information in these text descriptions is extremely variable from one report to another. In extreme cases, there is no text description, and what the column represents must be inferred from its name. For example, a column labelled Zn may reasonably be assumed to contain zinc analyses, but without more information in the text (or column header), the digestion, analytical method and measurement units are unknown. If the analysis involved acid digestion, it is critical to know whether it was a total or partial digestion. At the other extreme, the detail can sometimes be overwhelming for anybody other than analytical chemists. For an example, see the description of DC-arc emission spectroscopy by the British Geological Survey (1992). Excessive metadata in the original publication does not cause any problems for subsequent data processing, and scientists should be encouraged to record as much metadata as possible. On the other hand, missing metadata in legacy reports create severe problems, and can severely limit the usefulness of the data.
Table 1. Generalized format of a spreadsheet of geochemical data
The variability in how geochemical surveys are reported causes problems for authors of the reports, end-users of the data, and people tasked with archiving the data in corporate ‘data repositories’. Table 2 summarizes some of the challenges of working with geochemical data, and highlights the question: How can things be improved? The obvious response is: ‘Store the data in a well-designed database’, but this immediately leads to questions of database design. Geochemical surveys have many things in common, but also many things that differentiate them, some of which are listed above. The challenge is to design a single database that will accommodate these differences, without becoming excessively complicated. Efforts to design such a database began at the GSC in the late 1990s (as part of the CGKN initiative), building on earlier experiences with the NGR data, described above. A basic design was quickly established, but continues to be modified as new challenges arise. The core tables in the database are shown in Figure 2 (simplified from Adcock et al. 2008).
Table 2. Common issues in the recording of geochemical data
Figure 2. Geochemical database core tables.
Comprehensive documentation of the data model, including an implementation in Microsoft Access, is available in Adcock (2009b). Adcock et al. (2003) provide a short summary of the most important features. The current version of the database is referred to informally as CDoGS – the Canadian Database of Geochemical Surveys.
The usefulness of the database depends on how successful it is in two respects: firstly, safely archiving all of the GSC’s geochemical data; and, secondly, delivering data to end-users efficiently and easily. Both challenges necessitate maintenance procedures that are robust, easy to understand, and easy to extend.
Computer Technology – A Moving Target
Computing technology evolves extremely rapidly. Large government institutions do not. Implementing state-of-the-art information management systems within government is therefore challenging. With a small team and a limited budget, a system may take several years to implement, by which time the technology has evolved to enable alternative approaches that could have been implemented far more quickly, even by the same small team with the same limited budget. Two examples of this situation are given below.
(1) Computer memory limitations
The earliest PCs were limited to 640 K of memory. An even more severe limitation in many programming languages was a 64 K memory limit on the space that could be occupied by character strings. This limit made it very difficult to work with large text files. Currently, with a 64 bit operating system, large amounts of memory, and software to take advantage of it, extremely large text files (greater than 1 GB) can now be processed easily.
(2) WWW mapping
End-users of geochemical data want an overview of the contents of a particular data set before they invest a lot of time importing the data into an application (GIS, statistics, etc.). It is beneficial to offer end-users a basic map of sample locations over the Internet. Until recently, this capability was time-consuming and expensive to develop. The recent emergence of KML as a standard for geographical data visualization has made it easy to provide this capability.
The role of relational databases
The volume of data that is generated by a modern geochemical survey is large, and would have been overwhelming a few decades ago. A typical till survey may involve 1000 samples, separated into two size-fractions, and each fraction submitted for ICP-MS/AES (~50 elements) and INAA (~ 35 elements) analysis. Consequently, there will be 1000 × 2 × (50+35) = 170 000 analytical results. Most scientists manage their data using spreadsheet software, but this is not an ideal approach. It is too easy to accidentally corrupt the data without realizing it. A preferred approach is to use database software, but this requires some forethought as to how the database should be designed. This, in turn, requires a basic knowledge of database design principles. As a result, many scientists fall back on the spreadsheet approach.
There are several different database technologies available, but relational databases are overwhelmingly dominant in the marketplace. It is beyond the scope of this paper to describe the principles of relational database design. There are many excellent textbooks and websites that cover this topic and Wikipedia is a good starting point. Even the simplest relational database cannot be designed without a basic knowledge of concepts such as: (a) primary and foreign keys; (b) one-to-many and many-to-many relationships; and, (c) normalization (eliminating redundant data). Designing a database that is capable of storing data from many different geochemical surveys is a challenge, and the design can never be regarded as final. Sampling, preparation and analytical techniques continue to evolve, which may trigger changes in the design to accommodate new data.
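As a concrete illustration of these concepts, consider a minimal sketch of two related tables; the names are hypothetical and greatly simplified, and do not represent the actual CDoGS schema:

-- One-to-many relationship: one field sample has many analytical results.
CREATE TABLE sample (
    sample_id     INTEGER PRIMARY KEY,           -- primary key
    field_number  VARCHAR(30) NOT NULL,
    latitude      DECIMAL(9,6),
    longitude     DECIMAL(10,6)
);

CREATE TABLE analysis (
    analysis_id   INTEGER PRIMARY KEY,
    sample_id     INTEGER NOT NULL
                  REFERENCES sample(sample_id),  -- foreign key back to the sample
    analyte       VARCHAR(10) NOT NULL,          -- e.g. 'Zn'
    method_code   VARCHAR(20) NOT NULL,          -- e.g. aqua regia / ICP-MS
    result_value  DECIMAL(12,4),
    unit          VARCHAR(10)                    -- e.g. 'ppm'
);

Normalization means that details such as the full definition of each analytical method are stored once, in their own table, and referenced via method_code, rather than being repeated for every result.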
Corporate archival systems must be built with a focus on long-term stability and the skill-sets of future employees must be anticipated. If the group of people tasked with maintaining the system is small, great care must be taken to facilitate knowledge transfer in the event of staff turnover. These considerations have led to the implementation of the following maintenance procedures for the CDoGS geochemical database.
The database is designed in a manner that allows it to be implemented in different RDBMS software. It has been implemented in different versions of MS Access, SQL Server, Oracle and Ingres. It avoids any reliance on stored procedures within the database, which are not portable across different software packages. For example, Oracle stored procedures are written in the PL/SQL programming language, SQL Server uses the T-SQL language and MS Access uses VBA.
There is no monolithic application that is used to maintain the database. Instead, there are a few dozen short, simple command procedures. The procedures are written for the MS Windows operating system, but could easily be rewritten for a Unix-based system.
XML is used extensively for getting data in and out of the database. The format of the XML files is rigidly constrained by XML schemas. Manipulation of XML files is carried out by XSLT.
SQL is used to manipulate data within the database. Care is taken to minimize the use of RDBMS-specific SQL dialects.
The system uses a mix of software that is either: (a) part of the MS Windows operating system, or freely available from Microsoft (e.g. VBScript, C# compiler); or, (b) open source, free software (e.g. Saxon XSLT processor, Xerces XML validator).
These design decisions have led to a system that is relatively easy to maintain and easy to understand. Occasionally, it is necessary to develop more complex procedures to handle particular aspects of the data management. Where possible, these procedures have been implemented using VBScript programming. If VBScript is inadequate, then programs have been written in C#. The entire system is tied to the MS Windows operating system, but the highly modular design ensures that individual procedures can be converted easily to other operating systems.
The modular design exists at two different scales. At the micro scale, there are the individual command procedures. At the macro scale, there are four modules: (1) database loading; (2) internal database manipulation; (3) database export; and, (4) WWW content generation. Each of these four modules has its own set of challenges, which are discussed below.
Database Loading
Three aspects of populating the database will be addressed: (1) cataloguing the geochemical surveys; (2) populating with legacy geochemical data; and, (3) integrating with a Laboratory Information Management System (LIMS).
Cataloguing
Historically, the GSC has not maintained a central catalogue of geochemical surveys, so the building of this catalogue started from scratch. Identifying surveys to be included in the database is straightforward if the work has been published, but involves a considerable ongoing effort to search the literature and communicate with past and present staff to locate unpublished material. The situation is further complicated by the incorporation of non-GSC surveys (Provincial/Territorial, university, industry, other federal agencies) because they are commonly harder to identify, and the published reports can be difficult to find. Once a survey is identified, basic metadata are created (e.g. geographical extent, sample media, date, principal investigator). Associated publications are catalogued and linked. Relationships to other surveys are established, based on various attributes (funding source, sample media, geographical location, etc.).
Legacy data loading
The development of a straightforward, efficient procedure for loading legacy data into the database was a major challenge. Legacy data for a survey are typically spread across several files, and may be in a variety of sometimes obsolete data formats (e.g. Lotus 1-2-3, MS Excel®, WordPerfect®, Paradox, dBase, CSV, 80-column Hollerith card images, etc.). A multi-stage approach has been found to work well, as set out below.
(1) Capture the metadata that describe the analytical methodologies.
(2) Archive the original data files in their original formats.
(3) Import the data ‘as-is’ into temporary tables in MS Access, without applying any corrections to the original data.
(4) Within MS Access, correct the imported data via SQL statements, which are stored in a SQL script file. For example, ensure that the sample IDs are written in a consistent way across all of the temporary tables, so that records in different tables can be joined by SQL statements. MS Access lacks the capability to execute SQL scripts, so a simple VBScript program was written to accomplish this.
(5) Within MS Access, copy the data from the temporary tables into a pre-defined table with columns that correspond to those in a spreadsheet of data from a single analytical laboratory. The spreadsheet is a highly de-normalized view of the data.
(6) Normalize the data: take the data from the pre-defined spreadsheet table and load them into fully normalized tables.
The mechanics of step 6 are identical for all surveys; the effort is concentrated in steps 4 and 5. The SQL statements that must be created for these steps are invariably unique to each survey, but there is a common pattern to manipulating the raw data, so the SQL scripts for different surveys have broad similarities to one another.
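By way of illustration, the corrections applied in step 4 for a single survey might resemble the following short script; the table and column names are invented for this example, and string-function syntax varies slightly between SQL dialects.

-- Standardize sample IDs so that the temporary tables can be joined reliably.
UPDATE tmp_icpms_results
   SET sample_id = UPPER(REPLACE(sample_id, ' ', ''));   -- '93 pma 0012' becomes '93PMA0012'

UPDATE tmp_field_observations
   SET sample_id = UPPER(REPLACE(sample_id, ' ', ''));

-- Flag duplicate IDs for manual review before any further loading.
SELECT sample_id, COUNT(*) AS n
  FROM tmp_icpms_results
 GROUP BY sample_id
HAVING COUNT(*) > 1;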
A key advantage of this approach is that the data manipulation is completely reproducible, and can be fully documented within the SQL scripts. This allows surveys to be easily and quickly re-loaded into the database if new data or corrections come to light. It also allows subsequent researchers to understand how and why the original source data were modified.
Prototyping is done with lightweight desktop database software (MS Access), rather than with robust high-end RDBMS software (Oracle, SQL Server, etc.), because the lightweight software is far more user-friendly, and allows much faster creation and debugging of SQL statements.
LIMS integration
The primary function of a LIMS is the management of samples. The first step in sample management is typically initiated when a sample is received in the laboratory and then registered in the LIMS. This registration process may involve producing a barcode to label the sample container. Various other parameters will usually be recorded, including sample type and origin. The LIMS will then track the migration of the sample through the laboratory.
A primary goal of a LIMS is to simplify data management, and to minimize the possibilities of data loss or corruption. The GSC has an ongoing initiative to streamline the data flow within its laboratories. The goal is to have the geochemical database tightly integrated with any LIMSs that are developed. Much work remains to be done in developing the LIMS before this integration can be achieved.
Internal Database Manipulation
The database is fully normalized, which means that the data are organized so that redundancies are minimized, i.e. the same information is not stored in more than one column or table. A fully normalized database eliminates many possibilities for data corruption and ensures that any data query or manipulation can be easily implemented via SQL. However, the normalized design does not correspond to the spreadsheet view of geochemical data that geochemists are used to seeing, and expect to see. Transforming from the normalized design to a spreadsheet structure (creating a pivot table, in computer jargon) is computationally expensive. It is impractical to do this transformation in ‘real time’ in response to end-user queries. Therefore, the data are converted to a set of ‘derived’ tables via SQL scripts that are executed whenever new data are loaded into the database. This approach is viable because raw data are added only once every few weeks or months, in large batches. The de-normalizing that is involved in creating a spreadsheet view leads to a structure that is very awkward to work with via SQL, but produces tables that are very easy for end-users to understand and work with.
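A minimal sketch of such a derived ‘spreadsheet’ table, using hypothetical names and a conditional-aggregation pattern (the exact CREATE TABLE ... AS syntax differs slightly between RDBMSs), is:

-- One row per sample, one column per analyte: the familiar spreadsheet view.
CREATE TABLE derived_results AS
SELECT s.sample_id,
       s.field_number,
       MAX(CASE WHEN a.analyte = 'Zn' THEN a.result_value END) AS zn_ppm,
       MAX(CASE WHEN a.analyte = 'Cu' THEN a.result_value END) AS cu_ppm,
       MAX(CASE WHEN a.analyte = 'Au' THEN a.result_value END) AS au_ppb
  FROM sample s
  JOIN analysis a ON a.sample_id = s.sample_id
 GROUP BY s.sample_id, s.field_number;

In practice one such column is generated for every analytical method in the relevant package, which is why the transformation is scripted rather than written by hand.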
In addition to de-normalization design considerations that may apply to any large database, there are factors that are quite specific to geochemical databases: qualified numbers; multiple analytical methods; geographical coordinates; and, geochemical terminology and keyword searches.
Qualified numbers
The analytical data reported by laboratories are ‘qualified numbers’: they are not always simple numeric values, because results are sometimes reported as ‘below detection’ or ‘above detection’. They may also be reported as ‘missing’ because of analytical problems, commonly due to insufficient sample weight or interference. A good geochemical database must be able to handle these different possibilities.
The approach implemented here is to store each analytical value as two fields. The first field is a true decimal number, reflecting the reported analytical value. The second field is an integer ‘lookup’ code, which qualifies the value in the decimal field. If the analytical value is a routine analytical measurement, the lookup field is set to zero. If the analytical value is ‘below detection’, the decimal value is the detection limit and the lookup field is set to 1. If the analytical value is ‘above detection’, the decimal value is the upper detection limit and the lookup field is set to 2. If the value is ‘insufficient sample’, the decimal field is left empty (null), and the lookup field is set to 4. The set of lookup values can grow as new possibilities arise.
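A minimal sketch of this storage scheme, with hypothetical names, is:

-- Lookup table of qualifier codes.
CREATE TABLE qualifier_code (
    code        INTEGER PRIMARY KEY,
    description VARCHAR(40)
);
INSERT INTO qualifier_code VALUES (0, 'routine measurement');
INSERT INTO qualifier_code VALUES (1, 'below detection limit');
INSERT INTO qualifier_code VALUES (2, 'above detection limit');
INSERT INTO qualifier_code VALUES (4, 'insufficient sample');

-- Each analytical result then carries two fields:
--   result_value  DECIMAL(12,4)   -- the reported value, or the detection limit
--   qualifier     INTEGER REFERENCES qualifier_code(code)
-- so '<2 ppm' is stored as (2, 1), a routine 37 ppm as (37, 0),
-- and 'insufficient sample' as (NULL, 4).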
When the data are exported as spreadsheets, these ‘qualified numbers’ must be exported as single fields. There is no ideal way to do this. End-users will have different preferences depending on what they intend to do with the data. Therefore, the GSC CDoGS geochemical data website (described below) provides Excel® spreadsheets in three different formats:
text (<2 reported as <2);
half detection (<2 reported as 1);
below detection replaced by negative (<2 reported as -2).
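A sketch of how the three representations can be derived from the two-field storage follows; the names are hypothetical, and string-concatenation syntax varies between SQL dialects.

-- One qualified number, exported three ways.
SELECT sample_id,
       CASE qualifier
            WHEN 1 THEN '<' || CAST(result_value AS VARCHAR(20))
            WHEN 2 THEN '>' || CAST(result_value AS VARCHAR(20))
            ELSE CAST(result_value AS VARCHAR(20))
       END AS zn_text,              -- '<2' stays '<2'
       CASE qualifier
            WHEN 1 THEN result_value / 2
            ELSE result_value
       END AS zn_half_detection,    -- '<2' becomes 1
       CASE qualifier
            WHEN 1 THEN -result_value
            ELSE result_value
       END AS zn_negative           -- '<2' becomes -2
  FROM analysis
 WHERE analyte = 'Zn';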
It should be noted that the concept of ‘qualified numbers’ in geochemical data storage goes back many decades. The United States Geological Survey Rock Analysis Storage System (RASS) employed this concept using the Geologic Retrieval and Synopsis Program (GRASP) file management system (Bowen & Botbol 1975). At the GSC, R.G. Garrett’s Interactive Data Exploration and Analysis System (IDEAS) software implemented a version of GRASP modified to use a keyed ISAM file structure (Garrett 1988). The data analysis and interpretation problems that arise from these qualified numbers are covered in detail by Helsel (2005).
Coping with the multitude of analytical methods
Analytical methods employed by laboratories are continuously evolving. This leads to a challenge in geochemical database design: how should minor differences in analytical methods be managed? Should two methods that are almost identical be generalized to a single method? For the GSC database, the decision was made to preserve every minor difference that was reported in the original publications. The alternative would have required making arbitrary judgements as to whether or not methods were so similar as to be effectively identical.
Ultimately, the design decision to preserve all of the analytical variations has led to a huge number of different methods in the database, which are very confusing for end-users to cope with. Measures have been taken to simplify the end-user’s standard view of the data. The expertise of analytical chemists and practising geochemists is used to create views of the data so that different data sets, obtained by slightly different analytical methods, can be legitimately combined for the end-user.
Within the database, analytical methods are defined by several criteria:
analyte (e.g. Zn, pH, dissolved oxygen, LOI);
analytical decomposition (e.g. aqua regia, fusion);
analytical technique (e.g. ICP-MS, AAS, INAA);
lower and upper detection limits;
instrument manufacturer.
The different analytical methods are grouped into ‘suites’ and ‘packages’, to simplify their management. A suite of methods corresponds to data obtained by a single analytical procedure, so that within a suite, all the elements being reported have gone through the same steps in the lab (decomposition, instrument measuring the concentrations). For example, an INAA suite consists of up to 35 analytical methods, corresponding to the 35 elements analysed. A package is a collection of suites that is offered by a laboratory. For example, a laboratory may offer a package that is a combination of: (a) a suite of eight elements analysed by flame AAS; (b) tungsten analysed by colorimetry; and, (c) mercury analysed by CV-AAS. The three suites together constitute a single package, which is specific to a particular laboratory. A laboratory may offer several different suites, which slowly change from year to year. The different suites can be combined into a variety of different packages. The concept of an analytical package is very useful for managing the data-loading process, and also for the delivery of analytical data as spreadsheets to end-users.
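The relationships between methods, suites and packages can be sketched in simplified form (hypothetical names, not the CDoGS schema) as:

CREATE TABLE suite (
    suite_id      INTEGER PRIMARY KEY,
    description   VARCHAR(80)          -- e.g. 'INAA, 35 elements'
);

CREATE TABLE analytical_method (
    method_id     INTEGER PRIMARY KEY,
    suite_id      INTEGER REFERENCES suite(suite_id),
    analyte       VARCHAR(20),         -- e.g. 'Zn', 'pH', 'LOI'
    decomposition VARCHAR(40),         -- e.g. 'aqua regia', 'fusion'
    technique     VARCHAR(40),         -- e.g. 'ICP-MS', 'INAA'
    lower_dl      DECIMAL(12,4),
    upper_dl      DECIMAL(12,4)
);

CREATE TABLE package (
    package_id    INTEGER PRIMARY KEY,
    laboratory    VARCHAR(80)
);

-- Many-to-many: the same suite may be offered as part of several packages.
CREATE TABLE package_suite (
    package_id    INTEGER REFERENCES package(package_id),
    suite_id      INTEGER REFERENCES suite(suite_id),
    PRIMARY KEY (package_id, suite_id)
);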
Geographical coordinates
Geographical coordinates can be expressed either as latitude/longitude pairs, or as projected coordinates (Van Sickle 2010). In Canada, the most common projection is UTM (Universal Transverse Mercator). In either case, the datum must also be specified in order to fully define the geographical location. For most of the twentieth century, virtually all of the maps in Canada were referenced to the North American Datum of 1927 (NAD27). The situation became more complicated in the 1970s, as a worldwide effort to move to a new datum progressed. This culminated in North America with the adoption of the NAD83 datum. The 1980s and 1990s were a time of transition from NAD27 to NAD83.
Datum terminology is confusing, and there are references to several other datums in the literature. Mugnier (1997) gives a concise summary of the Canadian situation. Often, the differences between these datums are insignificant (less than one metre) for geochemical surveys. The difference between NAD27 and NAD83 locations is significant (up to several hundred metres), and varies in a complex manner across Canada (Henderson et al. 1989; Pinch 1990). Transformation from one datum to the other can be accomplished by various methods, but the preferred method in Canada is to use the NTv2 routine, developed by the Geodetic Survey of Canada (Junkins & Erickson 1996). Transformations by other methods can differ from the NTv2 transformation by several metres. Implementing the NTv2 routine is not trivial. Commercial software that implements NTv2 can be found online at http://www.geod.nrcan.gc.ca/tools-outils/ntv2_e.php (accessed July 17, 2012).
When NAD83 was initially established, it was, for all practical purposes, identical to WGS84, the datum that is used by GPS satellites. NAD83 and WGS84 have both evolved since then (Craymer 2006). NAD83 is now more precise, whilst WGS84 has shifted geospatially. The end result is that the two coordinate systems now differ by about 1.5 metres. This difference is considerably less than the typical measurement error associated with geochemical sample locations, but it is a potential source of confusion in datum transformation computations.
KML files, which are commonly used in Earth browsers, require that geographical coordinates are expressed as latitude and longitude, using the WGS84 datum.
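For reference, a minimal KML file describing a single sample site looks like the example below (the identifiers and values are invented for illustration); note that KML writes each coordinate pair as longitude,latitude in WGS84.

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>93PMA0012</name>
      <description>Till sample; Zn (aqua regia/ICP-MS) = 48 ppm</description>
      <Point>
        <coordinates>-75.70194,45.40306</coordinates>
      </Point>
    </Placemark>
  </Document>
</kml>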
The CDoGS database stores geographical coordinates in four different systems: (a) lat/long NAD27; (b) UTM NAD27; (c) lat/long NAD83; and, (d) UTM NAD83. The coordinates are all precisely equivalent, based on the NTv2 transformation. Maintaining the four sets of coordinates provides a lot of flexibility in accommodating the needs of end-users. It also allows the legacy data to be preserved ‘as-is’, to facilitate cross-referencing back to the original publications.
Geochemical terminology and keyword searches
It is extremely important that the database can handle queries such as ‘select all of the INAA gold data for the <2 µm size fraction of till samples’. This requires that the database includes definitions of ‘gold’, ‘INAA’, ‘till’ and ‘<2 µm size fraction’. Furthermore, these terms must be used consistently across all of the surveys in the database. Achieving this consistency is difficult but is made easier by creating a list of keywords in the database.
The database model strives to maintain the terminology and classification of the scientist responsible for the survey. This can be difficult because geochemical terminology is continuously evolving and it is not unusual for common terms from earlier decades to be replaced with new ones. In addition some terms may have multiple meanings, depending on their context and scientists have individual preferences. To minimize these issues, the terminology is referenced to standardized classification schemes as much as possible.
Scientific terminology within the database is separated into five categories, listed in Table 3. The geological keywords are a subset of keywords from the Georef thesaurus (Goodman 2008), with a few additions where there was no equivalent term. Definitions of these keywords are taken from the Glossary of Geology (Neuendorf et al. 2005) wherever possible. For three of the categories (analysed material, analytical methods, decomposition techniques), hierarchical classification schemes are being developed. For example, the hierarchy for ‘basal till’ is ‘Sample >> Lithosphere >> Unconsolidated >> Terrestrial >> Glacial >> Till >> Basal till’. A key advantage of a hierarchical classification is that it makes it easy to retrieve all of the data below a particular node. For example, a query for ‘till’ samples can be designed to also retrieve data for subcategories of till, such as basal, ablation, deformation, and so on.
Table 3. Keyword categories
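To illustrate the advantage of the hierarchical approach, if each keyword record carries its full hierarchical path as a text string (a simplified, hypothetical representation), then ‘till’ and all of its subcategories can be retrieved with a single prefix match:

-- Retrieve the keyword 'Till' and everything below it in the hierarchy.
SELECT keyword_id, term
  FROM keyword
 WHERE hierarchy_path LIKE
       'Sample >> Lithosphere >> Unconsolidated >> Terrestrial >> Glacial >> Till%';

The resulting keyword set can then be joined to the sample and analytical tables to answer compound queries such as the INAA gold example given above.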
Database Export
In theory, end-users could be given read-only access to the database via a WWW interface and HTML pages could be built dynamically in response to the users’ queries. This approach was rejected for two main reasons: (1) it would be computationally expensive, and response times would be unacceptably slow; and, (2) it would require a complex application to build the web pages, which would be difficult and expensive to maintain. For these reasons, the decision was made to present the database to the end-user as a set of static files (HTML pages, KML files and Excel® spreadsheets), which would accommodate the needs of the large majority of end-users. It is expected that technological advances will lead to new approaches in the future.
The contents of the database metadata tables can be easily and efficiently exported as XML files, with each file corresponding to a single table in the database. This is not a practical approach for the tables that hold the raw data (analytical values, sample locations, etc.), because the resultant XML files would be enormous and extremely difficult to work with. Therefore, these raw data tables are exported to multiple files, based on the survey or the analytical sample bundle.
In order to simplify working with the exported XML data, the individual files are combined (via XSLT) into a few XML files. This final set of files conforms to an XML schema, which tightly controls their structure and content. Subsequent transformation into XHTML files (again via XSLT) is straightforward.
The most useful way to deliver raw analytical data to end-users is by using Excel spreadsheets. Exporting Excel files programmatically from a database is a trivial task using MS Windows VBScript COM programming (assuming that MS Excel® is installed on the computer). Implementing this capability under other operating systems would be more difficult.
WWW Content Generation
It is assumed that end-users will access the contents of the database via the WWW. A basic website to facilitate this access was one of the first tangible products of this initiative (home page: http://gdr.nrcan.gc.ca/geochem). It was created as part of a broader GSC initiative to present geoscience data sets online, via a single URL – the Geoscience Data Repository (GDR). Since its creation, the GDR/geochem website has continued to evolve. Procedures have been created that enable the contents of the entire website to be generated automatically from the exported XML files. These procedures are complicated by the requirement that official Government of Canada websites must conform to ‘common look and feel’ (CLF) rules, as well as be in both English and French. In addition, the intent is to have the website compatible with multiple browsers, but this can be very challenging to achieve, especially because of the CLF constraints. At a minimum, the website attempts to be compatible with both Internet Explorer (v7 upwards) and Firefox. The CLF specification is very hard to work with if the client computer is configured for something other than the standard Windows 96 DPI setting.
The web pages are XHTML-compliant, and use the Unicode UTF-8 character set. Adopting these two standards simplifies database maintenance procedures.
When the website was first created, one of the highlights was a geospatial query interface, based on University of Minnesota Mapserver software (Mapserver Team 2011). This interface continues to be maintained and enhanced, but it is based on a complex codebase of PHP on the server, and JavaScript on the client computer. Maintenance is difficult and time-consuming.
The goal of the GDR/geochem website is to enable the discovery and downloading of geochemical data. It is not intended to provide any data interpretation. As part of the discovery role, it is important to provide a basic method of viewing the data geospatially. The introduction of Google Earth™ in 2005 revolutionized the dissemination of geospatial data via the WWW. End-users now had access to an intuitive map viewer, with fast, free, high-quality satellite imagery. This freed developers from the burden of providing base maps for their geospatial data. Google Earth™ also introduced the KML data format for visualization of geospatial data. This XML-based format is both powerful and simple, and therefore favoured from a programmer’s perspective. KML was quickly incorporated into the website as a way of allowing end-users to get a quick geospatial view of the data (Adcock et al. 2009b).
An unfortunate consequence of the complexity of geochemical data is that the website itself has become complex. Scientists who have practical experience in conducting geochemical surveys should understand this complexity, and have a basic sense of how to navigate through the website. Others will find it more challenging. A user manual that describes how to navigate around the website was created in 2009 (Spirito & Adcock 2009b), and will be updated to reflect the recent changes to the site.
A long-term goal is to deliver geochemical data in a format that can be immediately integrated with other geospatial data taken from other websites, to create multi-layered maps. The Open Geospatial Consortium (OGC) is working towards this goal and has established several standards (KML being one) that have been implemented in many organizations. Many of these standards require a ‘critical mass’ of websites to adopt them, before they are truly useful for end-users. They also require considerable resources to set up and maintain. As a result, the implementation rate of these standards has been slow. GeoNetwork (GeoNetwork 2011) represents an evolution of these technologies towards easier installation and maintenance. Until the situation evolves further, the most reasonable approach for an organization with limited resources is to adopt a ‘wait and see’ attitude.
Advertising the Data
The number of potential users of geochemical data is very large. Only a small fraction of those users know that the GSC holds a large amount of data that is freely available. Furthermore, many potential users would be interested in geochemical data only as one of many data sets to be integrated into a multi-disciplinary study. If the data are not easily accessible and usable, they will be ignored.
Advertising geochemical data is difficult because many potential users are unaware of the data’s existence and usefulness. Other geospatial data sets face the same problem. Metadata initiatives that have been pursued over the last 10 to 20 years attempted to address this problem, but have had limited success to date. Two initiatives that have been followed at the GSC are the Canadian Geospatial Data Initiative (CGDI) and the Canadian Geoscience Knowledge Network (CGKN). The relevant URLs (accessed July 17, 2012) are:
http://geodiscover.cgdi.ca/wes/RecordSummaryPage.do?uuid=93DD61EC-7046-F089-8538-4CBCC6588ADB&recordLocale=en_US&view=summary&entryPoint=jsMap&mode=unmappable (will take you directly to the ‘Canadian Geochemical Surveys’ page);
http://cgkn.net/cur/datacat/index.php (select ‘GSC - Canadian Geochemical Surveys’).
The underlying technology of CGDI and CGKN uses the Z39.50 protocol. The metadata that are presented by the portals conform to the CSDGM v2 metadata standard (FGDC 1998). This standard was designed specifically for geospatial data and does not address any of the special characteristics of geochemical data. The two portals have had limited success in advertising the GSC’s geochemical data. Generic WWW search engines have proven far more effective.
The International Organization for Standardization (ISO) has recently published recommendations for geospatial metadata. ISO/TS 19139 (Technical Committee ISO/TC 211 2007) is particularly relevant for delivering geospatial metadata in a standardized XML format. There are mechanisms within the standard for extending it to handle specialized categories of data. As metadata standards evolve away from Z39.50 towards HTTP and XML, it should be possible to build more flexible and user-friendly data discovery portals.
Long-term Survival of the Data
The long-term survival of scientific data in government institutions is a significant challenge worldwide, exacerbated by the switch from paper to digital publication. Archiving procedures in research libraries have evolved gradually over centuries; they now need to be radically redesigned. However, the problems are hard to document and quantify: accounts of data loss tend to be anecdotal, sometimes apocryphal, and difficult to verify. Obvious issues to consider include:
long-term survival of the physical media – diskettes, CD-ROMs, DVDs, all have finite lifespans, which are generally much shorter than printed documents;
proprietary data formats (Oracle databases, Excel spreadsheets, etc.) that are extremely difficult to decode without the associated proprietary software;
proliferation of slightly different versions of the same data, with no information as to what changes were made or why.
Government scientific websites have been difficult to maintain, as departments and agencies continuously re-organize and the rules and regulations governing web publishing rapidly evolve. Large scientific relational databases are a fairly recent development, and their ability to survive re-organization and staffing changes is uncertain.
At the GSC, the large amount of legacy data that is being used to populate the geochemical database has survived only because of a fortuitous series of staffing actions and scientific staff decisions. An equally large amount of data has probably been lost, and another large amount is available only as hard-to-read paper copies because the digital data have been lost over the years.
In this environment, it is advisable to have backup strategies in place for ensuring the long-term survival of the data. The GSC is fortunate in having an ‘Open File’ publication mechanism, that officially releases large data sets on DVDs, and these can be subsequently archived in other institutions. This greatly increases the probability that the data will survive. The geochemical database is too large to release on a single DVD, but it can be repackaged into smaller data sets, and each one released as a stand-alone Open File (cf. Adcock 2009a; Adcock et al. 2009a; Spirito & Adcock 2009a). As experience with large scientific databases and websites is gained, alternative approaches may emerge.
Conclusion
The challenges of managing geochemical data are steadily evolving in response to: (1) an increasing volume of data; (2) a rapidly evolving computing environment; and (3) the changing expectations of end-users.
The Canadian Database of Geochemical Surveys (CDoGS) is the GSC’s main database and WWW portal for geochemical data. It is continuously evolving in response to the various issues outlined above. Mirrored versions of the database are currently being maintained in SQL Server 2008, Oracle 10 and Ingres v9. The database contains comprehensive metadata for over 1200 surveys, which amount to several hundred thousand field samples. Raw analytical data have been loaded for about 90 of these surveys, which amount to over 50 000 field samples, including 20 000 from the NGR Programme. The content of the database is still incomplete and there are some major gaps to be filled, most notably, lithogeochemical surveys.
The contents of the database are accessible to end-users via the WWW portal, http://geochem.nrcan.gc.ca (accessed June 14, 2013), as:
HTML pages, which summarize the metadata for each of the 1200+ surveys;
Excel® spreadsheet files, which deliver the analytical data in a standardized format for 90 surveys;
KML files, which allow a quick visualization of the analytical data for 90 surveys.
CDoGS contains only a small fraction of the geochemical data that have been published by the GSC, but it has been carefully designed and tested. As geochemical data continue to evolve, the database should be able to accommodate the changes. The amount of data in the database should grow rapidly over the next few years.
Acknowledgments
This paper is a contribution to the GSC’s Geo-Mapping for Energy and Minerals (GEM) Programme (2008–2013). The manuscript benefitted greatly from the comments of S. Smith from the USGS. This paper is published as Natural Resources Canada contribution ESS 20120105.
Appendix I: Computer acronyms
This appendix provides very brief explanations for the computer acronyms that appear in the paper. The explanations are derived primarily from Wikipedia, which is an excellent starting point for further reading.
CSV A data file format where the data are stored as plain text, separated by commas.
CSDGM Content Standard for Digital Geospatial Metadata – a metadata standard maintained by the Federal Geographic Data Committee (FGDC) in the USA.
DPI Dots Per Inch. The combination of a high-resolution monitor and the standard setting of 96 DPI leads to tiny text. The usual solution is to increase the DPI setting to 120 or 144 DPI.
HTML HyperText Markup Language is the markup language that is used to display information in web browsers.
HTTP HyperText Transfer Protocol is one of several high-level protocols used to transfer data across the Internet. It is the protocol used to transfer HTML documents. It exists alongside other high-level protocols such as FTP (File Transfer Protocol) and Z39.50.
ISAM Indexed Sequential Access Method. A system that permits a file to be read sequentially, or records to be selectively retrieved on the basis of pointers to individual records.
KML Keyhole Markup Language is an XML schema for expressing geographical annotation and visualization within Internet-based, two-dimensional maps and three-dimensional Earth browsers (e.g. Google Earth™).
PHP PHP is a general-purpose scripting language designed for web development, to produce dynamic web pages.
PL/SQL A procedural programming language extension to SQL, specific to the Oracle RDBMS.
RDBMS Relational Database Management System.
SQL Structured Query Language is a database computer language designed for managing data in relational database management systems (RDBMS). The language has a syntax similar to English sentences, such that the meaning of simple statements is often self-evident even to non-programmers.
T-SQL Transact-SQL is a procedural language extension to SQL, specific to SQL Server.
UTF-8 UTF-8 is an encoding of the Unicode character set. It has become the dominant character encoding for the WWW. Unicode encompasses over 100 000 characters and symbols, covering most of the world’s writing systems, and its repertoire is a superset of the ISO-8859-1 character set.
VBA Visual Basic for Applications – a programming language that is integrated into the MS Office software suite, including MS Access.
WWW The World Wide Web is a system of cross-referenced (hyperlinked) documents distributed across millions of computers, and all connected to one another by the Internet. The documents are predominantly in HTML format, and can be accessed from an end-user’s computer via HTTP.
XHTML eXtensible HyperText Markup Language extends HTML. An XHTML file is also an XML file.
XML Extensible Markup Language is a set of rules for encoding digital data. The resulting documents have a hierarchical structure. The rules are very simple. The contents of XML documents can be constrained by XML Schemas (XSD), and transformed by XSLT (just one of several approaches).
XSD XML Schema Definition is used to express a set of rules to which an XML document must conform. An XSD file is also an XML file. XSD allows the content of an XML file to be very tightly constrained.
XSLT Extensible Stylesheet Language Transformation is a declarative language used for the transformation of XML documents. An XSLT file is also an XML file.
Z39.50 Z39.50 is a client–server protocol for searching and retrieving information from remote computer databases. It is a pre-WWW technology originally designed for library databases and is difficult to work with.
Appendix II: The early development of computing at the Geological Survey of Canada with special reference to applied geochemistry
There is a long history to the application of computers to geoscience at the GSC. The first installation in the former Department of Mines and Technical Surveys was an IBM 1620 in 1962. The main reasons for this acquisition were the needs of the Geodetic Survey of Canada and their requirements for least-squares adjustments of the primary geodetic framework upon which all other topographic map preparation was based. The first reference to the use of computers in the Department was in a paper by two geodesists (Wickens & Jones 1960) using a U.S. Coast and Geodetic Survey IBM 650 that they had installed in 1957. In December 1963 the Director of the GSC, Yves (Y.O.) Fortier, authorised a pilot project to evaluate the applicability of computers to GSC studies. As part of that project, the main participants using the IBM 1620 were:
petrologists, such as Neil (T.N.) Irvine, working on the Muskox Intrusion as part of the International Upper Mantle Project;
geomathematicians led by Frits (F.P.) Agterberg working on petrological and mineral deposit projects;
Bimal (B.K.) Battacharyya working on theoretical potential field geophysical studies; and
Ken (K.R.) Dawson, leading an institutional petrochemical database project, GEODAT.
The advent of data processing made it possible to organize and collate field data through their systematic entry on field cards, based on the 80-column Hollerith format. The field record was then transferred to punched cards. In 1964, a field card was developed for use in the Operation Keno regional stream sediment reconnaissance survey. This initial design was modified, based on Operation Keno experience, and used in the Operation Bathurst stream sediment survey in 1965 (Gleeson & Tupper 1967). Subsequently, the design was further modified and additional cards were designed for rock, till and soil, and lake sediments and waters (Garrett 1974a).
In 1965 the decision was made to move from the IBM 1620 to a CDC 3100 mainframe. This was driven by the increased numerical precision, desirable for geodesy computations, of the CDC machines. It was generally agreed that CDC was building the best ‘number-crunchers’ at that time. The availability of the new machine, installed in the basement of 588 Booth St., brought in more GSC users:
geophysicists studying gravity in the Earth Physics Branch of the Department, and geophysical survey data in the GSC (the latter included Peter (P.H.) McGrath, and later Mike (M.J.) Holroyd); and
geochemists, notably Eion (E.M.) Cameron and Chris (C.F.) Gleeson.
A major supporter of the application of computers at the Geological Survey was S.C. (Binks) Robinson who was Chief of the Petrological Sciences Division (1960–67). When management decided to ‘get serious’ about computing in Canadian geoscience, Robinson was appointed to lead an ad hoc committee on storage and retrieval of geological data. This committee was part of the National Advisory Committee on Research in the Geological Sciences (Robinson 1966, 1970). Robinson was later appointed as chairman of the IUGS Committee on the Storage, Processing and Retrieval of Geological Data, COGEODATA. In October 1964, Bill (W.C.) Krumbein from Northwestern University was invited by Robinson to visit the GSC. Krumbein, recognized as a founding-father of quantitative geology, had discussions with Survey scientists on the potential application of computers and statistics to their projects. In 1965, a Data Processing Unit was added to the Division and it became the ‘home’ for the computer-using petrological, geomathematical and database staff. This group included Ken Dawson, who managed GEODAT, the Geological Survey’s first institutional geoscience database (Dawson 1971). The geophysicists and geochemists investigating computer applications worked directly out of their respective administrative Sections. In 1967 the mineral deposits geologists, mineralogists, geochemists and geomathematicians were gathered together into the Geochemistry, Mineralogy and Economic Geology Division under Robinson’s leadership. From his senior management position, Robinson played an important role in the development of computing services at the GSC. Without his support of computer-related activities in the 1960s, it is likely that computer use at the GSC would have evolved differently from the way that it did.
The role of computers to support studies being carried out in the Geochemistry Section was first investigated by Eion Cameron. With the arrival of the CDC 3100, Cameron employed multivariate statistical analysis procedures in his studies of the geochemistry of sedimentary rocks, particularly those associated with oil and gas pools (Cameron 1967, 1968). Chris Gleeson had commissioned the departmental Computer Science Centre to develop data plotting software for the newly acquired Calcomp 663 drum plotter which was used to plot up the data from the 1965 Keno Hill stream sediment survey (Gleeson 1966; Boyle & Gleeson 1972). These maps were used to check the sample site locations and then passed on to the GSC’s cartographers for map production. In 1967 two new staff members joined the Geochemistry Section, Bob (R.G.) Garrett and a computer programmer, Dave (J.D.) Hobbs.
In the following years, a wide range of mainframe programs was developed and published by Geochemistry Section staff (Garrett 1974a, b; Soukup 1971). The CDC 3100 had its limitations, and after 1968 use was made of the federal Bureau of Management Consultants’ IBM 360 on Wellington Street. Later, commercial off-site computing services provided by Computel Ltd’s Sperry Univac 1108 and System Dimension Ltd’s IBM 360 were both accessed through terminals in various Booth Street buildings.
In April 1972, the use of interactive graphics was first investigated in a pilot project to build a Geochemical Interactive Graphics System (GIGS) for Geochemistry Section staff. GIGS was developed by Computer Science Centre staff under the leadership of Ian (I.K.) Crain and employed the services of Dataline Systems Ltd. and their DEC PDP10/50 in Toronto. Connection was initiated from a Tektronix 4002A graphics terminal at the GSC through a 300 baud link to Dataline’s Ottawa office, and then by a 4800 baud link to Toronto. In addition to the graphics terminal, a Tektronix 4601 hardcopier was available for screen capture of tables and figures. GIGS was developed for use with lithogeochemical data, and included a file manager as well as tools for table generation, and plotting both ternary diagrams and frequency distributions (Crain 1974; Garrett 1974b). The project was terminated in February 1973 when it was decided that the sole source of computing in the Department would be a CDC 6400 mainframe managed by the Computer Science Centre on to which all Departmental computing would be consolidated. The CDC Operating System was designed for batch, rather than interactive, computing. Development of systems to allow interactive graphical display of geochemical data became very difficult. It was a setback that took 11 years to overcome, but the appetite for this approach had been whetted and useful lessons learnt. Despite the setback, a number of stand-alone interactive graphics applications were developed over the next decade. The most successful of these was a projection pursuit method for displaying multidimensional data on a plane (Friedman & Rafsky 1981) that was used for both outlier detection and robust regression investigations (Garrett et al. 1982; Garrett 1983; Howarth & Garrett 1986).
In 1973 and 1974, large regional geochemical centre-lake sediment surveys were undertaken on the island of Newfoundland and in northern Saskatchewan, with joint Federal-Provincial funding. In 1975, the protocols developed for these surveys, themselves based on almost 20 years of GSC regional geochemistry experience, were adopted into the Federal-Provincial Uranium Reconnaissance Programme (URP). The URP ran until 1978. When it terminated, provincial and territorial partners requested that the regional geochemical surveys continue under the auspices of a series of Mineral Development Agreements (MDAs). The concept of a ‘National Geochemical Reconnaissance’ was first published by Cameron (1976). Through the 1980s and 1990s, large quantities of data, all of which were consistent with the NGR protocols, were acquired through MDAs. The NGR programme wound down in the first decade of this century.
The turn-around of the URP survey data as reports and maps, including all analytical QA/QC, required a major effort. Two new staff members joined the Exploration Geochemistry Subdivision of the Resource Geophysics and Geochemistry Division, Dave (D.J.) Ellwood and Greg (N.G.) Lund, and they were assisted by short-term contract staff. All data processing, including sample site-coordinate checking, QA/QC, summary statistics estimation and support for threshold selection was undertaken on the CDC Cyber mainframe, and the Departmental Calcomp drum plotter was used for plotting maps. The CDC SORT/MERGE package was used for data file management of the numerous URP/NGR data files that were of a simple tabular form. The Geochemical Analysis System (GAS) format was used for long-term storage. It included an extensive metadata section with text information on the particular geochemical survey, field and analytical data coding and formats, geochemical analysis methods and references, followed by a section with the actual tabular survey data (Garrett 1974b).
For the period when the URP geochemical maps were plotted as symbols, overlays for the cartographic compilation were plotted directly on to photographic film. This was achieved using a Barr & Stroud PS5 photo-head, with a glass disk on to which the URP/NGR symbology had been etched. The photo-head was mounted on the Surveys and Mapping Branch’s Kongsberg Kingmatic flat-bed plotter (Harris 1972) that read magnetic tapes prepared on the Cyber mainframe. The tapes contained sample coordinates and the index of the particular symbol to be exposed at that location. This opto-electronic procedure was fast and effective, helping speed up map production.
A key development at this time was the QA/QC procedure involving an application of unbalanced analysis of variance designs developed under contract with Hickling-Smith Inc., notably Tom (T.I.) Goss (Garrett & Goss 1979, 1980). At peak effort, the URP was releasing data as Open File reports and maps within 12 to 18 months of sample collection. Approximately fifteen 1:250 000 map sheets a year were published, representing a total of some 60 000 to 75 000 sample sites.
The URP and NGR work was supported by the addition of geochemists to the Exploration Geochemistry Subdivision: Bruce (S.B.) Ballantyne; Dan (D.R.) Boyle; Bill (W.B.) Coker; Wayne (W.G.) Goodfellow; Yvon (Y.T.) Maurice; and Barry (B.M.) Smee. For their work, the geochemists used a variety of locally written software and commercial packages for other sciences, such as BMDP and SPSS, which were available through the Computer Science Centre. It was recognized that the use of these packages did not lead to camera-ready graphics for report and publication preparation, and the locally written software was not considered ‘friendly’ by most users. A review of the Subdivision’s computer requirements was undertaken in 1979, during which the concept of an interactive graphics package to meet geochemists’ needs was revived. Between 1980 and 1981, a functional specification of a system to meet the defined needs was developed through an Unsolicited Proposal from Hickling-Smith Inc., with Tom Goss, G. Jackson and John (J.C.) Nash playing major roles. The platform would again have been a minicomputer, but due to a lack of funds the project was set aside.
In 1984 the GSC, together with the Department’s Earth Physics Branch and Computer Science Centre, acquired a DEC VAX 11/780 running the VMS operating system. This system was well suited for interactive graphics development and had virtual memory capabilities that the Cyber NOS operating system lacked. This enabled the development of IDEAS, the Interactive Data Exploration and Analysis System, with the assistance of summer and co-op programme students from the University of Waterloo, and Ottawa and Carleton Universities (Garrett 1988). Notably, IDEAS included many of the graphics displays that were being promoted for use in Exploratory Data Analysis (Garrett, in press). It met many production needs, SPSS and locally written Fortran programs meeting others, and served as a development platform. However, in 1995, as part of a Federal Government programme review, the lease on the Department's two VAX 8700s was terminated. By that time some of the original Tektronix graphics terminals used with IDEAS had been replaced by PCs running emulation software, and Access, dBASEIII and Rbase System V were being widely used on PCs to manage and compile data for further processing and archiving. From 1995 onwards, all production and development have been undertaken on PC platforms, though for a period, map preparation applications for colour plotting continued on Unix workstations.
© 2013 AAG/Geological Society of London