A classification, in the context of statistics, is a breakdown of a statistical population into subdivisions and often into a hierarchy. Normally the subdivisions partition the whole population: they are mutually exclusive and collectively exhaustive (MECE). Each statistical dataset has a classification for each of the dimensions it uses. While some classifications are standardised (NUTS, SIC 2007), others are more ad hoc, extended from other classifications, or mixed together according to the questions originally asked of the data.
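As a rough illustration of the MECE property, the following sketch checks whether a set of subdivisions partitions a population. The function name `check_partition` and the data are invented for illustration and are not taken from any published classification.

```python
from collections import Counter

def check_partition(population, subdivisions):
    """Return (missing, duplicated) for a proposed partition of `population`.

    `subdivisions` maps a subdivision label to the set of members it contains.
    """
    # Count how many subdivisions each member appears in.
    counts = Counter(m for members in subdivisions.values() for m in members)
    # Members in no subdivision: the breakdown is not collectively exhaustive.
    missing = set(population) - set(counts)
    # Members in more than one subdivision: not mutually exclusive.
    duplicated = {m for m, n in counts.items() if n > 1}
    return missing, duplicated

# Illustrative data only.
population = {"A", "B", "C", "D"}
good = {"Group 1": {"A", "B"}, "Group 2": {"C", "D"}}
bad = {"Group 1": {"A", "B"}, "Group 2": {"B", "C"}}

print(check_partition(population, good))
print(check_partition(population, bad))
```

A breakdown is MECE with respect to the population when both returned sets are empty. This sketch does not report members that fall outside the population.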
As we publish more statistical data, we will need to establish methods and tools to manage the relationships between classifications and the dimensions used in the datasets, especially as we move away from stove-piped publication of statistics that are not explicitly linked.
Harmonisation can be thought of as a process of agreeing on common definitions of things such that, in the future, we can move to using more standardised classifications. In the meantime, we can potentially create correspondences between the current classifications and these harmonised classifications such that the data can be re-published and adjusted or mapped to the new classifications.
Alignment is a process that can help with harmonisation, though it is concerned more with determining relationships between existing classifications such that we can identify what is the same, what is overlapping, what is different and how things relate to external data. Alignment must take into account the inherent constraints of classifications so as to be consistent.
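One way to picture the relationships that alignment tries to establish is to treat each category as the set of things it covers. The sketch below is an assumption about how this might be modelled; the function name `relate` and the example sets are hypothetical.

```python
def relate(a, b):
    """Describe how two categories relate, given the sets of things each covers."""
    if a == b:
        return "same"
    if a < b:
        return "narrower"     # a is wholly contained in b
    if a > b:
        return "broader"      # a wholly contains b
    if a & b:
        return "overlapping"  # some shared members, neither contains the other
    return "different"        # no members in common

# Illustrative sets only.
print(relate({"x", "y"}, {"x", "y"}))
print(relate({"x"}, {"x", "y"}))
print(relate({"x", "y"}, {"y", "z"}))
print(relate({"x"}, {"z"}))
```

In practice the full membership of a category is rarely known, so relationships like these usually have to be asserted from definitions rather than computed.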
Linking statistical data is therefore about making explicit the relationships between things, in this case between classifications, dimensions and also any external definitions. While there is some existing linkage between datasets through the use of already harmonised classifications, there are many other classifications used with various pedigrees that haven't been harmonised. The default for those classifications is to make no assumptions and publish the classification as it is used in the data, in its own name-space and separate from other classifications. Only where classifications used are demonstrably the same as each other should they be given the same name-space.
Todo: last bit alludes to the Open World Assumption; should explain.
Todo: explain what we get from linking and alignment:
We'll consider two datasets from different government departments that use similar but not quite the same breakdowns for nationality.
The Home Office publish immigration statistics including a spreadsheet called "entry clearance visas granted outside the UK".
Nationality in this dataset is based on documentation provided by the person requesting a visa, though it is complicated as countries and nationalities change over time.
In the Immigration Statistics release, some data are available by country of nationality. The country of nationality recorded is based on the documentation, generally passports, provided by the individual at the point of recording the details. For asylum statistics, the country of nationality is usually based on documentary evidence, although sometimes the asylum seeker would arrive in the UK without any such documentation.
The User Guide provides a table at the end with a list of all the countries along with their old and new groupings, according to an ONS consultation of 2014.
As part of the process of transforming the data published in the spreadsheet into something that we can use in COGS, we extract the nationality classifications that are used in the data and record them separately as a "concept scheme", a data structure for representing a tree-like breakdown of the countries and groupings. At this stage, the concept scheme is stored as a CSV file which we can just about visualise as a radial tree below. Note that the top (shown in the centre) of the breakdown is "All nationalities" and there are two other levels of groupings.
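To make the idea of a concept scheme held in a CSV file more concrete, here is a minimal sketch that reads such a file into a parent-to-children mapping. The column names (`Label`, `Notation`, `Parent Notation`) and all rows other than the top concept are assumptions made for illustration, not the actual Home Office extract.

```python
import csv
import io
from collections import defaultdict

# Hypothetical extract: column names and rows are assumptions.
CSV_TEXT = """Label,Notation,Parent Notation
All nationalities,all,
Group A,group-a,all
Subgroup A1,subgroup-a1,group-a
Country 1,country-1,subgroup-a1
Country 2,country-2,subgroup-a1
Group B,group-b,all
Subgroup B1,subgroup-b1,group-b
Country 3,country-3,subgroup-b1
"""

def load_scheme(text):
    """Return (labels, children): labels by notation, child notations by parent."""
    labels = {}
    children = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        labels[row["Notation"]] = row["Label"]
        children[row["Parent Notation"]].append(row["Notation"])
    return labels, children

def depth(children, node):
    """Number of levels below `node` (0 for a leaf)."""
    kids = children.get(node, [])
    return 0 if not kids else 1 + max(depth(children, k) for k in kids)

labels, children = load_scheme(CSV_TEXT)
roots = children[""]  # rows with an empty parent are top concepts

print([labels[r] for r in roots])
print(depth(children, roots[0]))
```

With a top concept and two levels of groupings above the countries, the depth below the top concept is three.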
The DWP publish data about national insurance number allocations to overseas nationals. The summary tables are published as a spreadsheet.
Nationality in this dataset is recorded at the point of national insurance number (NINo) registration, "based on passport or other evidence of identity".
The nationality variable is quality checked for completeness. Any registrations recorded with a nationality that no longer represents an existing country are reclassified to reflect the most appropriate nationality at the time of publication.
Datasets from the Department for Work and Pensions and the Home Office use similar but slightly different breakdowns to classify the nationality of people, perhaps due to the reclassification mentioned in the methodology.
Again, as part of the transformation process, we extract the nationality classifications that are used in the data and record them in a concept scheme, currently stored in another CSV file and visualised in a radial tree below. Note this time that the top of the tree is "World" and there are again two levels of groupings.
As they stand, the two classifications use separate name-spaces for the identities of the countries and groupings, so that, for instance, the Home Office has <http://gss-data.org.uk/def/class/ho-country-of-nationality/germany> while DWP uses <http://gss-data.org.uk/def/class/dwp-nationality/germany> for what may or may not be the same thing.
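The effect of separate name-spaces can be sketched in a few lines. The two base URIs come from the examples above; the helper names `concept_uri` and `known_same`, and the idea of holding equivalences in a simple set, are assumptions for illustration only.

```python
HO = "http://gss-data.org.uk/def/class/ho-country-of-nationality/"
DWP = "http://gss-data.org.uk/def/class/dwp-nationality/"

def concept_uri(namespace, notation):
    """Build a concept identifier inside one classification's name-space."""
    return namespace + notation

ho_germany = concept_uri(HO, "germany")
dwp_germany = concept_uri(DWP, "germany")

# Explicit alignment statements; empty until an equivalence is asserted.
same_as = set()

def known_same(a, b):
    """True only if a and b are identical or have been explicitly aligned."""
    return a == b or (a, b) in same_as or (b, a) in same_as

print(known_same(ho_germany, dwp_germany))  # nothing asserted yet
same_as.add((ho_germany, dwp_germany))
print(known_same(ho_germany, dwp_germany))
```

A negative answer here means only that the two identifiers are not yet known to be the same, not that they are known to differ.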
Using OpenRefine, we can reconcile the two codelists with Wikidata to relate each nationality with a corresponding Wikidata entity. Most of these entities should be countries, though some are territories, some are groups and some have no obvious correspondence.
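Once each code has been reconciled, the two codelists can be joined on the shared Wikidata identifier. The sketch below assumes the reconciliation output has been reduced to a mapping from concept URI to Wikidata entity ID; the `france` and `example-...` entries and the function name `align` are hypothetical, and this is not the OpenRefine API.

```python
HO = "http://gss-data.org.uk/def/class/ho-country-of-nationality/"
DWP = "http://gss-data.org.uk/def/class/dwp-nationality/"

# Hypothetical reconciliation output: concept URI -> Wikidata entity ID,
# or None where no obvious correspondence was found.
ho_reconciled = {
    HO + "germany": "Q183",
    HO + "france": "Q142",
    HO + "example-ho-grouping": None,
}
dwp_reconciled = {
    DWP + "germany": "Q183",
    DWP + "france": "Q142",
    DWP + "example-dwp-grouping": None,
}

def align(left, right):
    """Pair concepts that share a Wikidata entity; list the rest separately."""
    by_entity = {}
    for uri, qid in right.items():
        if qid is not None:
            by_entity.setdefault(qid, []).append(uri)
    matched, unmatched_left = [], []
    for uri, qid in left.items():
        partners = by_entity.get(qid, []) if qid is not None else []
        if partners:
            matched.extend((uri, p) for p in partners)
        else:
            unmatched_left.append(uri)
    matched_right = {p for _, p in matched}
    unmatched_right = [u for u in right if u not in matched_right]
    return matched, unmatched_left, unmatched_right

matched, ho_only, dwp_only = align(ho_reconciled, dwp_reconciled)
print(matched)
print(ho_only)
print(dwp_only)
```

Sharing a Wikidata entity is evidence that two codes refer to the same thing rather than proof, so matches of this kind would still need review before the codes are given a common identity.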
A combined classification is then derived automatically as follows:
At this stage the diagram tells us that:
Looking through the results of the Wikidata reconciliation, some issues can be resolved:
While this combined breakdown is gradually making more sense, what we're ultimately hoping for is that the two populations for DWP and the Home Office can be shown to be the same: that is that "All nationalities" and "World" should represent the same population, just broken down in different ways.
We need to concentrate on what the current differences are by filtering out those countries and groupings that have already been deemed to be the same (the purple ones):
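This kind of filtering can be sketched as pruning a tree. The example assumes each node of the combined classification is tagged with a `source` of `"both"`, `"ho"` or `"dwp"`; that tagging, the nested dictionary structure and the labels are all assumptions made for illustration.

```python
def prune_matched(node):
    """Return a copy of `node` without matched subtrees, or None if nothing differs.

    A node is kept when it is not marked "both", or when any descendant is kept,
    so that the remaining differences keep their position in the hierarchy.
    """
    pruned = (prune_matched(c) for c in node.get("children", []))
    kept = [c for c in pruned if c is not None]
    if node["source"] == "both" and not kept:
        return None
    return {"label": node["label"], "source": node["source"], "children": kept}

# Illustrative combined classification only.
combined = {
    "label": "Top", "source": "both",
    "children": [
        {"label": "Grouping 1", "source": "both", "children": [
            {"label": "Country 1", "source": "both"},
            {"label": "Country 2", "source": "ho"},
        ]},
        {"label": "Grouping 2", "source": "both", "children": [
            {"label": "Country 3", "source": "both"},
        ]},
        {"label": "Grouping 3", "source": "dwp", "children": [
            {"label": "Country 4", "source": "both"},
        ]},
    ],
}

differences = prune_matched(combined)
print([c["label"] for c in differences["children"]])
```

Matched ancestors are retained wherever something beneath them still differs, which keeps each outstanding difference in context.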