Aligning Classifications

A classification, in the context of statistics, is a breakdown of a statistical population into subdivisions, often arranged in a hierarchy. Normally the subdivisions partition the whole population: they are mutually exclusive and collectively exhaustive (MECE). Each statistical dataset has a classification for each of the dimensions it uses. While some classifications are standardised (NUTS, SIC 2007), others are more ad hoc, extended from other classifications, or mixed together depending on the questions originally asked of the data.
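The MECE property can be checked mechanically. The following is a minimal sketch, assuming one level of a classification is represented as a mapping from subdivision label to the set of members it covers. The function name `check_mece` and the toy data are illustrative only and are not taken from any published tooling.

```python
from itertools import combinations

def check_mece(population, subdivisions):
    """Return (overlaps, missing) for a candidate partition.

    population   -- set of all members that should be covered
    subdivisions -- dict mapping subdivision label -> set of members
    """
    # Mutually exclusive: no member may appear in two subdivisions.
    overlaps = {
        (a, b): subdivisions[a] & subdivisions[b]
        for a, b in combinations(sorted(subdivisions), 2)
        if subdivisions[a] & subdivisions[b]
    }
    # Collectively exhaustive: every member must appear somewhere.
    covered = set().union(*subdivisions.values()) if subdivisions else set()
    missing = population - covered
    return overlaps, missing

# Toy data, for illustration only.
population = {"France", "Germany", "India", "Japan"}
good = {"Europe": {"France", "Germany"}, "Asia": {"India", "Japan"}}
bad = {"Europe": {"France", "Germany"}, "G7": {"Germany", "Japan"}}
```

A breakdown is MECE exactly when both returned collections are empty. Here `good` passes, while `bad` both double-counts Germany and leaves India uncovered.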

As we publish more statistical data, we will need to establish methods and tools to manage the relationships between classifications and the dimensions used in the datasets, especially as we move away from stove-piped publication of statistics that are not explicitly linked.

Harmonisation can be thought of as a process of agreeing on common definitions of things such that, in the future, we can move to using more standardised classifications. In the meantime, we can potentially create correspondence between the current classifications and these harmonised classifications such that the data can be re-published and adjusted or mapped to the new classifications.

Alignment is a process that can help with harmonisation, though it is concerned more with determining relationships between existing classifications such that we can identify what is the same, what is overlapping, what is different and how things relate to external data. Alignment must take into account the inherent constraints of classifications so as to be consistent.

Linking statistical data is therefore about making explicit the relationships between things, in this case between classifications, dimensions and also any external definitions. While there is some existing linkage between datasets through the use of already harmonised classifications, there are many other classifications used with various pedigrees that haven't been harmonised. The default for those classifications is to make no assumptions and publish the classification as it is used in the data, in its own name-space and separate from other classifications. Only where classifications used are demonstrably the same as each other should they be given the same name-space.

Todo: last bit alludes to the Open World Assumption; should explain.

Todo: explain what we get from linking and alignment:

We can ask cross cutting questions about e.g. a single country's migration and trade without having to work out what different identifiers are used in the different datasets.
We're not presented with multiple filters of what appear to be the same thing (world, whole world, all nationalities; multiple links to Germany).
By linking to external information we can follow the links to find out, e.g., population, flag, language.

Migration Example 1

We'll consider two datasets from different government departments that use similar but not quite the same breakdowns for nationality.

The Home Office

The Home Office publish immigration statistics including a spreadsheet called "entry clearance visas granted outside the UK".

Nationality in this dataset is based on documentation provided by the person requesting a visa, though it is complicated as countries and nationalities change over time.

In the Immigration Statistics release, some data are available by country of nationality. The country of nationality recorded is based on the documentation, generally passports, provided by the individual at the point of recording the details. For asylum statistics, the country of nationality is usually based on documentary evidence, although sometimes the asylum seeker would arrive in the UK without any such documentation.

Home Office User Guide

The User Guide provides a table at the end with a list of all the countries along with their old and new groupings, according to an ONS consultation of 2014.

As part of the process of transforming the data published in the spreadsheet into something that we can use in COGS, we extract the nationality classifications that are used in the data and record them separately as a "concept scheme": a data structure for representing a tree-like breakdown of the countries and groupings. At this stage, the concept scheme is stored as a CSV file which we can just about visualise as a radial tree below. Note that the top (shown in the centre) of the breakdown is "All nationalities" and there are two other levels of groupings.
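To make the idea of a CSV-backed concept scheme concrete, here is a minimal sketch of loading one into a tree. The column names (`Label`, `Notation`, `Parent Notation`) and the sample rows are assumptions made for illustration; the layout of the actual file is not shown here, and the parent/child relationships in the sample are made up.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt: column names and hierarchy are assumptions.
SAMPLE = """Label,Notation,Parent Notation
All nationalities,all-nationalities,
Europe,europe,all-nationalities
EU 14,eu-14,europe
Germany,germany,eu-14
"""

def load_concept_scheme(text):
    """Parse the CSV into (labels, children), both keyed by notation."""
    labels = {}
    children = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        labels[row["Notation"]] = row["Label"]
        parent = row["Parent Notation"]
        if parent:  # an empty parent marks a top concept
            children[parent].append(row["Notation"])
    return labels, dict(children)

def top_concepts(labels, children):
    """Concepts that never appear as anyone's child."""
    has_parent = {c for kids in children.values() for c in kids}
    return [n for n in labels if n not in has_parent]

def depth(children, node):
    """Number of levels below `node` (0 for a leaf)."""
    return 1 + max((depth(children, c) for c in children.get(node, [])), default=-1)

labels, children = load_concept_scheme(SAMPLE)
```

With this sample, `top_concepts(labels, children)` yields the single root `all-nationalities`, and the two grouping levels plus the country level sit beneath it.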

Department for Work and Pensions

The DWP publish data about national insurance number allocations to overseas nationals. The summary tables are published as a spreadsheet.

Nationality in this dataset is recorded at the point of national insurance number (NINo) registration, "based on passport or other evidence of identity".

The nationality variable is quality checked for completeness. Any registrations recorded with a nationality that no longer represents an existing country are reclassified to reflect the most appropriate nationality at the time of publication.

DWP Background information and methodology

Datasets from the Department for Work and Pensions and the Home Office use similar but slightly different breakdowns to classify the nationality of people, perhaps due to the reclassification mentioned in the methodology.

Again, as part of the transformation process, we extract the nationality classifications that are used in the data and record them in a concept scheme, currently stored in another CSV file and visualised in a radial tree below. Note this time that the top of the tree is "World" and there are again two levels of groupings.

Linking things together

As they stand, the two classifications use separate name-spaces for the identities of the countries and groupings, so that for instance, the Home Office has <http://gss-data.org.uk/def/class/ho-country-of-nationality/germany> while DWP uses <http://gss-data.org.uk/def/class/dwp-nationality/germany> for what may or may not be the same thing.

Using OpenRefine, we can reconcile the two codelists with Wikidata to relate each nationality with a corresponding Wikidata entity. Most of these entities should be countries, though some are territories, some are groups and some have no obvious correspondence.

A combined classification is then derived automatically as follows:

The MECE constraints of the two separate classifications are made explicit as simple rules / constraints.
The Wikidata link is made "inverse functional", meaning that if two different identifiers are linked to the same Wikidata entity, then the two identifiers represent the same thing.
The dimension for each dataset is asserted to be the same.
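The effect of the inverse functional rule can be sketched in a few lines. This is not the reasoner itself, only an illustration of the inference it draws. The two Germany identifiers are the ones quoted above; the France identifier and the reconciliation results are hypothetical, chosen to show one matched and one unmatched case.

```python
from collections import defaultdict

HO = "http://gss-data.org.uk/def/class/ho-country-of-nationality/"
DWP = "http://gss-data.org.uk/def/class/dwp-nationality/"

# Hypothetical reconciliation output: identifier -> Wikidata entity ID.
# Identifiers that have not been reconciled are simply absent.
wikidata_link = {
    HO + "germany": "Q183",
    DWP + "germany": "Q183",
    HO + "france": "Q142",  # assumed identifier, no DWP counterpart listed
}

def infer_same_as(links):
    """Treat the link as inverse functional: identifiers sharing a target are the same."""
    by_entity = defaultdict(set)
    for identifier, entity in links.items():
        by_entity[entity].add(identifier)
    # Only groups with more than one identifier yield a new inference.
    return [ids for ids in by_entity.values() if len(ids) > 1]

same_groups = infer_same_as(wikidata_link)
```

This mirrors the semantics of `owl:InverseFunctionalProperty` in OWL. Note that nothing is concluded about the unmatched France identifier: it is neither the same as nor different from anything in the other classification.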

An automated reasoner is then run to work out what the consequences are for the resulting combined classification. In the diagram, the nodes are coloured as follows:

Purple: Country/group is inferred to be the same in both classifications.
Yellow: Country/group is only in the Home Office classification.
Cyan: Country/group is only in the DWP classification.
Red: Country/group breaks some constraint/rule.
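The colour legend amounts to a simple decision procedure over the reasoner's output. A minimal sketch follows, assuming the reasoner hands back groups of identifiers inferred to be the same and a set of identifiers that violate a constraint. The function name, the argument shapes and the shortened identifiers are all assumptions for illustration.

```python
def colour_nodes(ho_ids, dwp_ids, same_groups, violations=frozenset()):
    """Assign a colour to each identifier following the legend above.

    ho_ids, dwp_ids -- sets of identifiers in each classification
    same_groups     -- iterable of sets of identifiers inferred to be the same
    violations      -- identifiers the reasoner flagged as breaking a rule
    """
    colours = {}
    # Default: each node belongs to only one classification.
    for identifier in ho_ids | dwp_ids:
        colours[identifier] = "yellow" if identifier in ho_ids else "cyan"
    # A group spanning both classifications is shared.
    for group in same_groups:
        if group & ho_ids and group & dwp_ids:
            for identifier in group:
                colours[identifier] = "purple"
    # Violations take precedence over everything else.
    for identifier in violations:
        colours[identifier] = "red"
    return colours

# Shortened, hypothetical identifiers.
ho = {"ho/germany", "ho/aruba", "ho/cyprus-northern-part-of"}
dwp = {"dwp/germany", "dwp/aruba-and-curacao"}
colours = colour_nodes(
    ho, dwp,
    same_groups=[{"ho/germany", "dwp/germany"}],
    violations={"ho/cyprus-northern-part-of"},
)
```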

At this stage the diagram tells us that:

A few countries haven't been reconciled yet or are in one but not the other classification.
"Cyprus (Northern part of)" has broken some constraint.
Some groupings have been inferred to be the same based on their sub-grouping being the same, e.g. "EU 14".
Some groupings, although they've been inferred to be the same, still show up multiple times as they are in different subdivisions.
"World" and "All nationalities" are still distinct.
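The inference that two groupings are the same because their sub-groupings are the same can be sketched as a comparison of member sets. This relies on each grouping being exactly the union of its listed members, which is what the collectively exhaustive constraint provides. The identifiers are shortened and the "EU 14" membership is truncated to two countries for brevity; both are illustrative assumptions.

```python
def canonical(identifier, same_as):
    """Map an identifier to its representative; unmapped ones stand for themselves."""
    return same_as.get(identifier, identifier)

def matching_groupings(ho_groups, dwp_groups, same_as):
    """Pair up groupings whose canonicalised member sets are identical."""
    dwp_by_members = {
        frozenset(canonical(m, same_as) for m in members): name
        for name, members in dwp_groups.items()
    }
    pairs = []
    for name, members in ho_groups.items():
        key = frozenset(canonical(m, same_as) for m in members)
        if key in dwp_by_members:
            pairs.append((name, dwp_by_members[key]))
    return pairs

# Hypothetical, truncated data.
same_as = {"dwp/germany": "ho/germany", "dwp/france": "ho/france"}
ho_groups = {"ho/eu-14": {"ho/germany", "ho/france"}}
dwp_groups = {
    "dwp/eu-14": {"dwp/germany", "dwp/france"},
    "dwp/asia": {"dwp/india"},
}
pairs = matching_groupings(ho_groups, dwp_groups, same_as)
```

A single unreconciled member is enough to stop two groupings from matching, which is one reason the top-level populations can remain distinct while many of their descendants have already been aligned.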

Note that the reasoner uses the "open world assumption", which means that the absence of a statement is not evidence that it is false: until we can demonstrate that two things are the same (or different), we can't assume either.
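One way to picture the open world assumption is as a comparison with three possible answers rather than two. The sketch below is a deliberate simplification: it only consults explicit statements and does not follow chains of equalities as a real reasoner would. The names and sample statements are illustrative.

```python
from enum import Enum

class Verdict(Enum):
    SAME = "same"
    DIFFERENT = "different"
    UNKNOWN = "unknown"

def compare(a, b, same_pairs, different_pairs):
    """Answer only from explicit statements; never default to 'different'."""
    pair = frozenset((a, b))
    if a == b or pair in same_pairs:
        return Verdict.SAME
    if pair in different_pairs:
        return Verdict.DIFFERENT
    # Nothing has been stated either way, so the question stays open.
    return Verdict.UNKNOWN

# Hypothetical statements, using shortened identifiers.
same_pairs = {frozenset(("ho/germany", "dwp/germany"))}
different_pairs = {frozenset(("ho/cyprus", "ho/cyprus-northern-part-of"))}
```

Under a closed world assumption the last branch would return `DIFFERENT`. Here, comparing "World" with "All nationalities" returns `UNKNOWN` until something demonstrates the answer.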

Looking through the results of the Wikidata reconciliation, some issues can be resolved:

The Home Office distinguish between Cyprus and "Cyprus (Northern part of)", which were wrongly given the same Wikidata ID Q229, whereas the Northern part has ID Q23681.
Macedonia should have ID Q221, not Q83958.
"Bonaire, Sint Eustatius and Saba" can be given ID Q27561.
The DWP lists "Aruba and Curaçao", which has no single ID in Wikidata because it covers two islands. There is a grouping, known as the ABC islands, for Aruba, Bonaire and Curaçao with ID Q19386, but it also includes Bonaire. For the purposes of this alignment, we can just state that the DWP's Aruba and Curaçao class is the same as the (disjoint) union of the Home Office's two separate classes: Aruba and Curaçao.
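These fixes can be recorded as data rather than applied by hand, so that they survive a re-run of the reconciliation. A minimal sketch, assuming the automatic links are held as an identifier-to-ID mapping: the Wikidata IDs are those given in the list above, while the shortened identifiers, the function names and the `union_of` structure are hypothetical.

```python
# Automatic (and partly wrong) reconciliation output, shortened identifiers.
links = {
    "ho/cyprus": "Q229",
    "ho/cyprus-northern-part-of": "Q229",  # wrong: same ID as Cyprus
}

# Manual overrides, using the IDs given in the list above.
corrections = {
    "ho/cyprus-northern-part-of": "Q23681",
    "macedonia": "Q221",
    "bonaire-sint-eustatius-and-saba": "Q27561",
}

def apply_corrections(links, corrections):
    """Return a new mapping with the manual overrides applied on top."""
    fixed = dict(links)
    fixed.update(corrections)
    return fixed

# A class with no single Wikidata entity is stated as a union of others.
union_of = {"dwp/aruba-and-curacao": ("ho/aruba", "ho/curacao")}

def atoms(identifier, union_of):
    """Expand an identifier into the set of atomic classes it covers."""
    parts = union_of.get(identifier)
    if parts is None:
        return {identifier}
    expanded = [atoms(p, union_of) for p in parts]
    total = set().union(*expanded)
    # A disjoint union must not count any atom twice.
    if sum(len(e) for e in expanded) != len(total):
        raise ValueError(f"parts of {identifier} overlap")
    return total

fixed = apply_corrections(links, corrections)
```

Keeping the original `links` untouched makes it easy to see which entries were corrected and why.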

This results in the following picture:

Woods and trees

While this combined breakdown is gradually making more sense, what we're ultimately hoping for is that the two populations for DWP and the Home Office can be shown to be the same. In other words, "All nationalities" and "World" should represent the same population, just broken down in different ways.

We need to concentrate on what the current differences are by filtering out those countries and groupings that have already been deemed to be the same (the purple ones):