Designing a CSV

A transcribed video walkthrough

Pre-requisites to follow along

csvcubed must be installed before you proceed. If it is not, please go back to installation.

How csvcubed interprets a CSV

csvcubed needs to understand how your statistical data is structured in order to make it more machine-readable. There are two ways to tell it; this quick-start covers the configuration by convention approach. Configuration by convention requires a CSV in a standard data shape with conventional column titles, which you then fill out with your data, as explained briefly below.

Structuring your data

The standard shape of data is the recommended way to shape your data for csvcubed. It requires that your CSV has the following columns:

...Identifying characteristics... | Value | Measure | Unit
...                               | 3.4   | Length  | Feet
...                               | 3.6   | Length  | Feet

In the above table:

  • identifying characteristics are one or more columns which identify the sub-set of the population that has been observed in a given row. These are called dimensions elsewhere in the documentation.
  • the Value column contains the value which has been observed or measured; there is only ever one observed value per row in the standard shape.
  • the Measure column describes what has been observed or measured; note that the measure should not include any information about the units of measure.
  • the Unit column describes the unit of measure in which the Value has been recorded.

In the configuration by convention approach, csvcubed uses the column names to interpret what each column contains. Using the column titles Value, Measure and Unit, or one of their synonyms, in your CSV will work. All other columns are assumed to be identifying characteristics (dimensions).
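The rule described above can be illustrated with a minimal sketch. This is not csvcubed's own implementation: the function name classify_columns and the sample header (Region, Period) are hypothetical, and only the three canonical titles are recognised here because the accepted synonyms are not listed on this page.

```python
import csv
import io

# The three conventional titles named in this guide. csvcubed also accepts
# synonyms for them; those are deliberately left out of this sketch.
CONVENTIONAL_TITLES = {"Value", "Measure", "Unit"}


def classify_columns(header):
    """Assign a role to each column using nothing but its title."""
    roles = {}
    for title in header:
        if title in CONVENTIONAL_TITLES:
            roles[title] = title.lower()
        else:
            # Any title that is not recognised is assumed to be an
            # identifying characteristic (a dimension).
            roles[title] = "dimension"
    return roles


# Hypothetical header row, used only for illustration.
sample = "Region,Period,Value,Measure,Unit\n"
header = next(csv.reader(io.StringIO(sample)))

for title, role in classify_columns(header).items():
    print(f"{title}: {role}")
```

The point of the sketch is that no extra configuration is consulted: the column title alone decides the role, which is why the conventional titles matter.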

Adding your data

First, we start by taking the above shape and adding columns for each of your identifying characteristics (dimensions).

From here on, we will be creating a data set to represent the competition winners in Eurovision. Our CSV will be structured as per the following extract:

Year | Entrant            | Song     | Language | Value | Measure         | Unit
1974 | ABBA               | Waterloo | English  | 1     | Final Rank      | Unitless
1974 | ABBA               | Waterloo | English  | 24    | Final Points    | Unitless
1974 | ABBA               | Waterloo | English  | 6     | People on Stage | Number
2008 | Charlotte Perrelli | Hero     | English  | 5     | People on Stage | Number
2008 | Charlotte Perrelli | Hero     | English  | 18    | Final Rank      | Unitless
2008 | Charlotte Perrelli | Hero     | English  | 47    | Final Points    | Unitless
Year,Entrant,Song,Language,Value,Measure,Unit
1974,ABBA,Waterloo,English,1,Final Rank,Unitless
1974,ABBA,Waterloo,English,24,Final Points,Unitless
1974,ABBA,Waterloo,English,6,People on Stage,Number
2008,Charlotte Perrelli,Hero,English,5,People on Stage,Number
2008,Charlotte Perrelli,Hero,English,18,Final Rank,Unitless
2008,Charlotte Perrelli,Hero,English,47,Final Points,Unitless

Here, Year, Entrant, Song and Language are the cube's identifying dimensions.
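If you prefer to generate the file from a script rather than a spreadsheet, the extract can be produced with Python's standard csv module. This is only a sketch: the names DIMENSIONS, STANDARD_COLUMNS and to_csv are illustrative and not part of csvcubed, and the rows are the six from the extract above.

```python
import csv
import io

# Dimension columns come first, followed by the three conventional columns.
DIMENSIONS = ["Year", "Entrant", "Song", "Language"]
STANDARD_COLUMNS = ["Value", "Measure", "Unit"]

# One observed value per row, as the standard shape requires.
observations = [
    (1974, "ABBA", "Waterloo", "English", 1, "Final Rank", "Unitless"),
    (1974, "ABBA", "Waterloo", "English", 24, "Final Points", "Unitless"),
    (1974, "ABBA", "Waterloo", "English", 6, "People on Stage", "Number"),
    (2008, "Charlotte Perrelli", "Hero", "English", 5, "People on Stage", "Number"),
    (2008, "Charlotte Perrelli", "Hero", "English", 18, "Final Rank", "Unitless"),
    (2008, "Charlotte Perrelli", "Hero", "English", 47, "Final Points", "Unitless"),
]


def to_csv(rows):
    """Serialise the rows as CSV text with the standard-shape header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(DIMENSIONS + STANDARD_COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(observations), end="")
```

Using the csv module rather than joining strings by hand means any value that contains a comma is quoted correctly.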

Note that this dataset includes multiple measures: Final Rank, Final Points and People on Stage have each been recorded for every contestant.
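To see how several measures sit side by side in the standard shape, the rows can be grouped by contestant. The sketch below is a reading aid, not something csvcubed requires: the helper name measures_by_entrant is hypothetical, and the embedded text repeats the six-row extract so that the example is self-contained.

```python
import csv
import io
from collections import defaultdict

EXTRACT = """Year,Entrant,Song,Language,Value,Measure,Unit
1974,ABBA,Waterloo,English,1,Final Rank,Unitless
1974,ABBA,Waterloo,English,24,Final Points,Unitless
1974,ABBA,Waterloo,English,6,People on Stage,Number
2008,Charlotte Perrelli,Hero,English,5,People on Stage,Number
2008,Charlotte Perrelli,Hero,English,18,Final Rank,Unitless
2008,Charlotte Perrelli,Hero,English,47,Final Points,Unitless
"""


def measures_by_entrant(csv_text):
    """Collect each contestant's measures as {measure: (value, unit)}."""
    grouped = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["Year"], row["Entrant"])
        # Values stay as strings because that is how the csv module reads them.
        grouped[key][row["Measure"]] = (row["Value"], row["Unit"])
    return dict(grouped)


for (year, entrant), measures in measures_by_entrant(EXTRACT).items():
    print(year, entrant)
    for measure, (value, unit) in measures.items():
        print(f"  {measure}: {value} ({unit})")
```

Each contestant ends up with three entries, one per measure, which mirrors the three rows that contestant occupies in the CSV.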

You can download the full CSV here.

Next

The next step is to build a CSV-W.

Optional: further reading

The other way to configure a CSV-W cube is the explicit configuration approach, in which you write a JSON configuration file that tells csvcubed exactly how to interpret your data.