Designing a CSV

A transcribed video walkthrough

Prerequisites to follow along

csvcubed must be installed before you proceed. If it is not, please go back to installation.
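If you are unsure whether csvcubed is available in your current Python environment, a small check like the one below may help. It is only a sketch: it assumes csvcubed was installed as a Python package named `csvcubed`, and the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package="csvcubed"):
    """Return the installed version of `package`, or None if it is missing.

    Assumption: the package was installed under the distribution name given.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version:
    print(f"csvcubed {version} is installed")
else:
    print("csvcubed is not installed; please go back to installation")
```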

How csvcubed interprets a CSV

csvcubed needs to understand how your statistical data is structured in order to make it more machine readable. There are two ways to tell it; this quick-start covers the configuration by convention approach. With configuration by convention, you take a standard CSV data shape with conventional column titles and fill it in with your data, as explained briefly below.

Structuring your data

Standard shape

The standard shape of data is the recommended way to start using csvcubed. It requires that your CSV has the following columns:

...Identifying characteristics...	Value	Measure	Unit
...	3.4	Length	Feet
...	3.6	Length	Feet

In the above table:

identifying characteristics are one or more columns which identify the sub-set of the population that has been observed in a given row. These are called dimensions elsewhere in the documentation.
the Value column contains the value which has been observed or measured; there is only ever one observed value per row in the standard shape.
the Measure column describes what has been observed or measured; note that the measure should not include any information about the units of measure.
the Unit column describes the unit of measure in which the Value has been recorded.

In the configuration by convention approach, csvcubed uses the column titles to work out what each column contains. Using the column titles Value, Measure and Unit, or one of their synonyms, in your CSV will work. All other columns are assumed to be identifying characteristics (dimensions).
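The title-based convention can be illustrated with a short Python sketch. This is not csvcubed's own implementation; it only mirrors the rule described above. It recognises the three exact titles Value, Measure and Unit and deliberately leaves out synonyms, since they are not listed in this guide. The `Period` and `Region` columns and the function name `classify_columns` are hypothetical and exist only for this example.

```python
import csv
import io

# Assumption: only the three exact titles named in this guide are recognised.
# csvcubed also accepts synonyms, but they are not listed here, so they are omitted.
RESERVED_TITLES = {
    "Value": "observed value",
    "Measure": "measure",
    "Unit": "unit",
}


def classify_columns(header):
    """Map each column title to the role it would play under the convention."""
    return {title: RESERVED_TITLES.get(title, "dimension") for title in header}


# Hypothetical data: "Period" and "Region" stand in for identifying characteristics.
sample = "Period,Region,Value,Measure,Unit\n2020,North,3.4,Length,Feet\n"
header = next(csv.reader(io.StringIO(sample)))
roles = classify_columns(header)

for title, role in roles.items():
    print(f"{title}: {role}")
```

Any title that is not one of the reserved three falls through to "dimension", which matches the rule that all other columns are treated as identifying characteristics.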

Pivoted shape

Once you have gained some familiarity with using csvcubed, you may find that the pivoted shape is a better way to represent your data. See the Shaping your data section for more information on the pivoted shape.

Adding your data

First, take the above shape and add a column for each of your identifying characteristics (dimensions).

From here on, we will be creating a data set to represent the competition winners in Eurovision. Our CSV will be structured as per the following extract, where Year, Entrant, Song and Language are the cube's identifying dimensions. Note that we have included multiple measures in this data set, as Final Rank, Final Points and People on Stage are recorded for each contestant:

Year	Entrant	Song	Language	Value	Measure	Unit
1974	ABBA	Waterloo	English	1	Final Rank	Unitless
1974	ABBA	Waterloo	English	24	Final Points	Unitless
1974	ABBA	Waterloo	English	6	People on Stage	Number
2008	Charlotte Perrelli	Hero	English	5	People on Stage	Number
2008	Charlotte Perrelli	Hero	English	18	Final Rank	Unitless
2008	Charlotte Perrelli	Hero	English	47	Final Points	Unitless

Year,Entrant,Song,Language,Value,Measure,Unit
1974,ABBA,Waterloo,English,1,Final Rank,Unitless
1974,ABBA,Waterloo,English,24,Final Points,Unitless
1974,ABBA,Waterloo,English,6,People on Stage,Number
2008,Charlotte Perrelli,Hero,English,5,People on Stage,Number
2008,Charlotte Perrelli,Hero,English,18,Final Rank,Unitless
2008,Charlotte Perrelli,Hero,English,47,Final Points,Unitless

You can download the full CSV from GitHub.
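Before handing a CSV like this to csvcubed, it can be useful to confirm that it really follows the standard shape: one observed value per row, with each combination of dimensions and measure appearing only once. The Python sketch below does this for the extract above. It is an illustrative check written for this guide, not part of csvcubed, and the names `EXTRACT`, `DIMENSIONS` and `observations` are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict

# The extract shown above, embedded so the example is self-contained.
EXTRACT = """Year,Entrant,Song,Language,Value,Measure,Unit
1974,ABBA,Waterloo,English,1,Final Rank,Unitless
1974,ABBA,Waterloo,English,24,Final Points,Unitless
1974,ABBA,Waterloo,English,6,People on Stage,Number
2008,Charlotte Perrelli,Hero,English,5,People on Stage,Number
2008,Charlotte Perrelli,Hero,English,18,Final Rank,Unitless
2008,Charlotte Perrelli,Hero,English,47,Final Points,Unitless
"""

# Every column other than Value, Measure and Unit is a dimension.
DIMENSIONS = ["Year", "Entrant", "Song", "Language"]

rows = list(csv.DictReader(io.StringIO(EXTRACT)))

# Group the observations by their dimension values, then by measure.
observations = defaultdict(dict)
for row in rows:
    key = tuple(row[name] for name in DIMENSIONS)
    measure = row["Measure"]
    if measure in observations[key]:
        # The same dimensions and measure appearing twice would be ambiguous.
        raise ValueError(f"Duplicate observation for {key} / {measure}")
    observations[key][measure] = (float(row["Value"]), row["Unit"])

for key, measures in observations.items():
    print(key)
    for measure, (value, unit) in measures.items():
        print(f"  {measure}: {value} ({unit})")
```

Each contestant ends up with three entries, one per measure, which shows how the dimension columns together with the Measure column identify a single observed value.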

Optional: further reading

The other way to configure a CSV-W cube is the explicit configuration approach, in which you write a JSON configuration file that tells csvcubed exactly how to interpret your data.