slang is a metadata langauge that allows you to describe the structure of a spreadsheet so that you can validate and extract the data from it. It can be used to interface human readable spreadsheets with automated parts of a business process or even just to extract tabular data from a spreadsheet that you have received.
slang takes inspiration from CSV Schema and unpivotr but is aimed at different users and sits in a different part of the data pipeline. The goal is to be able to describe business-domain level data with a rich, high-level type system, validate it and extract it from the kinds of spreadsheets that people use in their day-to-day work. After the data has been extracted it can be more easily converted to tabular, machine readable formats such as a CSV file with an associated CSV on the Web metafile or JSON. Because we know a lot about the type of each piece of data we can start to link and join it with other datasets that contain similar types in a way that would not be possible if the data was only available via a spreadsheet program.
Check out the language definition to see what you can put into
slang metadata files.
We started off with a simple spreadsheet that could be used by an office manager to decide which office supplies to order.
This spreadsheet contains the names of the items that need to be ordered as well as their prices and the quantities required. We don’t know if the prices are inclusive or exclusive of VAT.
You can extract the data from this spreadsheet by using the supplied
slang metadata file by invoking
make poc-json at the command line.
The program parses the
poc.slang metadata file, processes the spreadsheet file and then outputs the data in JSON.
poc.slang file is easy for regular users to edit because it is also a spreadsheet!
It is a simple Tab Separated file that contains directives and pragmas specified in the language definition. The user declares the data type to be used for each column (using the name of the column) as well as the location of the header row itself and, lastly, the cell range where the data exists.
The metadata file can also be edited in a text editor.
Finally, we did some mockups of an interactive service where a user would be guided through the process of selecting the interesting parts of their spreadsheet and choosing the data types from a list. This service would build the metadata file for them based on their choices.
Try the demo for yourself, or run
make render from the
widgets/ directory to build your own copy. Once its built, open the
out.html file that appears in the
Here the user selects the header row and then clicks the “Save and continue” button. This allows the service to add the
declare-header directive to the metadata file.
Here the user selects the cells that contain the data and then clicks the “Save and continue” button. This allows the service to add the
declare-data directive to the metadata file.
Here the user selects one of the column headers so that they can tell the service what type of data is to be found in that column.
Now the user selects the type of data from the list.
When the user clicks “Save and continue” the metadata file would be produced and used by the service to automatically extract the data from their spreadsheet. The data could then be passed on for further processing rather than requiring manual processing and re-keying by a person.
make run from the repository’s root directory you can see the program parsing the supplied
poc.slang metadata file and extracting the data from the spreadsheet into a Python array.
This program doesn’t do anything with the data so can be used as a template for converting the data to your own formats. The Python classes have pretty-printers so that you can see what the program knows about each piece of data and what it has done with it.
Here are some suggestions for how this program could be improved and extended.
The commit history contains the documentation you will need to understand the rationale behind the design. In the screenshot above we can see documentations about the deserialisers for each part of the metadata language itself. Each argument in
poc.slang will be ingested by one of the
unescape(). The code currently assumes that the metadata does not contain any characters that need to be escaped. The escaping rules are specified in the language definition.
declare-typeto take a cell from the header range as its first argument, rather than the name of the header. This involves using the cell deserialiser and writing another deserialiser that wraps string and cell. This deserialiser can be substituted at the appropriate place in the
poc-json-colsbranch contains a demo with a spreadsheet arranged in columns rather than rows. The existing code was written with this in mind but it doesn’t produce the expected result.
slangclass. It’s constructor currently takes a file descriptor that it expects to read lines from. The
deserialise()should be reusable but the new stream format will need its own implementation of
class Typeneeds an interface and the existing types need validators and extractors. An intermediate format needs to be defined and implemented so that the slang types are automatically defined across all spreadsheet formats.
textareato the web demo that builds up the contents of the metadata file as the user moves through the process.
declare-context. This directive is specified in the language definition but doesn’t currently have an implementation or any guidance on what to do with the information.
declare-contextrequires the directive deserialiser to support dependently-types arguments. This is because the type of the third argument changes depending on which field is specified in the second argument. This might be useful when implementing the eMail address parser for the
There are plenty more directions that you could take the language and implementation.