# The 2025 OMG Semantic Augmentation Challenge

## Deliverables

### File that describes what is in each of the columns of the CSV

`csv/FDIC_Insured_Banks.csv-metadata.json`

### Supporting materials

#### How the format works

CSV on the Web (CSVW) - https://csvw.org/ - is a JSON-based metadata format that describes the structure, types and semantics of the data in a CSV file. CSVW was developed as an output of the W3C CSV on the Web Working Group, as a standard way to express metadata about a CSV file and other kinds of tabular data. CSVW is a W3C standard, and a CSVW metadata document can be used to document a CSV file, validate the contents of a CSV file, and also to transform a CSV file to JSON and RDF in a standardised way. See https://www.w3.org/TR/csv2json/ and https://www.w3.org/TR/csv2rdf/.

The UK Government Digital Service recommends the use of CSVW to describe CSV files.

#### How extensible it would be to other formats beyond CSV

CSVW (CSV on the Web), a W3C standard, is primarily designed to describe CSV files using metadata (usually in JSON-LD). Other tabular formats, e.g. TSV and XLSX, can be converted to CSV before applying CSVW. JSON files that encode tabular data could also potentially be transformed to CSV and then annotated with a CSVW metadata document. There is no formal support in the CSVW specification for describing tabular data encoded in JSON or XML, and the validation tools and processors developed to support the standard would not work on such data unless extended.

#### Evidence the mapping file works and how broadly it can be used

The core use case of CSVW is to provide metadata in JSON-LD format to describe a CSV file. It can be used to describe column names and data types, constraints (e.g. required fields) and links to external vocabularies, as well as provenance, licences and annotations.

#### Output generated by the format when used with the source dataset

#### Features or limitations with respect to the mapping file

CSVW can be used to describe:

- columns, including name and datatype
- primary keys and foreign keys
- URIs and RDF vocabularies for columns
- default values, null values and value formatting

CSVW can define how to interpret a CSV file as RDF triples and supports linking columns to RDF vocabularies. It also supports linking individual cell values to URIs.

CSVW supports the description, validation and transformation to RDF of CSV files.

CSVW is a W3C standard, is machine-readable and tool-independent, and can be consumed by web agents and semantic crawlers.

CSVW supports internationalisation through language tags (e.g. "@language") and encodings (e.g. "UTF-8").

A limitation of CSVW is that it only works with CSV or tabular data and cannot be directly applied to JSON, XML or graph data. Another limitation is that relatively few tool implementations of CSVW currently exist. Also, there is no built-in support for dataset versioning or change logs.

#### Processing environment in which the mapping file was run including version

CSV on the Web (CoW) was used to transform the CSVW file to RDF. This was sourced from https://github.com/CLARIAH/COW on 30 June 2025 and is version 2.00.

```
cow_tool build csv/FDIC_Insured_Banks.csv --base=http://example.com/FDIC_Insured_Banks
```

#### How transferable the format is likely to be to other platforms - proprietary and open source

CSVW is a W3C standard, and this lends itself to being transferable to other platforms, both proprietary and open source.
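Because a CSVW metadata document is plain JSON-LD, any platform that can parse JSON can at least read the annotations, independent of the tooling that produced them. As a minimal illustrative sketch of such a document (the column names, property URIs and base URI below are hypothetical, not the actual FDIC schema):

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "csv/FDIC_Insured_Banks.csv",
  "tableSchema": {
    "aboutUrl": "http://example.com/FDIC_Insured_Banks/{_row}",
    "columns": [
      {
        "name": "NAME",
        "titles": "NAME",
        "datatype": "string",
        "required": true,
        "propertyUrl": "http://schema.org/name"
      },
      {
        "name": "ZIP",
        "titles": "ZIP",
        "datatype": "string"
      }
    ]
  }
}
```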
There are several implementations available at this time for transforming CSV files to RDF via CSVW, e.g. csv2rdf, the W3C reference implementation, and CoW. There is also scope for other tools and code libraries, e.g. RDFLib and OpenRefine, to be extended to support CSVW.

#### How the format scales with respect to larger datasets

CSVW uses a JSON-LD metadata file format. How the format scales depends to some extent on the complexity of the metadata: deep nesting and verbose metadata will affect scalability. A more significant issue for scalability is the transformation to RDF, because the two tools mentioned above, CoW and csv2rdf, load the entire dataset into memory. Using a concise, modular metadata file and avoiding complex column definitions is one way to address scalability. Batch processing, or designing a data pipeline for incremental processing, may be a workaround, as sketched below. Alternatively, a CSVW implementation could be developed on top of RDFLib and/or Apache Spark, which scale well to large datasets.
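As a sketch of the modular, batch-oriented approach suggested above: a CSVW table group can split a large dataset across several CSV files while referencing a single shared schema by URL (the CSVW specification allows `tableSchema` to be a URL reference rather than an inline object), keeping each metadata document small. The file names below are illustrative, not part of the actual deliverable:

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [
    { "url": "csv/FDIC_Insured_Banks_part1.csv", "tableSchema": "fdic-schema.json" },
    { "url": "csv/FDIC_Insured_Banks_part2.csv", "tableSchema": "fdic-schema.json" }
  ]
}
```

Each part can then be transformed to RDF independently, so no single run needs to hold the whole dataset in memory.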