# Using R2RML/RML, SPARQL, and Python to transform CSV to RDF

Submission to the __OMG Semantic Augmentation Challenge 2025__ by Ora Lassila. See the file `SubmissionManifest.txt` for bureaucratic details.

Author's note: While my day job is as a Principal Technologist on the Amazon Neptune graph database team, I also run a small photography and publishing operation called [So Many Aircraft](https://www.somanyaircraft.com/). Over the years we have built various tools to make the publishing process easier, including management of a large photo library (60k+ photos) and a large reference library (5,000+ publications); all of this is done via ontologies, RDF, and knowledge graphs.

## General idea

Tools and standards exist to easily transform any data, including CSV-formatted data, into RDF knowledge graphs. A simple application of R2RML (or rather RML) and SPARQL update queries, driven from a simple Python script, has tremendous power. The existing Semantic Web standards are easy to use and apply to problems of data cleansing and transformation. I have taught non-computer scientists to quickly make use of these tools and productively take control of their data. One aim of this demonstration is to show that the existing standards are sufficient.

This demonstration makes use of a Python package called `rdfhelpers` which (among many other features) provides a way to "cascade" RDF manipulation operations into a pipeline. Details are provided below.

## Implementation

The final implementation makes use of our own Python packages:

- `tinyrml` ([repo](https://gitlab.com/somanyaircraft/tinyrml)): An implementation of RML/R2RML as a Python library, allowing source data to be fed into a mapping as any Python iterable object. The package extends RML by allowing Python expressions to be embedded in mappings; the Python compiler is used to compile the expressions before a mapping is applied (also, column names in the source data are available as free variables in expressions).
- `rdfhelpers` ([repo](https://gitlab.com/somanyaircraft/rdfhelpers)): Useful RDF manipulation functionality built on top of the RDFLib framework, including the ability to "cascade" RDF manipulation operations via a "fluent" API (see the [package documentation](https://rdf-helpers.readthedocs.io/en/latest/) for details). The package's class `Composable` implements this API and makes it possible to "compose" RDF manipulation operations, including SPARQL queries (e.g., SPARQL `CONSTRUCT` query results are available to the next operation).

These packages are open source, and are available from the [Python Package Index](https://pypi.org/) for easy installation into any Python project.

This is the general logic of the mapping and transformation pipeline (a concrete sketch follows the list):

1. The contents of the CSV file are read and fed to the R2RML/RML mapping; this functionality comes from the `tinyrml` package.
2. Some namespace declarations are added to make the result serialize neatly.
3. A SPARQL update query removes an anomalous "empty" branch.
4. A SPARQL update query transforms the location-related properties into a new node that is linked to a branch via the `fdicib:address` predicate (the R2RML mapping itself cannot express this).
5. The resulting graph is serialized (as Turtle, but it could easily be any format the underlying RDFLib framework supports).

The demonstration assumes Python version 3.10 or higher. Other requirements are listed in the file `requirements.txt`.
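To make the five steps concrete, here is a minimal sketch of the pipeline written against plain RDFLib. The tinyrml entry point shown (`Mapper` and its `process()` method), the `fdicib` namespace IRI, and the specific property names in the SPARQL updates are illustrative assumptions rather than the submission's actual code; in the real implementation the steps are cascaded with the `Composable` fluent API from `rdfhelpers` instead of being called on a bare `Graph`.

```python
import csv
from rdflib import Graph

# Step 1: read the CSV and feed the rows, as an iterable of dicts, through
# the R2RML/RML mapping. `Mapper` and `process()` are illustrative names;
# consult the tinyrml documentation for the package's actual entry point.
from tinyrml import Mapper

mapping = Graph().parse("mapping.ttl", format="turtle")
with open("branches.csv", newline="") as source:
    graph = Mapper(mapping).process(csv.DictReader(source))

# Step 2: bind prefixes so the Turtle output serializes neatly.
# The fdicib namespace IRI is a placeholder.
graph.bind("fdicib", "http://example.org/fdicib#")

# Step 3: remove the anomalous "empty" branch. The actual test for
# "emptiness" depends on the data; this is one plausible pattern, and
# fdicib:Branch and fdicib:name are stand-in names.
graph.update("""
    PREFIX fdicib: <http://example.org/fdicib#>
    DELETE { ?branch ?p ?o }
    WHERE {
        ?branch a fdicib:Branch ; ?p ?o .
        FILTER NOT EXISTS { ?branch fdicib:name ?name }
    }
""")

# Step 4: gather the location-related properties into a new node hanging
# off the branch via fdicib:address, something the mapping alone cannot
# express. The blank node in the INSERT template creates one fresh address
# node per matching branch; city/state/zip are stand-ins for the real
# location properties.
graph.update("""
    PREFIX fdicib: <http://example.org/fdicib#>
    DELETE { ?branch fdicib:city ?city ; fdicib:state ?state ; fdicib:zip ?zip }
    INSERT { ?branch fdicib:address
                 [ fdicib:city ?city ; fdicib:state ?state ; fdicib:zip ?zip ] }
    WHERE  { ?branch fdicib:city ?city ; fdicib:state ?state ; fdicib:zip ?zip }
""")

# Step 5: serialize; any RDFLib-supported format would do.
graph.serialize(destination="branches.ttl", format="turtle")
```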
With the current mapping (which omits some columns; see below), transforming the roughly 79,000 rows of CSV results in a little over 1.4 million RDF triples. The whole pipeline runs on my (10-year-old, Intel-based) Mac Mini in about 4 minutes. An ontology is provided that models the resulting knowledge graph.

## Additional notes

The documentation for the CSV file was badly lacking, and it was in fact not possible to determine the meaning (and use) of some of the columns. The following columns were not mapped in this demonstration (but could easily be added to the mapping once we discover what they mean; see the sketch below): CBSA_DIV, CBSA_DIV_FLG, CBSA_DIV_NO, CBSA_METRO, CBSA_METRO_FLG, CBSA_METRO_NAME, CBSA_MICRO_FLG, CBSA_NO, CERT, CSA_FLG, CSA_NO, ESTYMD, MAINOFF, MDI_STATUS_CODE, MDI_STATUS_DESC, OFFNUM, SERVTYPE, SERVTYPE_DESC, STCNTY. The following columns were redundant or simply not needed: X, Y, OBJECTID, STNAME.
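As an illustration of how one of the unmapped columns could be picked up later, the sketch below adds a predicate-object map for CBSA_NO using standard R2RML/RML vocabulary. The triples map IRI and the property `fdicib:cbsaNumber` are made up for the example; the real names would come from the actual mapping file and the ontology once the column's meaning is settled.

```python
from rdflib import Graph

# A hypothetical predicate-object map for the CBSA_NO column, written in
# standard R2RML/RML vocabulary. The triples map IRI and the property
# fdicib:cbsaNumber are placeholders.
CBSA_NO_MAP = """
    @prefix rr:     <http://www.w3.org/ns/r2rml#> .
    @prefix rml:    <http://semweb.mmlab.be/ns/rml#> .
    @prefix fdicib: <http://example.org/fdicib#> .

    <http://example.org/mapping#BranchMap>
        rr:predicateObjectMap [
            rr:predicate fdicib:cbsaNumber ;
            rr:objectMap [ rml:reference "CBSA_NO" ]
        ] .
"""

# Merging the fragment into the mapping graph extends the mapping without
# touching the rest of it; the new map is applied on the next pipeline run.
mapping = Graph().parse("mapping.ttl", format="turtle")
mapping.parse(data=CBSA_NO_MAP, format="turtle")
```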