# CSV annotation prototype with SysML v2 and SysIDE

This proposal illustrates the use of prototype software developed
at Sensmetry, based on our (more mature) SysML v2 offering.

The core concept of the proposal is to “invert” the problem of describing
CSV columns. Rather than annotating or mapping columns to information,
the user describes a model (in the SysML v2 sense of the word) and then
annotates certain model elements as consisting of instances found in one
or more CSV files.

Schematically, rather than annotating the four columns of

```
international title,   director, country, copyright year
            Stalker,  Tarkovsky,    USSR, 1979
            Solaris, Soderbergh,     USA, 2002
                ...,        ...,     ...,  ...
```

to indicate their meaning, the user describes the contents (in simplified
notation):

```
films : Film {
   name

   directed_by : Director {
      name
   }

   country_of_origin : Country {
      name
   }

   year : Year
}
```

and then adds information to indicate how to find instances of each in the
dataset:

```
films : Film {
   @columns("international title", "copyright year")

   name @columns("international title")

   directed_by : Director {
      @columns("director")

      name @columns("director")
   }

   country_of_origin : Country {
      @columns("country")

      name @columns("country")
   }

   year : Year @columns("copyright year")
}
```

The above says that individual films correspond to unique combinations of
values in the "international title" and "copyright year" columns, where the
former contributes the "name" and the latter contributes the "year".
Directors correspond to unique values of the "director" column, and
co-occurrence on the same row links a film to the corresponding director
through the "directed_by" feature.

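The instance-extraction rule described above can be sketched in plain Python.
This is a minimal illustration under stated assumptions, not the prototype's
actual implementation: the helper names `extract_instances` and
`extract_links` are hypothetical, and one extra row (Tarkovsky's 1972
*Solaris*) is added to the sample data to show why the title alone would not
identify a film.

```python
import csv
import io

# Sample rows from the table above, plus one added row (Solaris, 1972)
# that is NOT in the original example; it is here for illustration only.
CSV_TEXT = """international title,director,country,copyright year
Stalker,Tarkovsky,USSR,1979
Solaris,Soderbergh,USA,2002
Solaris,Tarkovsky,USSR,1972
"""


def extract_instances(rows, key_columns):
    """Return one key per unique combination of values in key_columns."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(tuple(row[c] for c in key_columns) for row in rows))


def extract_links(rows, source_columns, target_columns):
    """Link source and target instances that co-occur on the same row."""
    return list(dict.fromkeys(
        (tuple(row[c] for c in source_columns),
         tuple(row[c] for c in target_columns))
        for row in rows
    ))


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

film_key = ["international title", "copyright year"]
films = extract_instances(rows, film_key)
directors = extract_instances(rows, ["director"])
directed_by = extract_links(rows, film_key, ["director"])

print(films)        # three films: the two "Solaris" rows stay distinct
print(directors)    # two directors: "Tarkovsky" appears only once
print(directed_by)  # one film-to-director link per row
```
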
Linking to semantic web or ontology-type resources is then achieved by
annotating model elements to indicate that instance membership in the
model implies class membership in a resource identified by a certain IRI,
that a feature relating two instances implies a predicate (identified by 
a certain IRI), or that instances equal some resource, the IRI of which
can be derived from features of the instance.
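
The three kinds of annotation can be illustrated with a small Python sketch
that emits N-Triples (which is also valid Turtle). Everything specific here is
an assumption made for illustration: the `http://example.org/` namespace, the
`Film`, `Director` and `directedBy` IRIs, the IRI derivation scheme, and the
helper names are all hypothetical and do not reflect the prototype's actual
output.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, for illustration only
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def film_iri(title, year):
    """Derive a resource IRI from the features that identify a film."""
    return f"{BASE}film/{quote(title)}-{quote(year)}"


def director_iri(name):
    """Derive a resource IRI from the director's name."""
    return f"{BASE}director/{quote(name)}"


def triples_for_row(title, year, director):
    """Return the assertions implied by one film/director co-occurrence."""
    film = film_iri(title, year)
    person = director_iri(director)
    return [
        (film, RDF_TYPE, BASE + "Film"),        # instance membership -> class
        (person, RDF_TYPE, BASE + "Director"),  # instance membership -> class
        (film, BASE + "directedBy", person),    # feature -> predicate
    ]


def to_ntriples(triples):
    """Serialise IRI-only triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = triples_for_row("Stalker", "1979", "Tarkovsky")
print(to_ntriples(triples))
```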

This proposal implements the above idea, using SysML v2 (textual syntax)
as the modelling language, SysIDE Editor (free and open source) or
SysIDE Modeller for authoring, and custom tooling built on top of
SysIDE Automator for processing the user-authored model into useful
derived artefacts.

The tooling currently produces two primary derived artefacts from the
user model: documentation describing the columns and a materialised RDF
(Turtle) file with all the assertions implied by the user model and the
provided CSV file.