# Semantic Mapping Format for CSV-to-JSON Transformation

This document defines a **declarative mapping format** that enables semantically rich transformation of flat CSV data into structured JSON representations aligned with the goals of the **OMG Semantic Augmentation Challenge**.

Rather than relying on procedural logic, this format allows you to express **conceptual model alignment**, **identifier binding**, and **value coercion** in a compact, portable way that supports reuse and automation.

---

## Why this Format?

This mapping syntax is designed to:

- Link flat tabular data to **nested semantic structures**
- Support **identifier-based merging and grouping**
- Define **schema-aware coercion** (string, number, boolean, date)
- Enable downstream use in JSON-LD, JSON Schema, and linked data
- Be interpretable by both humans and machines
- Allow alignment with **ontology-based models** (e.g. FIBO)

---

## Mapping Format Overview

The mapping is a JSON object where:

- **Keys** are **semantic paths** describing the structure of the output.
- **Values** describe how to extract or compute values from CSV rows.

Each mapping entry looks like this:

```json
"path.to.target": {
  "column": "CSV_COLUMN_NAME",
  "type": "string" // or number, boolean, date
}
```

Or for constant values:

```json
"path.to.constant": {
  "constant": "Institution"
}
```

---

## Path Syntax

The path syntax supports:

| Pattern                       | Purpose                                                       |
|-------------------------------|---------------------------------------------------------------|
| `object.property`             | Set a nested property                                         |
| `array[false]`                | Always append to the array                                    |
| `array[$.key == @COLUMN]`     | Find and reuse existing array items, or create if no match    |
| `map[{@COLUMN}]`              | Use a row value as a key in an object map. Create or match    |

---

## Value Resolution

Each mapping value supports:

| Field     | Meaning                                                  |
|-----------|----------------------------------------------------------|
| `column`  | The CSV column name                                      |
| `type`    | `"string"`, `"number"`, `"boolean"`, `"date"` — coercion |
| `constant`| A literal value                                          |

## Data Type Coercion

The mapping format supports coercing CSV strings into strongly typed JSON values using a `"type"` field. This ensures compatibility with schema validation, semantic tooling, and downstream processing.

Each type is interpreted as follows:

### `"string"`

Default behavior — no coercion is applied:

    { "column": "NAME", "type": "string" }

---

### `"number"`

- Trims leading zeros (e.g. `"007"` → `7`)
- Parses as a floating-point number
- Returns `undefined` for empty strings or `"NULL"`

    { "column": "AGE", "type": "number" }

---

### `"boolean"`

- Accepts the following truthy values: `"TRUE"`, `"1"`
- Accepts the following falsy values: `"FALSE"`, `"0"`
- Comparison is case-insensitive and trims whitespace
- Invalid or empty values return `undefined`

    { "column": "ENABLED", "type": "boolean" }

---

### `"date"`

Supports two formats:

1. **Excel-style serial numbers** (e.g. `45401` → `"2024-05-04"`)
2. **Plain date strings** (e.g. `"01/01/1890"`)

Output is always ISO-8601: `"YYYY-MM-DD"`

    { "column": "ESTYMD", "type": "date" }

---

### `"constant"`

You can skip `column` entirely and use a constant:

    { "constant": "Branch" }

This is useful for setting fixed `"type"` values or flags within the output.

---

### Undefined and Null Handling

If a value is empty (`''`) or the string `"NULL"`, it is treated as `undefined` and excluded from the output entirely.


---

## Strategic benefits 

This format has a number of strategic benefits:

- Captures semantic **structure and identity**
- Aligns flat data to **knowledge graph–ready** shapes
- Integrates cleanly with **JSON-LD contexts** and vocabularies
- Enables validation via **JSON Schema** 
- Facilitates automation in **ETL, modeling, and governance** pipelines

By using declarative mappings and semantic paths, this approach turns traditional tabular data into structured, interoperable knowledge assets.


---

## Example 

This input CSV

```csv
ID,NAME,AGE,ACTIVE,START
001,Alice,30,TRUE,44561
002,Bob,25,false,01/01/2020
003,Carol,NULL,,NULL

```

When converted with this mapping:

```json
{
    "type[]": { "constant": "Document" },
    "people[{@ID}].id": { "column": "ID", "type": "string" },
    "people[{@ID}].name": { "column": "NAME", "type": "string" },
    "people[{@ID}].age": { "column": "AGE", "type": "number" },
    "people[{@ID}].active": { "column": "ACTIVE", "type": "boolean" },
    "people[{@ID}].startDate": { "column": "START", "type": "date" }
}
```

Produces this JSON:
```json
{
  "type": [
    "Document"
  ],
  "people": {
    "001": {
      "id": "001",
      "name": "Alice",
      "age": 30,
      "active": true,
      "startDate": "2021-12-31"
    },
    "002": {
      "id": "002",
      "name": "Bob",
      "age": 25,
      "active": false,
      "startDate": "2019-12-31"
    },
    "003": {
      "id": "003",
      "name": "Carol"
    }
  }
}

```

This mapping:

- Uses the syntax people[{@ID}] to produce a map where each key is the row’s ID, and the value is the corresponding person object.
- Assigns structured fields such as `id`, `name`, `age`, `active`, and `startDate` using the corresponding CSV columns
- Applies type coercion:
  - `"age"` is parsed as a number
  - `"active"` is interpreted as a boolean (e.g. `"TRUE"` → `true`, `"false"` → `false`)
  - `"startDate"` is parsed as a date, supporting both Excel serial numbers and human-readable formats
- Skips any fields with `"NULL"` or empty values to avoid populating invalid or meaningless data
- Annotates the document with a top-level `type` of `"Document"` via a constant assignment


Observations: 
- "type" is a constant applied at the root.
- "people" is keyed by the value of ID.
- "age", "active", and "startDate" are omitted for Carol because they are "NULL" or blank and thus coerced to undefined.
- Excel serial 44561 (row 1) becomes "2021-12-31".
- "01/01/2020" string is parsed as "2020-01-01" in ISO format. 
- "false" (case-insensitive) becomes boolean false.