# Zazuko Submission for the OMG Semantic Augmentation Challenge

## Introduction

At Zazuko, we have worked with RDF and data integration for many years, building pipelines, tooling, and best practices for bringing data into well-defined, interoperable structures. Over time, we have seen many approaches to “mapping” and “documenting” data, from raw R2RML to graphical UI-based mapping tools.

From our perspective, while graphical or UI-based mapping tools have their place, they often become opaque, hard to track in version control, and cumbersome in complex, real-world data landscapes. We believe in **open standards and text-based representations** for mappings and models, ensuring transparency, reproducibility, and traceability using tools like Git.

We love RDF and leverage the RDF stack in many of our projects, but it is important to clarify that the **goal of the OMG Semantic Augmentation Challenge is documentation, not the generation of RDF itself**. In our approach, **RDF is a beneficial side effect** of a well-defined, semantically rich documentation process.

In essence, RDF provides a strong foundation for answering such questions around data structure and documentation. Done right, with proper augmentation and documentation of metadata and data models, RDF generation can happen “as a side effect” of these documentation efforts.

One key insight from our experience is that traditional R2RML mapping is essentially an **implicit model**, describing “how things connect” without explicitly exposing the underlying conceptual structure. By creating ELMO and, indirectly, SHACL and LinkML-based models, we **make the implicit model explicit**. This is crucial, as it enables validation, documentation, and true semantic augmentation, rather than only focusing on data transformation. We believe that in the future, ELMO could generate mappings automatically, making it possible to work within a **single DSL** for both documentation and mapping generation.

## Purpose of This Submission

This submission demonstrates a **scalable, model-driven workflow** for semantic augmentation of a CSV dataset using standards and open tooling. The focus is on **documenting and describing the data model**, with RDF generation being a byproduct of the process.

We built this pipeline to:

- Leverage well-established standards like R2RML and SHACL while improving the workflow through high-level DSLs and LinkML tooling.
- Avoid manually writing raw RDF or YAML, instead using DSLs for clearer, maintainable, and versionable representations.
- Explore ELMO as a **model-agnostic DSL for documentation**, outputting LinkML YAML for generation of RDF, SHACL, JSON Schema, and other representations.
- Demonstrate how this approach can scale to large datasets while maintaining transparency and alignment with semantic standards.

Additionally, we used ChatGPT extensively during this submission to accelerate the PoC mapping and documentation workflow. We have found that LLMs work exceptionally well with DSL-based approaches, providing high accuracy in syntax adherence and structured generation of repetitive definitions.

## Technical Workflow

We applied the following steps to complete this semantic augmentation workflow (a sketch of the corresponding commands follows the list):

1. **Source Data**: We started with a CSV dataset of FDIC-insured bank data.
2. **ELMO DSL Modeling**: We modeled the semantic structure in ELMO, which is model-agnostic and avoids the need to manually write YAML or RDF while capturing the semantics clearly.
3. **LinkML YAML Generation**: From ELMO, we generated LinkML YAML schemas, leveraging the LinkML ecosystem for validation, generation, and SHACL output.
4. **Validation**:
   - Used `linkml lint` to check schema syntax.
   - Used `linkml validate` to confirm data adherence to the schema.
5. **Generation of Semantic Artifacts**:
   - RDF (Turtle) using `linkml generate rdf`.
   - SHACL using `linkml generate shacl`.
   - PlantUML and Mermaid diagrams for model visualization.
6. **Materialization**:
   - DuckDB was used for high-performance CSV parsing.
   - R2RML generated via the XRM DSL was executed with Ontop VKG to materialize RDF from DuckDB efficiently at scale, handling billions of rows when needed.

This workflow is **scalable and practical**, capable of producing hundreds of thousands of triples per second on commodity hardware.
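For illustration, the numbered steps map roughly onto the following commands. This is a minimal sketch assuming the file names listed under “Files Included” below; the `linkml` subcommands are the ones named above, while the DuckDB check, the diagram generator names, and the Ontop invocation (flags and mapping/properties file names) are illustrative assumptions rather than verbatim excerpts from our pipeline.

```sh
# Step 1: sanity-check the source CSV with DuckDB (file name hypothetical)
duckdb :memory: "SELECT count(*) FROM read_csv_auto('fdic_banks.csv')"

# Step 4: lint the generated LinkML schema, then validate the sample data against it
linkml lint omg.yaml
linkml validate --schema omg.yaml sample_fdic_bank.yaml

# Step 5: generate semantic artifacts from the same schema
linkml generate rdf omg.yaml > omg.ttl
linkml generate shacl omg.yaml > omg-shacl.ttl
gen-plantuml omg.yaml > omg.puml        # diagram generator names vary by LinkML version
gen-erdiagram omg.yaml > omg.mermaid

# Step 6: materialize RDF at scale by running the XRM-generated R2RML
# over DuckDB with Ontop (file names hypothetical; flags may differ by Ontop version)
ontop materialize \
  --mapping omg.r2rml.ttl \
  --properties duckdb.properties \
  --output materialized.ttl
```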
## Why This Approach Matters

Over the years, we have learned what works in real-world semantic data integration and what does not:

- **UI-based mapping tools rarely scale well** for complex, evolving data models.
- **Text-based DSLs integrated with Git workflows allow teams to collaborate effectively, track changes, and ensure transparency**.
- **Semantic documentation should not only be a means to produce RDF but should itself be the primary artifact**, ensuring clarity, validation, and reusability.
- Using **LLMs as co-pilots with DSL workflows accelerates PoC mapping and documentation tasks dramatically**, especially when dealing with unfamiliar domains.
- This workflow demonstrates how **RDF can be a “side effect” of proper documentation and modeling**, closing the loop for semantic augmentation and interoperability.

## Files Included

- `omg.elmo`: Source DSL model in ELMO.
- `omg.yaml`: Generated LinkML YAML schema.
- `sample_fdic_bank.yaml`: Sample instance data.
- `omg.ttl`: Generated RDF (Turtle).
- `omg-shacl.ttl`: Generated SHACL constraints.
- `omg.puml`: PlantUML diagram for structure visualization.
- `omg.mermaid`: Mermaid ER diagram as an alternative visualization.
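These artifacts compose naturally: for example, the generated RDF can be cross-checked against the generated SHACL constraints with any SHACL engine. A minimal sketch using pySHACL, which is an illustrative tool choice on our part and not itself part of the submission:

```sh
# Cross-check the generated RDF against the generated SHACL shapes
# (pySHACL is one option; any conformant SHACL engine works)
pyshacl -s omg-shacl.ttl omg.ttl
```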
## Our Journey with XRM

An important part of our journey was the development of **XRM**, our DSL for scalable and maintainable R2RML, RML, and CSVW generation. XRM emerged after we found raw R2RML authoring too error-prone and cumbersome for large-scale, real-world data projects. We also experimented briefly with CSVW, which shares the intention of providing metadata-rich, structured documentation for CSV data; however, CSVW did not see widespread adoption in practice, partly because its JSON-based syntax is not particularly pleasant for human authorship.

XRM served as an **important intermediate step**, allowing us to validate the effectiveness of DSLs for mapping generation and maintenance in production environments. Our clients adopted XRM rapidly, using it for **maintaining and generating R2RML mappings at scale**, with the DSL approach providing clarity, validation, and ease of maintenance. This experience confirmed that DSLs can play a significant role in RDF-centric data workflows, reinforcing our belief in the combination of open standards, structured documentation, and DSL-based authoring as the most sustainable approach.

## LLM Collaboration Note

This submission was developed in close collaboration with ChatGPT, which acted as an efficient co-pilot to:

- Generate syntactically correct DSL structures rapidly.
- Produce semantic field descriptions and slot mappings.
- Create consistent SHACL and LinkML schemas.
- Draft and refine documentation efficiently.

Using LLMs in conjunction with DSL-based workflows proved to be a **highly effective pattern for mapping- and documentation-heavy semantic augmentation tasks**.

## Summary

This submission demonstrates a clear, scalable, and standards-aligned approach to the OMG Semantic Augmentation Challenge. It showcases:

- Clear separation of documentation and transformation concerns.
- Leveraging DSLs for maintainability and transparency.
- Using open-source, scalable tooling for RDF generation as a side effect of well-structured documentation.
- A realistic path forward where semantic augmentation can scale while remaining transparent, traceable, and efficient.

We believe this approach demonstrates what semantic augmentation can look like in practice, aligning with the spirit of the challenge while staying true to what we have learned during years of practical semantic data integration work.

---

**For questions, integration discussions, or collaboration:**

Adrian Gschwend
Zazuko
[adrian.gschwend@zazuko.com](mailto:adrian.gschwend@zazuko.com)