# OMG Semantic Augmentation Challenge

## Introduction

At [Zazuko](https://zazuko.com/), we have worked with RDF and data integration for many years, [building pipelines](https://github.com/zazuko/xrm-csv-workflow), [tooling](https://zazuko.com/tools/), and best practices for bringing data into well-defined, interoperable structures. Over time, we have seen many approaches to “mapping” and “documenting” data, from raw [R2RML](https://www.w3.org/TR/r2rml/) to graphical UI-based mapping tools.

From our perspective, while graphical or UI-based mapping tools have their place, they often become opaque, hard to track in version control, and cumbersome in complex, real-world data landscapes. We believe in open standards and text-based representations for mappings and models, ensuring transparency, reproducibility, and traceability with tools like Git and DevOps pipelines.

We love RDF and leverage the RDF stack in many of our projects, but it is important to clarify that the goal of the [OMG Semantic Augmentation Challenge](https://www.omg.org/events/omg-semantic-augmentation-challenge/) is documentation, not the generation of RDF itself. In our approach, RDF is a beneficial side effect of a well-defined, semantically rich documentation process.

One key insight from our experience is that while R2RML effectively maps relational data to RDF vocabularies, the data model it implies remains scattered across mapping rules and is not explicitly stated as a coherent, inspectable model. You can deduce parts of the model from R2RML mappings, but the result is neither complete nor convenient for validation, documentation, or visualization.

With [ELMO](https://zazuko.github.io/elmo-showcase-pages/showcase/) and [LinkML](https://linkml.io/), we explicitly state the model—or more precisely, a **metamodel describing classes, slots, cardinalities, and semantics**—in a structured way that aligns with how we think about data. This explicit model enables direct generation of [SHACL](https://www.w3.org/TR/shacl/) constraints, UML-like diagrams, and structured documentation, and can act as a clear contract for data producers and consumers.

In this sense, ELMO and LinkML allow us to **move from implicit, scattered structure to explicit, reusable models**, while retaining the ability to generate RDF, SHACL, and R2RML as operational outputs. This is crucial, as it enables validation, documentation, and true semantic augmentation, rather than only focusing on data transformation. We believe that in the future, ELMO could generate mappings automatically, making it possible to work within a **single DSL** for both documentation and mapping generation.

A **[domain-specific language](https://en.wikipedia.org/wiki/Domain-specific_language) (DSL)** in this context means a lightweight, human-readable text format tailored to expressing data models and mappings clearly, making them easy to maintain, version, and generate structured outputs from.

## Purpose of This Submission

This submission demonstrates a **scalable, model-driven workflow** for semantic augmentation of a CSV dataset using standards and open tooling. The focus is on **documenting and describing the data model**, with RDF generation being a byproduct of the process.

We built this pipeline to:

- Leverage well-established standards like R2RML and SHACL while improving the workflow through high-level DSLs and LinkML tooling.
- Avoid manually writing raw RDF or YAML, instead using DSLs for clearer, maintainable, and versionable representations.
- Explore ELMO as a **model-agnostic DSL for documentation**, outputting LinkML YAML for generation of RDF, SHACL, JSON Schema, and other representations.
- Demonstrate how this approach can scale to large datasets while maintaining transparency and alignment with semantic standards.

Additionally, we used ChatGPT extensively during this submission to accelerate the PoC mapping and documentation workflow. We have found that LLMs work exceptionally well with DSL-based approaches, providing high accuracy in syntax adherence and structured generation of repetitive definitions.

## Technical Workflow

We applied the following steps to complete this semantic augmentation workflow:

1. **Source Data**: We started with CSV data containing FDIC insured bank data.
2. **ELMO DSL Modeling**: We modeled the semantic structure in ELMO, which is model-agnostic and avoids the need to manually write YAML or RDF while capturing the semantics clearly.
3. **LinkML YAML Generation**: From ELMO, we generated LinkML YAML schemas, leveraging the LinkML ecosystem for validation, generation, and SHACL output.
4. **Validation**:
   - Used `linkml lint` to check schema syntax.
   - Used `linkml validate` to confirm data adherence to the schema.
5. **Generation of Semantic Artifacts**:
   - RDF (Turtle) using `linkml generate rdf`.
   - SHACL using `linkml generate shacl` (see the illustrative snippet at the end of this section).
   - [PlantUML](https://plantuml.com/) and [Mermaid](https://mermaid.js.org/) diagrams for model visualization, using `linkml generate plantuml` and `linkml generate erdiagram`.
6. **Materialization**:
   - [DuckDB](https://duckdb.org/) was used for high-performance CSV parsing.
   - R2RML generated via the XRM DSL was executed using [Ontop VKG](https://ontop-vkg.org/) to efficiently materialize RDF from DuckDB at scale, handling billions of rows when needed.
   - This pipeline builds on our battle-tested, open-source ETL tooling, which we have developed and extended over many years to enable reproducible, CI/CD-driven data workflows for our clients, ensuring data pipelines are maintainable, transparent, and scalable.

For reference, we maintain a [template repository](https://github.com/zazuko/xrm-csv-workflow) illustrating this approach. This workflow is **scalable and practical**, capable of producing hundreds of thousands of triples per second on commodity hardware.
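To give a flavor of the generated artifacts, the snippet below sketches roughly what a shape produced by `linkml generate shacl` can look like. It is a minimal, hand-written illustration rather than an excerpt from `model/omg-shacl.ttl`; the `ex:` namespace, the `Bank` class, and the property names are hypothetical placeholders.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <https://example.org/omg/> .     # hypothetical namespace

# Illustrative node shape: every ex:Bank needs exactly one integer
# certificate number and at least one name.
ex:BankShape
    a sh:NodeShape ;
    sh:targetClass ex:Bank ;                  # hypothetical class
    sh:property [
        sh:path ex:certificateNumber ;        # hypothetical slot
        sh:datatype xsd:integer ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path ex:name ;                     # hypothetical slot
        sh:datatype xsd:string ;
        sh:minCount 1 ;
    ] .
```

Validating the materialized RDF against shapes like this closes the loop between the documented model and the generated data.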
## Why This Approach Matters

Over the years, we have learned what works in real-world semantic data integration and what does not:

- **UI-based mapping tools rarely scale well** for complex, evolving data models.
- **Text-based DSLs integrated with Git workflows allow teams to collaborate effectively, track changes, and ensure transparency**.
- **Semantic documentation should not only be a means to produce RDF but should itself be the primary artifact**, ensuring clarity, validation, and reusability.
- Using **LLMs as co-pilots with DSL workflows accelerates PoC mapping and documentation tasks dramatically**, especially when dealing with unfamiliar domains.
- This workflow demonstrates how **RDF generation can be a “side effect” of proper documentation and modeling**, closing the loop for semantic augmentation and interoperability.

## Files Included

### Model

- [`model/omg.elmo`](model/omg.elmo): ELMO DSL model defining the semantic structure.
- [`model/omg.yaml`](model/omg.yaml): LinkML YAML schema generated from ELMO, used for documentation and validation.
- [`model/omg.ttl`](model/omg.ttl): Generated RDF (Turtle) representation of the data.
- [`model/omg-shacl.ttl`](model/omg-shacl.ttl): Generated SHACL constraints for validation.
- [`model/omg.puml`](model/omg.puml): PlantUML diagram visualizing the semantic structure.
- [`model/omg.mermaid.md`](model/omg.mermaid.md): Mermaid ER diagram as an alternative visualization.

### Mappings

- [`mappings/mappings.xrm`](mappings/mappings.xrm): XRM mapping definitions for R2RML generation.
- [`mappings/sources.xrm`](mappings/sources.xrm): XRM source definitions describing CSV column structures. This file has to be adjusted whenever the CSV headers change.
- [`mappings/vocabularies.xrm`](mappings/vocabularies.xrm): XRM vocabulary definitions for consistent IRI management.
- [`src-gen/mappings.r2rml.ttl`](src-gen/mappings.r2rml.ttl): R2RML file generated from XRM, ready for use with Ontop VKG or other R2RML processors (an illustrative sketch follows this list).
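For readers unfamiliar with R2RML, the following hand-written sketch shows the general shape of the triples maps that XRM generates into `src-gen/mappings.r2rml.ttl`: each map reads rows from a table (here, the CSV exposed through DuckDB) and turns them into typed RDF resources. The table name, IRI template, column names, and `ex:` namespace are hypothetical placeholders, not the actual mapping.

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <https://example.org/omg/> .        # hypothetical namespace

# Illustrative triples map: one row of the (hypothetical) "banks" table
# becomes one ex:Bank resource with a name and a certificate number.
<#BankMapping>
    a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "banks" ] ;   # table exposed via DuckDB
    rr:subjectMap [
        rr:template "https://example.org/omg/bank/{CERT}" ;
        rr:class ex:Bank ;
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [ rr:column "NAME" ] ;
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:certificateNumber ;
        rr:objectMap [ rr:column "CERT" ; rr:datatype xsd:integer ] ;
    ] .
```

Ontop executes mappings of this shape against the DuckDB connection to materialize the triples, as described in the workflow above.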
## Our Journey with XRM

An important part of our journey was the development of **[XRM](https://github.com/zazuko/expressive-rdf-mapper)**, our DSL for scalable and maintainable R2RML, RML, and CSVW generation. XRM emerged after we found raw R2RML authoring too error-prone and cumbersome for large-scale, real-world data projects. We also experimented briefly with [CSVW](https://www.w3.org/TR/tabular-data-primer/), which shares the intention of providing metadata-rich, structured documentation for CSV data. However, CSVW has not seen widespread adoption in practice, partly because its JSON-based syntax is not particularly pleasant for human authorship (XRM can output CSVW mappings as well).

XRM served as an important intermediate step, allowing us to validate the effectiveness of DSLs for mapping generation and maintenance in production environments. Our clients adopted XRM rapidly, using it to maintain and generate R2RML mappings at scale; the DSL approach provides clarity, validation, and ease of maintenance. This experience confirmed that DSLs can play a significant role in RDF-centric data workflows, reinforcing our belief in the combination of open standards, structured documentation, and DSL-based authoring as the most sustainable approach.

## LLM Collaboration Note

This submission was developed in close collaboration with ChatGPT, which acted as an efficient co-pilot to:

- Generate syntactically correct DSL structures rapidly.
- Produce semantic field descriptions and slot mappings.
- Draft and refine documentation efficiently.

Using LLMs in conjunction with DSL-based workflows has proven to be a highly effective pattern for mapping- and documentation-heavy semantic augmentation tasks.

## Summary

This submission demonstrates a clear, scalable, and standards-aligned approach to the OMG Semantic Augmentation Challenge. It showcases:

- Clear separation of documentation and transformation concerns.
- Leveraging DSLs for maintainability and transparency.
- Using open-source, scalable tooling for RDF generation as a side effect of well-structured documentation.
- A realistic path forward where semantic augmentation can scale while remaining transparent, traceable, and efficient.

We believe this approach demonstrates what semantic augmentation can look like in practice, aligning with the spirit of the challenge while staying true to what we have learned during years of practical semantic data integration work.

---

**For questions, integration discussions, or collaboration:**

Adrian Gschwend  
Zazuko  
[adrian.gschwend@zazuko.com](mailto:adrian.gschwend@zazuko.com)