KGI-Bench Data Generation

This document describes how to build domain-specific KGI-Bench (Knowledge Graph Integration Benchmark) datasets from DBpedia data and Text2KGBench subontologies.

The pipeline lives in resources/text2kgbench/ and produces overlapping multi-source benchmark bundles in the same style as the movie-KG integration benchmark: a hidden reference graph per split, plus shaded seed and RDF source graphs for integration evaluation.

Overview

flowchart LR
  A[DBpedia NT dump] --> B[generate_refac.py]
  C[Text2KGBench ontology TTL] --> B
  B --> D[property_filter NT]
  D --> E[subgraph.py]
  E --> F[reachable NT]
  F --> G[splits.py]
  G --> H[reference bundles]
  H --> I[sources.py]
  I --> J[seed + RDF sources]

| Step | Script | Input | Output |
|---|---|---|---|
| 1 | `generate_refac.py` | Full DBpedia dump + ontology TTL | Property- and class-filtered subgraph |
| 2 | `subgraph.py` | Property-filtered NT | Class-rooted reachable subgraph |
| 3 | `splits.py` | Reachable NT | Overlapping splits with reference KGs |
| 4 | `sources.py` | Splits directory | Shaded seed and RDF source graphs |

Run the scripts from the repository root with uv run resources/text2kgbench/<script>.py ..., as in the examples below. Alternatively, cd into resources/text2kgbench and call uv run <script>.py ... with the data paths adjusted accordingly.

Prerequisites

DBpedia dump as N-Triples (.nt or .nt.bz2). A pre-filtered dump such as selected.nt.bz2 from the DBpedia multi-source KG pipeline works well as input.
Text2KGBench ontology in Turtle (.ttl). Ontologies map OWL classes and properties to DBpedia URIs via owl:equivalentClass / owl:equivalentProperty. They are typically named like 13_food_ontology.ttl.
Java 17 for local Spark runs (sdk use java 17.0.16-tem or equivalent).
Python environment managed with uv from the ODIBEL project root.

Available Text2KGBench ontologies

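| ID | Domain |
|---|---|
| 1 | University |
| 2 | Musical work |
| 3 | Airport |
| 4 | Building |
| 5 | Athlete |
| 6 | Politician |
| 7 | Company |
| 8 | Celestial body |
| 9 | Astronaut |
| 10 | Comics character |
| 11 | Mean of transportation |
| 12 | Monument |
| 13 | Food |
| 14 | Written work |
| 15 | Sports team |
| 16 | City |
| 17 | Artist |
| 18 | Scientist |
| 19 | Film |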

For each domain, choose the primary DBpedia class as the root/main class (e.g. dbo:Food for ontology 13, dbo:WrittenWork for ontology 14). This class is used in steps 2 and 3.

Step 1: Ontology-driven property filtering (`generate_refac.py`)

generate_refac.py reads a Text2KGBench ontology and extracts DBpedia classes and properties. It then filters a DBpedia dump to:

Triples whose predicate is in the ontology property set (plus rdf:type and rdfs:label).
Entities whose types fall within the ontology class set.
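
The two filter criteria above can be sketched in plain Python. This is an illustration only, not the code in generate_refac.py (which runs on Spark): the function name filter_triples, the tuple representation of triples, and the decision to apply both criteria to the subject of each triple are assumptions.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def filter_triples(triples, ontology_classes, ontology_properties):
    """Keep triples that pass both the property and the class filter.

    triples: iterable of (subject, predicate, object) strings.
    Hypothetical helper; the real script may combine the filters differently.
    """
    triples = list(triples)
    allowed_predicates = set(ontology_properties) | {RDF_TYPE, RDFS_LABEL}

    # Entities with at least one rdf:type inside the ontology class set.
    typed_entities = {
        s for s, p, o in triples if p == RDF_TYPE and o in ontology_classes
    }

    # A triple survives if its predicate is allowed and its subject is typed.
    return [
        (s, p, o)
        for s, p, o in triples
        if p in allowed_predicates and s in typed_entities
    ]
```

In this sketch the class and property sets would come from the owl:equivalentClass and owl:equivalentProperty mappings in the ontology file; how they are parsed is left out.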

Usage:

uv run resources/text2kgbench/generate_refac.py \
  <input_path> \
  <output_path> \
  <ontology_path> \
  [name_prefix]

Example (food domain, local paths):

uv run resources/text2kgbench/generate_refac.py \
  file://./data/dbpedia-multi-source-kg-data/selected.nt.bz2 \
  ./data/text2kgbench/subgraphs \
  ./data/text2kgbench/kgpipe-ontologies/dbpedia_webnlg/13_food_ontology.ttl

If name_prefix is omitted, it is derived from the ontology filename (e.g. 13_food_ontology.ttl → ont_13).
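
A minimal sketch of how such a prefix could be derived. The only documented behaviour is the example mapping above; the function name derive_name_prefix and the rule "take the leading numeric ID of the filename" are assumptions.

```python
import re
from pathlib import Path


def derive_name_prefix(ontology_path):
    """Return 'ont_<ID>' for a filename such as '13_food_ontology.ttl'.

    Hypothetical helper: assumes the filename starts with a numeric ID
    followed by an underscore.
    """
    match = re.match(r"(\d+)_", Path(ontology_path).name)
    if match is None:
        raise ValueError(f"no leading numeric ID in {ontology_path!r}")
    return f"ont_{match.group(1)}"
```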

Outputs under <output_path>/<name_prefix>_subgraph/:

| File | Description |
|---|---|
| `property_filter/` | Filtered N-Triples (Spark `part-*` text files) |
| `property_filter_schema/` | CSV schema graph derived from the filtered triples |

Existing outputs are skipped on re-run. When using the in-memory engine in later steps, merge property_filter/part-* into a single .nt file first (see step 2).

Step 2: Reachable subgraph extraction (`subgraph.py`)

subgraph.py takes the property-filtered graph and extracts the class-rooted reachable subgraph. Starting from all entities of the root class (e.g. dbo:Food), it follows resource edges to include indirectly connected entities (cities, countries, etc.) and keeps all triples whose subject is in that reachable set.
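
The traversal described above can be sketched as a breadth-first search. This is a simplified, forward-only illustration rather than the script's implementation: the function name reachable_subgraph, the tuple representation of triples, and the choice not to follow rdf:type edges are assumptions.

```python
from collections import deque

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def reachable_subgraph(triples, root_class, max_hops=None):
    """Return all triples whose subject is reachable from root-class entities.

    triples: iterable of (subject, predicate, object) strings.
    Hypothetical helper; forward direction only.
    """
    triples = list(triples)
    subjects = {s for s, _, _ in triples}

    # Forward adjacency: follow an object only if it is itself described,
    # which skips literals. rdf:type edges are not followed (assumption).
    adjacency = {}
    for s, p, o in triples:
        if p != RDF_TYPE and o in subjects:
            adjacency.setdefault(s, set()).add(o)

    # Start from every entity typed with the root class.
    roots = {s for s, p, o in triples if p == RDF_TYPE and o == root_class}
    hops = {entity: 0 for entity in roots}
    queue = deque(roots)
    while queue:
        entity = queue.popleft()
        if max_hops is not None and hops[entity] >= max_hops:
            continue  # hop budget exhausted for this branch
        for neighbour in adjacency.get(entity, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[entity] + 1
                queue.append(neighbour)

    return [t for t in triples if t[0] in hops]
```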

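Optional preprocessing: for English-only benchmarks, drop triples whose literal carries a non-English language tag before this step. The command below also merges the Spark part files into a single N-Triples file. Note that the pattern matches any @ followed by lowercase letters other than en, so it also removes lines where such text appears elsewhere, for example an e-mail address inside a literal: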

cat property_filter/* \
  | grep -Pv '@(?!en\b)[a-z]+' \
  > property_filter_clean.nt

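A stricter variant of the language filter can be sketched in Python. It is not equivalent to the grep command above: it inspects only the language tag that closes the object literal, so an @ inside a URI or literal does not cause a line to be dropped. The names LANG_TAG, keep_line and filter_english are hypothetical.

```python
import re

# Language tag directly after the closing quote of the object literal,
# e.g.  "Pizza"@it .   (primary subtag captured, region subtags allowed)
LANG_TAG = re.compile(r'"@([A-Za-z]+)(?:-[A-Za-z0-9]+)*\s*\.\s*$')


def keep_line(line):
    """True for triples without a language tag or with an English one."""
    match = LANG_TAG.search(line)
    return match is None or match.group(1).lower() == "en"


def filter_english(lines):
    """Return only the N-Triples lines that keep_line accepts."""
    return [line for line in lines if keep_line(line)]
```
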
Usage:

uv run resources/text2kgbench/subgraph.py \
  <input_path> \
  <output_path> \
  --root-class <dbo:Class or full URI> \
  [--engine memory|spark] \
  [--max-hops N] \
  [--direction forward|backward|both]

Example:

uv run resources/text2kgbench/subgraph.py \
  ./data/text2kgbench/subgraphs/ont_13_subgraph/property_filter_clean.nt \
  ./data/text2kgbench/subgraphs/ont_13_subgraph/reachable.nt \
  --root-class dbo:Food \
  --engine memory

--engine memory (default): in-process adjacency traversal; suitable for medium graphs on a single machine.
--engine spark: distributed traversal via rDF2 on Spark; use for large graphs or cluster runs.

The output path must not already exist.

Step 3: Overlapping splits and reference graphs (`splits.py`)

splits.py builds overlapping entity splits in the movie-KG style. It samples seed entities from the main class, creates several subsets with controlled overlap, and writes a reference knowledge graph per split by reachability expansion from each seed set.

Usage:

uv run resources/text2kgbench/splits.py \
  <input_path> \
  <output_dir> \
  --main-class <dbo:Class or full URI> \
  --subset-size <N> \
  [--num-subsets 4] \
  [--overlap-ratio 0.04] \
  [--engine memory|spark] \
  [--main-class-scope reachable|seeds] \
  [--max-hops N] \
  [--seed 42]

Example:

uv run resources/text2kgbench/splits.py \
  ./data/text2kgbench/subgraphs/ont_14_subgraph/reachable.nt \
  ./data/text2kgbench/splits/ont_14_writtenwork \
  --main-class dbo:WrittenWork \
  --num-subsets 4 \
  --overlap-ratio 0.04 \
  --subset-size 2500 \
  --engine memory

Key parameters:

--subset-size: number of main-class seed entities per split (required).
--overlap-ratio: target fraction of shared seeds between split pairs (default 0.04).
--main-class-scope: one of
  reachable (default): reference subgraph includes all entities reachable from seeds.
  seeds: only seed main-class entities are kept in the induced subgraph.
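
One way to obtain a fixed pairwise overlap is to place a small shared pool of seeds in every subset. The sketch below illustrates that scheme under stated assumptions; the sampling strategy that splits.py actually uses is not described here, and the function name overlapping_subsets is hypothetical.

```python
import random


def overlapping_subsets(entities, subset_size, num_subsets=4,
                        overlap_ratio=0.04, seed=42):
    """Sample num_subsets seed sets that pairwise share the same seeds.

    Hypothetical scheme: round(overlap_ratio * subset_size) seeds are put
    into every subset, the remaining seeds of each subset are unique.
    """
    n_shared = round(overlap_ratio * subset_size)
    n_unique = subset_size - n_shared
    needed = n_shared + num_subsets * n_unique

    pool = sorted(set(entities))  # sorted for reproducibility
    if needed > len(pool):
        raise ValueError(f"need {needed} entities, only {len(pool)} available")

    rng = random.Random(seed)
    sample = rng.sample(pool, needed)
    shared = set(sample[:n_shared])

    subsets = []
    for i in range(num_subsets):
        start = n_shared + i * n_unique
        subsets.append(shared | set(sample[start:start + n_unique]))
    return subsets
```

With --subset-size 2500 and --overlap-ratio 0.04, this scheme would share 100 seeds between any two subsets.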

Output layout:

<output_dir>/
  entities/master_entities.csv          # all main-class entities in input
  split_0/
    index/entities.csv                  # seed entities for this split
    kg/reference/
      data.nt                           # per-split reference graph
      data_agg.nt                       # cumulative reference (splits 0..N)
      meta/verified_entities.csv
  split_1/
    ...

The output directory must not already exist. The script prints seed and subgraph overlap statistics on completion.
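
Before running step 4, it can help to confirm that the layout above is complete. The following check is a sketch based only on the file names listed in the layout; the helper name missing_outputs is hypothetical and not part of the pipeline.

```python
from pathlib import Path

# Files expected inside every split_N directory, per the layout above.
EXPECTED_PER_SPLIT = [
    "index/entities.csv",
    "kg/reference/data.nt",
    "kg/reference/data_agg.nt",
    "kg/reference/meta/verified_entities.csv",
]


def missing_outputs(output_dir):
    """Return the relative paths of expected files that do not exist."""
    root = Path(output_dir)
    missing = []
    if not (root / "entities" / "master_entities.csv").is_file():
        missing.append("entities/master_entities.csv")

    split_dirs = sorted(p for p in root.glob("split_*") if p.is_dir())
    for split_dir in split_dirs:
        for rel in EXPECTED_PER_SPLIT:
            if not (split_dir / rel).is_file():
                missing.append(f"{split_dir.name}/{rel}")
    return missing
```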

Step 4: Shaded source graphs (`sources.py`)

sources.py derives integration sources from the reference bundles produced by splits.py. It rewrites entity and ontology URIs into shaded namespaces (same approach as resources/movie-multi-source-kg/generate.py):

| Output | Namespace pattern | Role |
|---|---|---|
| `split_N/kg/seed/data.nt` | `http://kg.org/resource/{md5}` | Seed KG source |
| `split_N/sources/rdf/data.nt` | `http://kg.org/rdf/N/...` | Split-scoped RDF source |
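
The seed namespace pattern can be illustrated with a short sketch. It assumes that the {md5} placeholder is the hex MD5 digest of the original entity URI encoded as UTF-8; what sources.py actually hashes is not stated here. The helper names are hypothetical, and the rewriting of ontology URIs and of the RDF source namespace is left out.

```python
import hashlib

SEED_NAMESPACE = "http://kg.org/resource/"


def shade_entity_uri(uri):
    """Map an entity URI into the seed namespace (assumed hashing scheme)."""
    digest = hashlib.md5(uri.encode("utf-8")).hexdigest()
    return SEED_NAMESPACE + digest


def shade_triples(triples, entities):
    """Rewrite subject and object URIs that belong to the given entity set.

    Predicates and literals are passed through unchanged in this sketch.
    """
    mapping = {uri: shade_entity_uri(uri) for uri in entities}
    return [(mapping.get(s, s), p, mapping.get(o, o)) for s, p, o in triples]
```

Because the mapping is deterministic, the same entity receives the same shaded URI in every split, which keeps overlapping entities aligned across seed graphs.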

Usage:

uv run resources/text2kgbench/sources.py <splits_dir> [--split N] [--overwrite]

Example:

uv run resources/text2kgbench/sources.py \
  ./data/text2kgbench/splits/ont_14_writtenwork

Each split also gets meta/verified_entities.csv under kg/seed/ and sources/rdf/.

End-to-end example (ontology 13: Food)

# 1. Filter DBpedia by ontology classes and properties (Spark; writes a part-file directory)
uv run resources/text2kgbench/generate_refac.py \
  file://./data/dbpedia-multi-source-kg-data/selected.nt.bz2 \
  ./data/text2kgbench/subgraphs \
  ./data/text2kgbench/kgpipe-ontologies/dbpedia_webnlg/13_food_ontology.ttl

# 1b. Merge Spark part files into one NT for the in-memory engine (optional @en filter)
cat ./data/text2kgbench/subgraphs/ont_13_subgraph/property_filter/* \
  | grep -Pv '@(?!en\b)[a-z]+' \
  > ./data/text2kgbench/subgraphs/ont_13_subgraph/property_filter.nt

# 2. Extract reachable subgraph from dbo:Food
uv run resources/text2kgbench/subgraph.py \
  ./data/text2kgbench/subgraphs/ont_13_subgraph/property_filter.nt \
  ./data/text2kgbench/subgraphs/ont_13_subgraph/reachable.nt \
  --root-class dbo:Food \
  --engine memory

# 3. Build four overlapping splits
uv run resources/text2kgbench/splits.py \
  ./data/text2kgbench/subgraphs/ont_13_subgraph/reachable.nt \
  ./data/text2kgbench/splits/ont_13_food \
  --main-class dbo:Food \
  --subset-size 2500 \
  --engine memory

# 4. Generate shaded seed and RDF sources
uv run resources/text2kgbench/sources.py \
  ./data/text2kgbench/splits/ont_13_food

Running on a Spark cluster

For large DBpedia dumps, run generate_refac.py via spark-submit. Build wheels and dependencies for Python 3.10 (the default in the project Spark image), even if local development uses 3.12:

SPARK_PYTHON=3.10

uv build --python $SPARK_PYTHON

uv export --format requirements.txt --no-dev --no-hashes \
  --prune pyspark --no-emit-package pyodibel --python $SPARK_PYTHON \
  -o /tmp/spark-deps.txt
uv pip install --python $SPARK_PYTHON --target dist/deps --python-version $SPARK_PYTHON \
  -r /tmp/spark-deps.txt
cd dist/deps && zip -r ../deps.zip . && cd ../..

spark-submit \
  --master yarn \
  --py-files dist/pyodibel-0.1.0-py3-none-any.whl,dist/deps.zip \
  --driver-memory 8g \
  --executor-memory 16g \
  resources/text2kgbench/generate_refac.py \
  hdfs:///path/to/selected.nt.bz2 \
  hdfs:///path/to/subgraphs \
  hdfs:///path/to/13_food_ontology.ttl

Use --engine spark on subgraph.py and splits.py when the intermediate graphs are too large for in-memory processing.

Environment variables

Scripts accept paths via CLI arguments or a .env file in resources/text2kgbench/:

| Variable | Used by |
|---|---|
| `INPUT_PATH` | `generate_refac.py`, `subgraph.py`, `splits.py` |
| `OUTPUT_PATH` | `generate_refac.py`, `subgraph.py`, `splits.py`, `sources.py` |
| `ROOT_CLASS` | `subgraph.py` |
| `MAIN_CLASS` | `splits.py` |
| `SUBSET_SIZE` | `splits.py` |
| `SPLITS_DIR` | `sources.py` |
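
A common way to combine CLI arguments with environment defaults is to use the environment value as the argparse default. The sketch below shows this for a subgraph.py-style interface. It is not taken from the scripts: the precedence (CLI over environment) is an assumption, and loading the .env file into the environment is assumed to happen beforehand.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser whose defaults come from environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="hypothetical subgraph CLI")
    parser.add_argument("input_path", nargs="?", default=env.get("INPUT_PATH"))
    parser.add_argument("output_path", nargs="?", default=env.get("OUTPUT_PATH"))
    parser.add_argument("--root-class", default=env.get("ROOT_CLASS"))
    return parser


def parse_args(argv, env=None):
    """Parse argv, falling back to env values, and reject missing settings."""
    parser = build_parser(env)
    args = parser.parse_args(argv)
    missing = [
        name
        for name in ("input_path", "output_path", "root_class")
        if getattr(args, name) is None
    ]
    if missing:
        parser.error("missing settings: " + ", ".join(missing))
    return args
```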

Related scripts

sampling.py: optional downsampling of an NT file while preserving class/degree or relation distributions. Use it when a reachable subgraph is still too large before splitting.
crawl.py: fetches web resources for source acquisition (separate from the RDF pipeline above).

Further cluster-specific notes are in resources/text2kgbench/athena-ops/README.md.
