KGI-Bench Data Generation
This document describes how to build domain-specific KGI-Bench (Knowledge Graph Integration Benchmark) datasets from DBpedia data and Text2KGBench subontologies.
The pipeline lives in resources/text2kgbench/ and produces overlapping multi-source benchmark bundles in the same style as the movie-KG integration benchmark: a hidden reference graph per split, plus shaded seed and RDF source graphs for integration evaluation.
Overview
flowchart LR
A[DBpedia NT dump] --> B[generate_refac.py]
C[Text2KGBench ontology TTL] --> B
B --> D[property_filter NT]
D --> E[subgraph.py]
E --> F[reachable NT]
F --> G[splits.py]
G --> H[reference bundles]
H --> I[sources.py]
I --> J[seed + RDF sources]
| Step | Script | Input | Output |
|---|---|---|---|
| 1 | generate_refac.py | Full DBpedia dump + ontology TTL | Property- and class-filtered subgraph |
| 2 | subgraph.py | Property-filtered NT | Class-rooted reachable subgraph |
| 3 | splits.py | Reachable NT | Overlapping splits with reference KGs |
| 4 | sources.py | Splits directory | Shaded seed and RDF source graphs |
Run scripts from the repository root with uv run resources/text2kgbench/<script>.py .... The examples below are written for the repository root; if you cd into resources/text2kgbench instead, shorten the script paths and adjust the data paths accordingly.
Prerequisites
- DBpedia dump as N-Triples (.nt or .nt.bz2). A pre-filtered dump such as selected.nt.bz2 from the DBpedia multi-source KG pipeline works well as input.
- Text2KGBench ontology in Turtle (.ttl). Ontologies map OWL classes and properties to DBpedia URIs via owl:equivalentClass / owl:equivalentProperty. They are typically named like 13_food_ontology.ttl.
- Java 17 for local Spark runs (sdk use java 17.0.16-tem or equivalent).
- Python environment managed with uv from the ODIBEL project root.
Available Text2KGBench ontologies
| ID | Domain |
|---|---|
| 1 | University |
| 2 | Musical work |
| 3 | Airport |
| 4 | Building |
| 5 | Athlete |
| 6 | Politician |
| 7 | Company |
| 8 | Celestial body |
| 9 | Astronaut |
| 10 | Comics character |
| 11 | Mean of transportation |
| 12 | Monument |
| 13 | Food |
| 14 | Written work |
| 15 | Sports team |
| 16 | City |
| 17 | Artist |
| 18 | Scientist |
| 19 | Film |
For each domain, choose the primary DBpedia class as the root/main class (e.g. dbo:Food for ontology 13, dbo:WrittenWork for ontology 14). This class is used in steps 2 and 3.
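The class can be given as a prefixed name or as a full URI. As a rough illustration of what that expansion involves, the sketch below maps dbo: names to the DBpedia ontology namespace. The helper expand_class and its prefix table are hypothetical and not taken from the pipeline scripts.

```python
# Hypothetical helper: expand a prefixed class name to a full URI.
# Only the dbo: prefix is registered here; the real scripts may know more.
PREFIXES = {"dbo": "http://dbpedia.org/ontology/"}


def expand_class(name: str) -> str:
    """Return a full URI for names like 'dbo:Food'; pass full URIs through."""
    if name.startswith(("http://", "https://")):
        return name
    prefix, sep, local = name.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    raise ValueError(f"unknown prefix in {name!r}")


print(expand_class("dbo:Food"))
print(expand_class("http://dbpedia.org/ontology/WrittenWork"))
```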
Step 1: Ontology-driven property filtering (generate_refac.py)
generate_refac.py reads a Text2KGBench ontology and extracts DBpedia classes and properties. It then filters a DBpedia dump to:
- Triples whose predicate is in the ontology property set (plus rdf:type and rdfs:label).
- Entities whose types fall within the ontology class set.
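The two filter rules can be sketched in plain Python on a handful of N-Triples lines. This is a minimal in-memory illustration, not the Spark implementation in generate_refac.py: the function filter_triples, the simplified line regex, and the reading "keep a triple if its predicate is allowed and its subject has an allowed type" are all assumptions.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Simplified N-Triples pattern: URI subject, URI predicate, raw object text.
# Blank-node subjects and malformed lines are skipped.
TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+(.*?)\s*\.\s*$")


def filter_triples(lines, properties, classes):
    """Keep triples with an allowed predicate whose subject has an allowed type."""
    allowed = set(properties) | {RDF_TYPE, RDFS_LABEL}
    parsed = [m.groups() for m in map(TRIPLE.match, lines) if m]
    # Subjects that carry at least one rdf:type from the ontology class set.
    typed = {s for s, p, o in parsed if p == RDF_TYPE and o.strip("<>") in classes}
    return [f"<{s}> <{p}> {o} ." for s, p, o in parsed if p in allowed and s in typed]


sample = [
    f"<http://ex.org/pizza> <{RDF_TYPE}> <http://dbpedia.org/ontology/Food> .",
    "<http://ex.org/pizza> <http://dbpedia.org/ontology/country> <http://ex.org/italy> .",
    '<http://ex.org/pizza> <http://dbpedia.org/ontology/abstract> "A dish"@en .',
    "<http://ex.org/rome> <http://dbpedia.org/ontology/country> <http://ex.org/italy> .",
]
kept = filter_triples(
    sample,
    properties={"http://dbpedia.org/ontology/country"},
    classes={"http://dbpedia.org/ontology/Food"},
)
print("\n".join(kept))
```

In this toy input the abstract triple is dropped because its predicate is not in the property set, and the rome triple is dropped because its subject has no allowed type.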
Usage:
uv run resources/text2kgbench/generate_refac.py \
<input_path> \
<output_path> \
<ontology_path> \
[name_prefix]
Example (food domain, local paths):
uv run resources/text2kgbench/generate_refac.py \
file://./data/dbpedia-multi-source-kg-data/selected.nt.bz2 \
./data/text2kgbench/subgraphs \
./data/text2kgbench/kgpipe-ontologies/dbpedia_webnlg/13_food_ontology.ttl
If name_prefix is omitted, it is derived from the ontology filename (e.g. 13_food_ontology.ttl → ont_13).
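One plausible way to derive that prefix is to read the leading numeric ID of the filename. The function derive_name_prefix below is a hypothetical sketch; how the actual script treats filenames without a numeric prefix is not documented here, so this version simply raises an error.

```python
import re
from pathlib import Path


def derive_name_prefix(ontology_path: str) -> str:
    """Map e.g. '13_food_ontology.ttl' to 'ont_13' using the leading numeric ID."""
    match = re.match(r"(\d+)_", Path(ontology_path).name)
    if match is None:
        # Assumption: no fallback is defined for names without a numeric ID.
        raise ValueError(f"no numeric ontology ID in {ontology_path!r}")
    return f"ont_{match.group(1)}"


print(derive_name_prefix("./data/text2kgbench/kgpipe-ontologies/dbpedia_webnlg/13_food_ontology.ttl"))
```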
Outputs under <output_path>/<name_prefix>_subgraph/:
| File | Description |
|---|---|
| property_filter/ | Filtered N-Triples (Spark part-* text files) |
| property_filter_schema/ | CSV schema graph derived from the filtered triples |
Existing outputs are skipped on re-run. When using the in-memory engine in later steps, merge property_filter/part-* into a single .nt file first (see step 2).
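The merge can also be done from Python. The sketch below is a minimal stdlib version and not part of the pipeline: merge_part_files is a hypothetical helper, and its optional english_only flag applies the same regex heuristic as the grep-based language filter described in step 2, so it drops whole lines that contain a non-English language tag.

```python
import re
from pathlib import Path

# Same pattern as the grep filter: "@" followed by a lowercase tag other than "en".
NON_EN_TAG = re.compile(r"@(?!en\b)[a-z]+")


def merge_part_files(part_dir, out_file, english_only=False):
    """Concatenate Spark part-* files into one N-Triples file; return lines written."""
    written = 0
    with open(out_file, "w", encoding="utf-8") as out:
        for part in sorted(Path(part_dir).glob("part-*")):
            with open(part, encoding="utf-8") as src:
                for line in src:
                    if english_only and NON_EN_TAG.search(line):
                        continue  # drop the whole triple, as grep -v does
                    out.write(line)
                    written += 1
    return written
```

A call such as merge_part_files("property_filter", "property_filter.nt", english_only=True) would then produce the single file expected by the in-memory engine. Because the pattern is matched anywhere in the line, a literal that happens to contain text like an e-mail address is dropped as well.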
Step 2: Reachable subgraph extraction (subgraph.py)
subgraph.py takes the property-filtered graph and extracts the class-rooted reachable subgraph. Starting from all entities of the root class (e.g. dbo:Food), it follows resource edges to include indirectly connected entities (cities, countries, etc.) and keeps all triples whose subject is in that reachable set.
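The traversal can be pictured as a breadth-first search over resource edges. The following sketch is an illustrative in-memory version under stated assumptions, not the code in subgraph.py: triples are 4-tuples (subject, predicate, object, object_is_uri), only the forward direction is followed, and rdf:type edges are not traversed so that class URIs do not become entities.

```python
from collections import deque

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def reachable_subgraph(triples, root_class, max_hops=None):
    """Return all triples whose subject is reachable from entities of root_class."""
    triples = list(triples)
    # Forward adjacency over resource-valued edges only.
    adjacency = {}
    for s, p, o, is_uri in triples:
        if is_uri and p != RDF_TYPE:
            adjacency.setdefault(s, set()).add(o)
    roots = {s for s, p, o, _ in triples if p == RDF_TYPE and o == root_class}
    depth = {r: 0 for r in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        if max_hops is not None and depth[node] >= max_hops:
            continue  # hop budget exhausted along this path
        for nxt in adjacency.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return [t for t in triples if t[0] in depth]


toy = [
    ("pizza", RDF_TYPE, "Food", True),
    ("pizza", "country", "italy", True),
    ("italy", "capital", "rome", True),
    ("rome", "label", "Rome", False),
    ("berlin", "label", "Berlin", False),
]
print(sorted({t[0] for t in reachable_subgraph(toy, "Food")}))
```

In the toy graph, berlin is never reached from a Food entity, so its triples are left out; with max_hops=1 the search stops at italy.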
Optional preprocessing: For English-only benchmarks, drop triples whose literals carry a non-@en language tag before this step. The filter below removes whole lines; it does not rewrite the literals:
cat property_filter/* \
| grep -Pv '@(?!en\b)[a-z]+' \
> property_filter_clean.nt
Usage:
uv run resources/text2kgbench/subgraph.py \
<input_path> \
<output_path> \
--root-class <dbo:Class or full URI> \
[--engine memory|spark] \
[--max-hops N] \
[--direction forward|backward|both]
Example:
uv run resources/text2kgbench/subgraph.py \
./data/text2kgbench/subgraphs/ont_13_subgraph/property_filter_clean.nt \
./data/text2kgbench/subgraphs/ont_13_subgraph/reachable.nt \
--root-class dbo:Food \
--engine memory
- --engine memory (default): in-process adjacency traversal; suitable for medium graphs on a single machine.
- --engine spark: distributed traversal via rDF2 on Spark; use for large graphs or cluster runs.
The output path must not already exist.
Step 3: Overlapping splits and reference graphs (splits.py)
splits.py builds overlapping entity splits in the movie-KG style. It samples seed entities from the main class, creates several subsets with controlled overlap, and writes a reference knowledge graph per split by reachability expansion from each seed set.
Usage:
uv run resources/text2kgbench/splits.py \
<input_path> \
<output_dir> \
--main-class <dbo:Class or full URI> \
--subset-size <N> \
[--num-subsets 4] \
[--overlap-ratio 0.04] \
[--engine memory|spark] \
[--main-class-scope reachable|seeds] \
[--max-hops N] \
[--seed 42]
Example:
uv run resources/text2kgbench/splits.py \
./data/text2kgbench/subgraphs/ont_14_subgraph/reachable.nt \
./data/text2kgbench/splits/ont_14_writtenwork \
--main-class dbo:WrittenWork \
--num-subsets 4 \
--overlap-ratio 0.04 \
--subset-size 2500 \
--engine memory
Key parameters:
- --subset-size: number of main-class seed entities per split (required).
- --overlap-ratio: target fraction of shared seeds between split pairs (default 0.04).
- --main-class-scope:
  - reachable (default): reference subgraph includes all entities reachable from seeds.
  - seeds: only seed main-class entities are kept in the induced subgraph.
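To make the interplay of --subset-size, --num-subsets, --overlap-ratio and --seed concrete, here is one simple scheme that yields a fixed pairwise overlap: a shared core of round(subset_size * overlap_ratio) seeds is placed in every subset and the remainder is filled with disjoint entities. This is an assumed scheme for illustration only; splits.py may distribute the overlap differently, and overlapping_subsets is a hypothetical name.

```python
import random


def overlapping_subsets(entities, subset_size, num_subsets=4, overlap_ratio=0.04, seed=42):
    """Sample num_subsets seed sets that pairwise share a common core."""
    shared = round(subset_size * overlap_ratio)
    unique = subset_size - shared
    needed = shared + unique * num_subsets
    if needed > len(entities):
        raise ValueError(f"need {needed} entities, got {len(entities)}")
    rng = random.Random(seed)
    # Sort first so the sample depends only on the seed, not on input order.
    pool = rng.sample(sorted(entities), needed)
    core, rest = pool[:shared], pool[shared:]
    return [
        set(core) | set(rest[i * unique:(i + 1) * unique])
        for i in range(num_subsets)
    ]


entities = [f"http://ex.org/e{i}" for i in range(1000)]
subsets = overlapping_subsets(entities, subset_size=100)
print([len(s) for s in subsets], len(subsets[0] & subsets[1]))
```

With subset_size=100 and the default ratio, every pair of subsets shares 4 seeds under this scheme. The reference graph of each split would then be obtained by reachability expansion from its seed set.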
Output layout:
<output_dir>/
  entities/master_entities.csv       # all main-class entities in input
  split_0/
    index/entities.csv               # seed entities for this split
    kg/reference/
      data.nt                        # per-split reference graph
      data_agg.nt                    # cumulative reference (splits 0..N)
      meta/verified_entities.csv
  split_1/
    ...
The output directory must not already exist. The script prints seed and subgraph overlap statistics on completion.
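A quick sanity check of a finished run is to confirm that every split directory contains the files listed in the layout. The sketch below is a hypothetical helper, not part of the pipeline; it assumes that meta/verified_entities.csv sits under kg/reference/.

```python
from pathlib import Path

# Files expected inside each split_N directory (nesting of meta/ is an assumption).
EXPECTED_PER_SPLIT = [
    "index/entities.csv",
    "kg/reference/data.nt",
    "kg/reference/data_agg.nt",
    "kg/reference/meta/verified_entities.csv",
]


def missing_files(output_dir):
    """Return relative paths that the layout lists but that are absent on disk."""
    root = Path(output_dir)
    missing = []
    if not (root / "entities" / "master_entities.csv").is_file():
        missing.append("entities/master_entities.csv")
    for split in sorted(root.glob("split_*")):
        for rel in EXPECTED_PER_SPLIT:
            if not (split / rel).is_file():
                missing.append(f"{split.name}/{rel}")
    return missing
```

An empty result from missing_files("./data/text2kgbench/splits/ont_14_writtenwork") would indicate that the bundle is complete enough to run step 4.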
Step 4: Shaded source graphs (sources.py)
sources.py derives integration sources from the reference bundles produced by splits.py. It rewrites entity and ontology URIs into shaded namespaces (same approach as resources/movie-multi-source-kg/generate.py):
| Output | Namespace pattern | Role |
|---|---|---|
| split_N/kg/seed/data.nt | http://kg.org/resource/{md5} | Seed KG source |
| split_N/sources/rdf/data.nt | http://kg.org/rdf/N/... | Split-scoped RDF source |
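The seed namespace pattern suggests that each entity URI is replaced by a hash-based identifier. The sketch below illustrates that idea under the assumption that the MD5 digest is computed over the original URI string encoded as UTF-8; the function shade_seed_uri is hypothetical, and the input that sources.py actually hashes may differ. The split-scoped RDF pattern is not sketched because its local part is not specified here.

```python
import hashlib

SEED_NAMESPACE = "http://kg.org/resource/"


def shade_seed_uri(uri: str) -> str:
    """Rewrite a URI into the seed namespace using the MD5 hex digest of the URI."""
    digest = hashlib.md5(uri.encode("utf-8")).hexdigest()
    return SEED_NAMESPACE + digest


shaded = shade_seed_uri("http://dbpedia.org/resource/Pizza")
print(shaded)
```

Because the digest is deterministic, the same original entity maps to the same shaded URI in every split, while the original DBpedia identifier is no longer visible in the source graph.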
Usage:
uv run resources/text2kgbench/sources.py <splits_dir> [--split N] [--overwrite]
Example:
uv run resources/text2kgbench/sources.py \
./data/text2kgbench/splits/ont_14_writtenwork
Each split also gets meta/verified_entities.csv under kg/seed/ and sources/rdf/.
End-to-end example (ontology 13, Food)
# 1. Filter DBpedia by ontology classes and properties (Spark; writes a part-file directory)
uv run resources/text2kgbench/generate_refac.py \
file://./data/dbpedia-multi-source-kg-data/selected.nt.bz2 \
./data/text2kgbench/subgraphs \
./data/text2kgbench/kgpipe-ontologies/dbpedia_webnlg/13_food_ontology.ttl
# 1b. Merge Spark part files into one NT for the in-memory engine (optional @en filter)
cat ./data/text2kgbench/subgraphs/ont_13_subgraph/property_filter/* \
| grep -Pv '@(?!en\b)[a-z]+' \
> ./data/text2kgbench/subgraphs/ont_13_subgraph/property_filter.nt
# 2. Extract reachable subgraph from dbo:Food
uv run resources/text2kgbench/subgraph.py \
./data/text2kgbench/subgraphs/ont_13_subgraph/property_filter.nt \
./data/text2kgbench/subgraphs/ont_13_subgraph/reachable.nt \
--root-class dbo:Food \
--engine memory
# 3. Build four overlapping splits
uv run resources/text2kgbench/splits.py \
./data/text2kgbench/subgraphs/ont_13_subgraph/reachable.nt \
./data/text2kgbench/splits/ont_13_food \
--main-class dbo:Food \
--subset-size 2500 \
--engine memory
# 4. Generate shaded seed and RDF sources
uv run resources/text2kgbench/sources.py \
./data/text2kgbench/splits/ont_13_food
Running on a Spark cluster
For large DBpedia dumps, run generate_refac.py via spark-submit. Build wheels and dependencies for Python 3.10 (the default in the project Spark image), even if local development uses 3.12:
SPARK_PYTHON=3.10
uv build --python $SPARK_PYTHON
uv export --format requirements.txt --no-dev --no-hashes \
--prune pyspark --no-emit-package pyodibel --python $SPARK_PYTHON \
-o /tmp/spark-deps.txt
uv pip install --python $SPARK_PYTHON --target dist/deps --python-version 3.10 \
-r /tmp/spark-deps.txt
cd dist/deps && zip -r ../deps.zip . && cd ../..
spark-submit \
--master yarn \
--py-files dist/pyodibel-0.1.0-py3-none-any.whl,dist/deps.zip \
--driver-memory 8g \
--executor-memory 16g \
resources/text2kgbench/generate_refac.py \
hdfs:///path/to/selected.nt.bz2 \
hdfs:///path/to/subgraphs \
hdfs:///path/to/13_food_ontology.ttl
Use --engine spark on subgraph.py and splits.py when the intermediate graphs are too large for in-memory processing.
Environment variables
Scripts accept paths via CLI arguments or a .env file in resources/text2kgbench/:
| Variable | Used by |
|---|---|
| INPUT_PATH | generate_refac.py, subgraph.py, splits.py |
| OUTPUT_PATH | generate_refac.py, subgraph.py, splits.py, sources.py |
| ROOT_CLASS | subgraph.py |
| MAIN_CLASS | splits.py |
| SUBSET_SIZE | splits.py |
| SPLITS_DIR | sources.py |
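How a script might combine a CLI argument with these variables can be sketched as follows. Both helpers are hypothetical, and the precedence shown (CLI value first, then the process environment, then the .env file) is an assumption rather than documented behaviour of the scripts.

```python
import os


def parse_env_lines(lines):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def resolve_setting(cli_value, name, file_values, environ=None):
    """Pick the CLI value if given, else the environment, else the .env entry."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    if name in environ:
        return environ[name]
    return file_values.get(name)


env_file = parse_env_lines(["# paths", "INPUT_PATH=./data/in.nt", 'ROOT_CLASS="dbo:Food"'])
print(resolve_setting(None, "ROOT_CLASS", env_file, environ={}))
```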
Related scripts
- sampling.py: optional downsampling of an NT file while preserving class/degree or relation distributions. Use when a reachable subgraph is still too large before splitting.
- crawl.py: fetches web resources for source acquisition (separate from the RDF pipeline above).
Further cluster-specific notes are in resources/text2kgbench/athena-ops/README.md.