RDF [[RDF-CONCEPTS]] describes a graph-based data model for making claims
about the world and provides the foundation for reasoning upon that graph
of information. At times, it becomes necessary to compare the differences
between datasets of graphs, digitally sign them, or generate short
identifiers for graphs via hashing algorithms. This document outlines an
algorithm for normalizing RDF datasets such that these operations can be
performed.
This document is an experimental work in progress.
When data scientists discuss normalization, they do so in the context of
achieving a particular set of goals. Since the same information may
sometimes be expressed in a variety of different ways, it often becomes
necessary to be able to transform each of these different ways into a
single, standard format. With a standard format, the differences between
two sets of data can be easily determined, a
cryptographically-strong hash identifier can be generated for a particular
set of data, and a particular set of data may be digitally signed for later
verification.
In particular, this specification is about normalizing
RDF datasets, which are collections of graphs. Since
a directed graph can express the same information in more than one
way, it requires normalization to achieve the aforementioned goals
and any others that may arise via serendipity.
Most RDF datasets can be normalized fairly quickly, in terms
of algorithmic time complexity. However, those that contain nodes that do
not have globally unique identifiers pose a greater challenge. Normalizing
these datasets presents the graph isomorphism problem, a
problem that is believed to be difficult to solve quickly. More formally,
it is believed to be an NP-Intermediate problem, that is, neither known to
be solvable in polynomial time nor NP-complete. Fortunately, existing real
world data is rarely modeled in a way that manifests this problem and new
data can be modeled to avoid it. In fact, software systems can detect a
problematic dataset and may choose to treat it as an attempted denial of
service attack, rather than a real input, and abort.
This document outlines an algorithm for generating a normalized
RDF dataset given an RDF dataset as input. The
algorithm is called the
Universal RDF Dataset Normalization Algorithm 2015, or URDNA2015.
How to Read this Document
This document is a detailed specification for an RDF dataset
normalization algorithm. The document is primarily intended for the
following audiences:
- Software developers that want to implement an RDF dataset
normalization algorithm
To understand the basics in this specification, you must be familiar with
basic RDF concepts [[!RDF-CONCEPTS]]. A working knowledge of
graph theory and graph isomorphism is also recommended.
There are a number of ways that one may participate in the development
of this specification:
This document uses the following terms as defined in JSON [[!RFC4627]]. Refer
to the JSON Grammar section in [[!RFC4627]] for formal definitions.
- JSON object
An object structure is represented as a pair of curly brackets surrounding
zero or more key-value pairs. A key is a string.
A single colon comes after each key, separating the key from the value.
A single comma separates a value from a following key.
- array
In JSON, an array is an ordered sequence of zero or more values.
An array structure is represented as square brackets surrounding zero
or more values. Values are separated by commas.
- string
A string is a sequence of zero or more Unicode characters,
wrapped in double quotes, using backslash escapes (if necessary).
- number
A number is similar to that used in most programming languages, except
that the octal and hexadecimal formats are not used and leading zeros
are not allowed.
- true and false
Values that are used to express one of two possible boolean states.
- RDF dataset
A dataset as specified by [[RDF11-CONCEPTS]] representing a collection of
RDF graphs.
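As a quick illustration of the JSON terms above, the sketch below parses a small JSON text containing an object, an array, strings, numbers, and the two boolean values. The document content is invented for illustration; this specification does not mandate Python.

```python
import json

# A JSON object whose value types match the definitions above:
# keys are strings; values may be strings, numbers, arrays, or booleans.
text = '{"name": "example", "values": [1, 2.5, true, false, "text"]}'
data = json.loads(text)
```

Parsing preserves the ordered nature of the array and maps `true`/`false` to the host language's boolean values.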
This algorithm is a work in progress; do not implement it.
Normalization is the process of transforming an input dataset
to a standard dataset. That is, two
input datasets containing the same information, regardless
of their arrangement, will be transformed into identical
standard datasets. The problem is a fairly difficult technical
problem to solve because it requires directed graphs to be deterministically
ordered into sets of nodes and edges. This is easy to do when all of the
nodes have globally unique names, but difficult to do when some of the nodes
are not labeled. Any unlabeled nodes must be deterministically labeled.
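To make the easy case concrete, here is a minimal sketch (in Python, which this specification does not mandate): when every node has a globally unique name, deterministic ordering reduces to sorting the serialized quads, and a hash of the sorted serialization identifies the dataset. The serialization style and hash function below are assumptions for illustration.

```python
import hashlib

# Two arrangements of the same ground dataset (no blank nodes),
# written as simplified N-Quads-style strings (illustrative only).
dataset_a = [
    '<http://example.org/s> <http://example.org/p> <http://example.org/o> .',
    '<http://example.org/s> <http://example.org/q> "hello" .',
]
dataset_b = list(reversed(dataset_a))  # same information, different order

def canonical_hash(quads):
    # With globally unique names, sorting alone yields a deterministic order.
    return hashlib.sha256("\n".join(sorted(quads)).encode("utf-8")).hexdigest()
```

Both arrangements hash identically. Blank nodes break this approach because their labels carry no stable, globally meaningful value to sort on, which is why the rest of this algorithm exists.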
In time, there may be more than one normalization algorithm and,
therefore, for identification purposes, this algorithm is named the
"Universal RDF Dataset Normalization Algorithm 2015".
Normalization Algorithm Terms
- input dataset
The abstract RDF dataset that is provided as input to the
normalization algorithm.
- standard dataset
The abstract RDF dataset that is produced as output by the
normalization algorithm.
- graph name
- blank node
- blank node labeler
A blank node labeler maintains a mapping of existing blank node
labels to new ones. It also tracks the order in which new labels
were assigned.
The subject IRI associated with a graph node.
When performing the steps required by the normalization algorithm,
it is helpful to track state in a data structure called the
normalization state. The information contained in the
normalization state is described below.
- blank node to quads map
- A data structure that maps blank node labels to
the quads in which they appear in the input dataset.
- hash to blank nodes map
- A data structure that maps a hash to a list of blank node labels.
- canonical labeler
- A unique labeler that uses a canonical labeling prefix to
assign labels to blank nodes.
- list of non-normalized blank nodes
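A concrete representation of the normalization state might look like the following sketch; the field names and container types are assumptions for illustration, not part of this specification.

```python
from collections import defaultdict

class NormalizationState:
    """Bookkeeping for the normalization algorithm (illustrative sketch)."""

    def __init__(self, canonical_labeler):
        # blank node label -> list of quads the blank node appears in
        self.blank_node_to_quads = defaultdict(list)
        # hash -> list of blank node labels that produced that hash
        self.hash_to_blank_nodes = defaultdict(list)
        # labeler that assigns the final, canonical labels
        self.canonical_labeler = canonical_labeler
        # labels of blank nodes not yet canonically labeled
        self.non_normalized_blank_nodes = set()
```

Using `defaultdict` means referencing a blank node label for the first time implicitly "creates a new entry in the map", matching the phrasing used in the algorithm steps below.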
Blank Node Labeler State
When relabeling blank nodes during the normalization
algorithm, blank node labelers need to keep track of
state. The information they need to track is described below.
- labeling prefix
The labeling prefix is a string that is used as the beginning of a
blank node label. It should be initialized to a
string that is specified by the normalization algorithm. The prefix is
used to help generate unique new labels for blank nodes.
It is concatenated with a labeling counter to produce
a blank node label.
- labeling counter
A counter that is used to label blank nodes. It is
appended to the labeling prefix to create a
blank node label and is incremented each time a new label is generated.
- labeler ordering list
A list that tracks the order in which existing blank node
labels were relabeled.
The normalization algorithm converts an input dataset
into a standard dataset. This algorithm will assign
deterministic labels to any blank nodes in the
Documenting the algorithm is a work in progress; various steps will
become more detailed in time.
- Create the normalization state.
- For every quad in input dataset:
- For each blank node that occurs in the
quad's subject, object, or
graph name position, add a reference to the
quad using the blank node's
label in the blank node to quads map, creating a
new entry in the map if necessary.
- Populate the list of non-normalized blank nodes using
the keys from the blank node to quads map.
- Initialize an easy label flag to true.
- While the easy label flag is true,
canonically label the easy-to-label blank nodes:
- Set the easy label flag to false.
- Clear the hash to blank nodes map.
- For each blank node label in
list of non-normalized blank nodes:
- Create a hash according to the
Hash Quads algorithm.
- Add the hash and blank node label to the
hash to blank nodes map, creating a new entry
if necessary.
- For each entry in the hash to blank nodes map in
sorted hash order:
- If the length of the list of
blank node labels is greater than
1, continue to the next entry.
- Create the canonical label for the single
blank node label in the list of
blank node labels using the
Node Relabeling algorithm.
- Remove the entry from the
hash to blank nodes map.
- Set the easy label flag to true.
- For each entry in the hash to blank nodes map in
sorted hash order:
- Create a hash path list.
- For each blank node label in the entry's list:
- If the blank node label has already
been mapped to a canonical label, continue to the next
blank node label.
- Create a blank node labeler using the labeling prefix
specified by the normalization algorithm.
- Assign the blank node label a new
label using the blank node labeler.
- Run the Hash Paths algorithm,
passing the blank node labeler, and append the result to the
hash path list.
- For each entry in the hash path list, sorted
in ascending order:
- Assign a canonical label to each blank node
label in the entry's blank node labeler, in the order in
which they were assigned temporary labels.
- For each quad in input dataset:
- Create a copy of the quad and replace any
blank node labels using the canonical labeler.
- Add the copied quad to the standard dataset.
- Return the standard dataset.
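The easy-labeling loop in the steps above can be sketched as follows. The helper names (`hash_quads`, `assign`) are assumptions standing in for the Hash Quads algorithm and the canonical labeler; the sketch shows only the control flow, not the real hashing.

```python
def label_easy_blank_nodes(non_normalized, hash_quads, canonical_labeler):
    """Canonically label blank nodes whose hash is unique (sketch)."""
    easy = True
    while easy:
        easy = False
        # Rebuild the hash to blank nodes map from scratch each pass.
        hash_to_nodes = {}
        for label in sorted(non_normalized):
            hash_to_nodes.setdefault(hash_quads(label), []).append(label)
        for h in sorted(hash_to_nodes):
            labels = hash_to_nodes[h]
            if len(labels) > 1:
                continue  # shared hash: not easy, handled by Hash Paths later
            canonical_labeler.assign(labels[0])
            non_normalized.discard(labels[0])
            easy = True  # progress made, so re-hash the remaining nodes
    return non_normalized
```

Each canonical assignment can change the hashes of neighboring blank nodes, which is why the loop re-hashes everything until a full pass makes no progress.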
Node Relabeling Algorithm
WIP; perhaps change the name to Node Label
Generation/Assignment Algorithm, as the actual relabeling occurs
elsewhere.
This algorithm generates a mapping from an old blank node
label to a new unique blank node label. It also tracks
the order in which blank node labels were assigned.
The node relabeling algorithm is as follows:
- If the blank node has already been relabeled,
return the same unique new label that was returned previously.
- Append the blank node's label to the
labeler ordering list.
- Generate the new label by concatenating the
labeling prefix with the string value of the
labeling counter.
- Increment the labeling counter.
- Return the new label.
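A minimal labeler combining the state described earlier with these steps might look like the following sketch; the default prefix and the counter's starting value are assumptions, since this document leaves the specific values to the normalization algorithm.

```python
class BlankNodeLabeler:
    """Maps existing blank node labels to new deterministic labels (sketch)."""

    def __init__(self, prefix="b"):
        self.prefix = prefix   # labeling prefix (value assumed)
        self.counter = 0       # labeling counter (starting value assumed)
        self.order = []        # labeler ordering list
        self.mapping = {}      # old label -> new label

    def relabel(self, old_label):
        # If already relabeled, return the same label as before.
        if old_label in self.mapping:
            return self.mapping[old_label]
        # Track the order in which labels were assigned.
        self.order.append(old_label)
        # New label = labeling prefix + string value of the counter.
        new_label = f"{self.prefix}{self.counter}"
        self.counter += 1
        self.mapping[old_label] = new_label
        return new_label
```

Because the counter only ever increases and each old label is mapped exactly once, relabeling the same inputs in the same order always yields the same outputs, which is the property the normalization algorithm depends on.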
Usually, when trying to determine if two nodes in a graph are
equivalent, you simply compare their identifiers. However, what if the
nodes don't have identifiers? Then you must determine if the two nodes
have equivalent connections to equivalent nodes all throughout the
whole graph. This is called the graph isomorphism problem. This
algorithm approaches this problem by considering how one might draw
a graph on paper. You can test to see if two nodes are equivalent
by drawing the graph twice. The first time you draw the graph the
first node is drawn in the center of the page. If you can draw the
graph a second time such that it looks just like the first, except
the second node is in the center of the page, then the nodes are
equivalent. This algorithm essentially defines a deterministic way to
draw a graph where, if you begin with a particular node, the graph
will always be drawn the same way. If two graphs are drawn the same way
with two different nodes, then the nodes are equivalent. A hash is used
to indicate a particular way that the graph has been drawn and can
be used to compare nodes.
This algorithm works in concert with the main normalization algorithm
to produce a unique, deterministic identifier for a particular blank
node. This hash incorporates all of the information that is connected
to the blank node as well as how it is connected. It does this by
creating deterministic paths that emanate out from the blank node
through any other adjacent blank nodes. The process is as follows:
- Start by taking the perspective that the blank node is
at the center of the graph. This is the starting point for which all
paths will radiate outward.
- Create a blank node labeler and assign a label
to the starting blank node.
- Start the main message digest hash.
- For each adjacent blank node, determine a label for
it, first using the canonical label if already assigned, second
using the label from the local blank node labeler if
already assigned, and lastly, use the result of the
Hash Quads algorithm as the label.
- Create a group message digest hash, digesting data representing
the direction of the blank node relationship (subject
to object or object to subject) relative to the original
blank node, the predicate, and the
blank node label. Store a record of this group hash
and its matching blank node.
- Now iterate over each of the group hashes and related
blank nodes. For each group, use the main hash to
digest the group hash and create a copy of the
blank node labeler to be used in case recursion is
needed. Then, proceed to construct a path. For each permutation of
blank nodes in the group, concatenate the canonical
label for it, if already assigned, to the path. If not, and the
blank node labeler has not assigned it a label, take
note of this and then assign it one. If a path has already been
selected as the canonical path, stop permuting.
- If no path has been chosen as the canonical one, recurse into
each of the blank nodes that was assigned a label
in the previous step, passing the copy of the
blank node labeler to use in the algorithm. Append the
resulting hash of the recursion to the path and update the reference
to the copy of the blank node labeler to the one
given in the result of the algorithm.
- If no path has been chosen or the current path is shorter than
the currently chosen path, update the chosen path to the current
path. Set the chosen labeler to match the one used to produce the
current path.
- Use the main hash to digest the chosen path and return it
and the chosen labeler as the result of the algorithm.
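The per-edge group hash described in the steps above can be sketched as follows. The concrete serialization, separator, and digest algorithm are assumptions for illustration; the specification text leaves those details to the normative algorithm definition.

```python
import hashlib

def group_hash(direction, predicate, blank_node_label):
    """Digest the edge direction, predicate, and related label (sketch)."""
    h = hashlib.sha256()
    for part in (direction, predicate, blank_node_label):
        h.update(part.encode("utf-8"))
        # Separator prevents ambiguous concatenation ("ab"+"c" vs "a"+"bc").
        h.update(b"\x00")
    return h.hexdigest()
```

Because the digest covers the direction of the relationship, an edge pointing toward a blank node hashes differently from one pointing away, even when the predicate and label are identical.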
When processing adjacent blank nodes, can they be skipped
if they've already been canonically labeled?
When creating group hashes, does the graph name also
need to be digested (not just direction of relationship, predicate,
and blank node label)? It may not be necessary as any structural
differences would get picked up by the Hash Quads algorithm or by
recursion. If there are no differences then it won't matter because
the same hash will be produced anyway.
The editors would like to thank Jeremy Carroll for his work on the
graph normalization problem, Gavin Carothers for providing valuable
feedback and testing input for the algorithm defined in this
specification, Sir Tim Berners-Lee for his thoughts on graph normalization
over the years, and Jesús Arias Fisteus for his work on a similar
algorithm. Finally, a huge thank you goes out to Dave Longley who designed
and implemented the algorithms used in this specification, which turned out
to be a monumentally difficult design challenge.