RDF [[RDF-CONCEPTS]] describes a graph-based data model for making claims about the world and provides the foundation for reasoning upon that graph of information. At times, it becomes necessary to compare the differences between datasets of graphs, digitally sign them, or generate short identifiers for graphs via hashing algorithms. This document outlines an algorithm for normalizing RDF datasets such that these operations can be performed.

This document is an experimental work in progress.

Introduction

When data scientists discuss normalization, they do so in the context of achieving a particular set of goals. Since the same information may sometimes be expressed in a variety of different ways, it often becomes necessary to be able to transform each of these different ways into a single, standard format. With a standard format, the differences between two different sets of data can be easily determined, a cryptographically-strong hash identifier can be generated for a particular set of data, and a particular set of data may be digitally-signed for later verification.

In particular, this specification is about normalizing RDF datasets, which are collections of graphs. Since a directed graph can express the same information in more than one way, it requires normalization to achieve the aforementioned goals and any others that may arise via surrendipity.

Most RDF dataset can be normalized fairly quickly, in terms of algorithmic time complexity. However, those that contain nodes that do not have globally unique identifiers pose a greater challenge. Normalizing these datasets presents the graph isomorphism problem, a problem that is believed to be difficult to solve quickly. More formally, it is believed to be an NP-Intermediate problem, that is, neither known to be solvable in polynomial time nor NP-complete. Fortunately, existing real world data is rarely modeled in a way that manifests this problem and new data can be modeled to avoid it. In fact, software systems can detect a problematic dataset and may choose to assume it's an attempted denial of service attack, rather than a real input, and abort.

This document outlines an algorithm for generating a normalized RDF dataset given an RDF dataset as input. The algorithm is called the Universal RDF Dataset Normalization Algorithm 2015 or URDNA2015.

How to Read this Document

This document is a detailed specification for an RDF dataset normalization algorithm. The document is primarily intended for the following audiences:

To understand the basics in this specification you must be familiar with basic RDF concepts [[!RDF-CONCEPTS]]. A working knowledge of graph theory and graph isomorphism is also recommended.

Contributing

There are a number of ways that one may participate in the development of this specification:

Terminology

General Terminology

This document uses the following terms as defined in JSON [[!RFC4627]]. Refer to the JSON Grammar section in [[!RFC4627]] for formal definitions.

JSON object
An object structure is represented as a pair of curly brackets surrounding zero or more key-value pairs. A key is a string. A single colon comes after each key, separating the key from the value. A single comma separates a value from a following key.
array
An array structure is represented as square brackets surrounding zero or more values. Values are separated by commas. In JSON, an array is an ordered sequence of zero or more values.
string
A string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes (if necessary).
number
A number is similar to that used in most programming languages, except that the octal and hexadecimal formats are not used and leading zeros are not allowed.
true and false
Values that are used to express one of two possible boolean states.
null
RDF dataset
A dataset as specified by [[RDF11-CONCEPTS]] representing a collection of RDF graphs.

Normalization

This algorithm is a work in progress, do not implement it.

Normalization is the process of transforming an input dataset to a standard dataset. That is, for two input datasets containing the same information, regardless of their arrangement, will be transformed into identical standard datasets. The problem is a fairly difficult technical problem to solve because it requires directed graphs to be deterministically ordered into sets of nodes and edges. This is easy to do when all of the nodes have globally unique names, but difficult to do when some of the nodes are not labeled. Any unlabeled nodes must be deterministically labeled.

In time, there may be more than one normalization algorithm and, therefore, for identification purposes, this algorithm is named the "Universal RDF Datset Normalization Algorithm 2015" (URDNA2015).

Normalization Algorithm Terms

input dataset
The abstract RDF dataset that is provided as input to the algorithm.
standard dataset
The abstract RDF dataset that is produced as output by the algorithm.
quad
TODO
subject
TODO
predicate
TODO
object
TODO
graph name
TODO
blank node
TODO
blank node labeler
A blank node labeler maintains a mapping of existing blank node labels to new ones. It also tracks the order in which new labels were assigned.
label
The subject IRI associated with a graph node.

Normalization State

When performing the steps required by the normalization algorithm, it is helpful to track state in a data structure called the normalization state. The information contained in the normalization state is described below.

blank node to quads map
A data structure that maps blank node labels to the quads in which they appear in the input dataset.
hash to blank nodes map
A data structure that maps a hash to a list of blank node labels.
canonical labeler
A unique labeler that uses the prefix _:c14n to assign labels to blank nodes.
list of non-normalized blank nodes
TODO

Blank Node Labeler State

When relabeling blank nodes during the normalization algorithm, blank node labelers need to keep track of state. The information they need to track is described below.

labeling prefix
The labeling prefix is a string that is used as the beginning of a blank node label. It should be initialized to a string that is specified by the normalization algorithm. The prefix is used to help generate unique new labels for blank nodes. It is concatenated with a labeling counter to produce a blank node label. For example, _:c14n is a proper initial value for the labeling prefix that would produce blank node labels like _:c14n1.
labeling counter
A counter that is used to label blank nodes. It is appended to the labeling prefix to create a blank node label. It is initialized to 1.
labeler ordering list
A list that tracks the order in which existing blank node labels were relabeled.

Normalization Algorithm

The normalization algorithm converts an input dataset into a standard dataset. This algorithm will assign deterministic labels to any blank nodes in the input dataset.

Documenting the algorithm is a WIP, various steps will become more detailed in time.

Overview

Algorithm

  1. Create the normalization state.
  2. For every quad in input dataset:
    1. For each blank node that occurs in the quad's subject, object, or graph name position, add a reference to the quad using the blank node's label in the blank node to quads map, creating a new entry in the map if necessary.
  3. Populate the list of non-normalized blank nodes using the keys from the blank node to quads map.
  4. Initialize an easy label flag to true.
  5. While the easy label flag is true, canonically-label the easy-to-label blank nodes:
    1. Set the easy label flag to false.
    2. Clear the hash to blank nodes map.
    3. For each blank node label in list of non-normalized blank nodes:
      1. Create a hash according to the Hash Quads algorithm.
      2. Add the hash and blank node label to the hash to blank nodes map, creating a new entry if necessary.
    4. For each entry in the hash to blank nodes map in lexicographically-sorted order:
      1. If the length of the list of blank node labels is greater than 1, continue to the next entry.
      2. Create the canonical label for the single blank node label in the list of blank node labels using the Node Relabeling algorithm.
      3. Remove the entry from the hash to blank nodes map.
      4. Set the easy label flag to true.
  6. For each entry in the hash to blank nodes map in lexicographically-sorted order:
    1. Create a hash path list.
    2. For each blank node label in the entry's list:
      1. If the blank node label has already been mapped to a canonical label, continue to the next blank node.
      2. Create a blank node labeler using the prefix _:b.
      3. Assign the blank node label a new label using the blank node labeler.
      4. Run the Hash Paths algorithm, passing the blank node labeler, and append the result to the hash path list.
    3. For each entry in the hash path list, sorted by hash:
      1. Assign a canonical label to each blank node label in the entry's blank node labeler, in the order in which they were assigned temporary labels.
  7. For each quad in input dataset:
    1. Create a copy the quad and replace any blank node labels using the canonical labeler.
    2. Add the copied quad to the standard dataset.
  8. Return the standard dataset.

Node Relabeling Algorithm

WIP; perhaps change name to Node Label Generation/Assignment Algorithm as the actual relabeling occurs elsewhere.

Overview

Algorithm

This algorithm generates a mapping from an old blank node label to a new unique blank node label. It also tracks the order in which blank node labels were assigned.

The node relabeling algorithm is as follows:

  1. If the blank node has already been relabeled, return the same unique new label that was returned previously.
  2. Append the blank node's label to the labeler ordering list.
  3. Generate the new label by concatenating the labeling prefix with the string value of the labeling counter.
  4. Increment the labeling counter.
  5. Return the new label.

Hash Quads

TODO

Overview

Algorithm

Hash Paths

TODO

Overview

Usually, when trying to determine if two nodes in a graph are equivalent, you simply compare their identifiers. However, what if the nodes don't have identifiers? Then you must determine if the two nodes have equivalent connections to equivalent nodes all throughout the whole graph. This is called the graph isomorphism problem. This algorithm approaches this problem by considering how one might draw a graph on paper. You can test to see if two nodes are equivalent by drawing the graph twice. The first time you draw the graph the first node is drawn in the center of the page. If you can draw the graph a second time such that it looks just like the first, except the second node is in the center of the page, then the nodes are equivalent. This algorithm essentially defines a determinitsic way to draw a graph where, if you begin with a particular node, the graph will always be drawn the same way. If two graphs are drawn the same way with two different nodes, then the nodes are equivalent. A hash is used to indicate a particular way that the graph has been drawn and can be used to compare nodes.

This algorithm works in concert with the main normalization algorithm to produce a unique, deterministic identifier for a particular blank node. This hash incorporates all of the information that is connected to the blank node as well as how it is connected. It does this by creating deterministic paths that emanate out from the blank node through any other adjacent blank nodes. The process is as follows:

  • Start by taking the perspective that the blank node is at the center of the graph. This is the starting point for which all paths will radiate outward.
  • Create a blank node labeler and assign a label to the starting blank node.
  • Start the main message digest hash.
  • For each adjacent blank node, determine a label for it, first using the canonical label if already assigned, second using the label from the local blank node labeler if already assigned, and lastly, use the result of the Hash Quads algorithm as the label.
  • Create a group message digest hash, digesting data representing the direction of the blank node relationship (subject to object or object from subject) relative to the original blank node, the predicate, and the blank node label. Store a record of this group hash and its matching blank node.
  • Now iterate over each of the group hashes and related blank nodes. For each group, use the main hash to digest the group hash and create a copy of the blank node labeler to be used in case recursion is needed. Then, proceed to construct a path. For each permutation of blank nodes in the group, concatenate the canonical label for it, if already assigned, to the path. If not, and the blank node labeler has not assigned it a label, take note of this and then assign it one. If a path has already been selected as the canonical path, stop permutating.
  • If no path has been chosen as the canonical one, recurse into each of the blank nodes that was assigned a label in the previous step, passing the copy of the blank node labeler to use in the algorithm. Append the resulting hash of the recursion to the path and update the reference to the copy of the blank node labeler to the one given in the result of the algorithm.
  • If no path has been chosen or the current path is shorter than the currently chosen path, update the chosen path to the current path. Set the chosen namer to match the one used to produce the path.
  • Use the main hash to digest the chosen path and return it and the chosen namer as the result of the algorithm.

When processing adjacent blank nodes, can they be skipped if they've already been canonically labeled?

When creating group hashes, does the graph name also need to be digested (not just direction of relationship, predicate, and blank node label)? It may not be necessary as any structural differences would get picked up by the Hash Quads algorithm or by recursion. If there are no differences then it won't matter because the same hash will be produced anyway.

Algorithm

Acknowledgements

The editors would like to thank Jeremy Carroll for his work on the graph normalization problem, Gavin Carothers for providing valuable feedback and testing input for the algorithm defined in this specification, Sir Tim Berners Lee for his thoughts on graph normalization over the years, and Jesús Arias Fisteus for his work on a similar algorithm. Finally, a huge thank you goes out to Dave Longley who designed and implemented the algorithms used in this specification, which turned out to be a monumentally difficult design challenge.