RDF [[RDF-CONCEPTS]] describes a graph-based data model for making claims about the world and provides the foundation for reasoning upon that graph of information. At times, it becomes necessary to compare the differences between sets of graphs, digitally sign them, or generate short identifiers for graphs via hashing algorithms. This document outlines an algorithm for normalizing RDF datasets such that these operations can be performed.

This document is a work in progress.

Introduction

When data scientists discuss normalization, they do so in the context of achieving a particular set of goals. Since the same information may sometimes be expressed in a variety of different ways, it often becomes necessary to be able to transform each of these different ways into a single, standard format. With a standard format, the differences between two different sets of data can be easily determined, a cryptographically-strong hash identifier can be generated for a particular set of data, and a particular set of data may be digitally-signed for later verification.

In particular, this specification is about normalizing RDF datasets, which are collections of graphs. Since a directed graph can express the same information in more than one way, it requires normalization to achieve the aforementioned goals and any others that may arise via surrendipity.

Most RDF dataset can be normalized fairly quickly, in terms of algorithmic time complexity. However, those that contain nodes that do not have globally unique identifiers pose a greater challenge. Normalizing these datasets presents the graph isomorphism problem, a problem that is believed to be difficult to solve quickly. More formally, it is believed to be an NP-Intermediate problem, that is, neither known to be solvable in polynomial time nor NP-complete. Fortunately, existing real world data is rarely modeled in a way that manifests this problem and new data can be modeled to avoid it. In fact, software systems can detect a problematic dataset and may choose to assume it's an attempted denial of service attack, rather than a real input, and abort.

This document outlines an algorithm for generating a normalized RDF dataset given an RDF dataset as input. The algorithm is called the Universal RDF Dataset Normalization Algorithm 2015 or URDNA2015.

How to Read this Document

This document is a detailed specification for an RDF dataset normalization algorithm. The document is primarily intended for the following audiences:

To understand the basics in this specification you must be familiar with basic RDF concepts [[!RDF-CONCEPTS]]. A working knowledge of graph theory and graph isomorphism is also recommended.

Contributing

There are a number of ways that one may participate in the development of this specification:

Terminology

General Terminology

string
A string is a sequence of zero or more Unicode characters.
true and false
Values that are used to express one of two possible boolean states.
IRI
An IRI (Internationalized Resource Identifier) is a string that conforms to the syntax defined in [[RFC3987]].
RDF subject
A subject as specified by [[RDF11-CONCEPTS]].
RDF predicate
A predicate as specified by [[RDF11-CONCEPTS]].
RDF object
An object as specified by [[RDF11-CONCEPTS]].
RDF triple
A triple as specified by [[RDF11-CONCEPTS]].
RDF graph
An RDF graph as specified by [[RDF11-CONCEPTS]].
RDF graph name
A graph name as specified by [[RDF11-CONCEPTS]].
RDF quad
The pairing of an RDF triple with an RDF graph name.
RDF dataset
A dataset as specified by [[RDF11-CONCEPTS]].
RDF blank node
A blank node as specified by [[RDF11-CONCEPTS]]. In short, it is a node in a graph that is neither an IRI, nor a literal.
RDF blank node identifier
A blank node identifier as specified by [[RDF11-CONCEPTS]]. In short, it is a string that begins with _: that is used as an identifier for an RDF blank node. RDF blank node identifiers are typically implementation-specific local identifiers; this document specifies an algorithm for deterministically specifying them.

Normalization

Normalization is the process of transforming an input RDF dataset to a normalized RDF dataset. That is, any two input RDF datasets that contain the same information, regardless of their arrangement, will be transformed into identical normalized RDF datasets. The problem requires directed graphs to be deterministically ordered into sets of nodes and edges. This is easy to do when all of the nodes have globally-unique identifiers, but can be difficult to do when some of the nodes do not. Any nodes without globally-unique identifiers must be issued deteriministic identifiers.

In time, there may be more than one normalization algorithm and, therefore, for identification purposes, this algorithm is named the "Universal RDF Dataset Normalization Algorithm 2015" (URDNA2015).

Normalization Algorithm Terms

input RDF dataset
The abstract RDF dataset that is provided as input to the algorithm.
normalized RDF dataset
The abstract RDF dataset and normalized RDF blank node identifiers that are produced as output by the algorithm.
identifier issuer
An identifier issuer is used to issue new RDF blank node identifiers. It maintains a blank node identifier issuer state.
hash
The lowercase, hexadecimal representation of a message digest.
hash algorithm
The hash algorithm used by URDNA2015, namely, SHA-256.

Normalization State

When performing the steps required by the normalization algorithm, it is helpful to track state in a data structure called the normalization state. The information contained in the normalization state is described below.

blank node to quads map
A data structure that maps RDF blank node identifiers to the RDF quads in which they appear in the input RDF dataset.
hash to blank nodes map
A data structure that maps a hash to a list of RDF blank node identifiers.
canonical issuer
An identifier issuer, initialized with the prefix _:c14n, for issuing canonical RDF blank node identifiers.

Blank Node Identifier Issuer State

During the normalization algorithm, it is sometimes necessary to issue new identifiers to RDF blank nodes. The Issue Identifier algorithm uses an identifier issuer to accomplish this task. The information an identifier issuer needs to keep track of is described below.

identifier prefix
The identifier prefix is a string that is used at the beginning of an RDF blank node identifier. It should be initialized to a string that is specified by the normalization algorithm. When generating a new RDF blank node identifier, the prefix is concatenated with a identifier counter. For example, _:c14n is a proper initial value for the identifier prefix that would produce RDF blank node identifiers like _:c14n1.
identifier counter
A counter that is appended to the identifier prefix to create an RDF blank node identifier. It is initialized to 0.
issued identifiers list
A list that tracks previously issued identifiers in the order in which they were issued. It also tracks the existing identifier that triggered the issuance, to prevent issuing more than one new identifier per existing identifier, and to allow RDF blank nodes to be reassigned identifiers some time after issuance.

Normalization Algorithm

The normalization algorithm converts an input RDF dataset into a normalized RDF dataset. This algorithm will assign deterministic identifiers to any RDF blank nodes in the input RDF dataset.

Documenting the algorithm is a WIP, various steps will become more detailed in time.

Overview

Algorithm

  1. Create the normalization state.
  2. For every RDF quad in input RDF dataset:
    1. For each RDF blank node that occurs in the RDF quad, add a reference to the RDF quad using the RDF blank node identifier in the blank node to quads map, creating a new entry if necessary.
  3. Create a list of non-normalized blank node identifiers non-normalized identifiers and populate it using the keys from the blank node to quads map.
  4. Initialize simple, a boolean flag, to true.
  5. While simple is true, issue canonical identifiers for RDF blank nodes:
    1. Set simple to false.
    2. Clear hash to blank nodes map.
    3. For each RDF blank node identifier identifier in non-normalized identifiers:
      1. Create a hash, hash, according to the Hash First Degree Quads algorithm.
      2. Add hash and identifier to hash to blank nodes map, creating a new entry if necessary.
    4. For each hash to identifier list mapping in hash to blank nodes map, lexicographically-sorted by hash:
      1. If the length of identifier list is greater than 1, continue to the next mapping.
      2. Use the Issue Identifier algorithm, passing canonical issuer and the single RDF blank node identifier in identifier list, identifier, to issue a canonical replacment identifier for identifier.
      3. Remove identifier from non-normalized identifiers.
      4. Remove hash from the hash to blank nodes map.
      5. Set simple to true.
  6. For each hash to identifier list mapping in hash to blank nodes map, lexicographically-sorted by hash:
    1. Create hash path list where each item will be a result of running the Hash N-Degree Quads algorithm.
    2. For each RDF blank node identifier identifier in identifier list:
      1. If a canonical identifier has already been issued for identifier, continue to the next identifier.
      2. Create temporary issuer, an identifier issuer initialized with the prefix _:b.
      3. Use the Issue Identifier algorithm, passing temporary issuer and identifier, to issue a new temporary RDF blank node identifier for identifier.
      4. Run the Hash N-Degree Quads algorithm, passing temporary issuer, and append the result to the hash path list.
    3. For each result in the hash path list, lexicographically-sorted by the hash in result:
      1. For each RDF blank node identifier, existing identifier, that was issued a temporary identifier by identifier issuer in result, issue a canonical identifier, in the same order, using the Issue Identifier algorithm, passing canonical issuer and existing identifier.
  7. For each RDF quad, quad, in input RDF dataset:
    1. Create a copy, quad copy, of quad and replace any existing RDF blank node identifiers using the canonical identifiers previously issued by canonical issuer.
    2. Add quad copy to the normalized RDF dataset.
  8. Return the normalized RDF dataset.

Issue Identifier Algorithm

Overview

Algorithm

This algorithm issues a new RDF blank node identifier for a given existing RDF blank node identifier. It also updates state information that tracks the order in which new RDF blank node identifiers were issued.

This algorithm takes an identifier issuer and an existing identifier as inputs. The output is a new issued identifier. The steps of the algorithm are:

  1. If there is already an issued identifier for existing identifier in issued identifiers list, return it.
  2. Generate issued identifier by concatenating identifier prefix with the string value of identifier counter.
  3. Append an item to issued identifiers list that maps existing identifier to issued identifier.
  4. Increment identifier counter.
  5. Return issued identifier.

Hash First Degree Quads

Add note that it is safe to cache the result of hashing the quads for later reuse (to reduce unnecessary computation).

Overview

Algorithm

This algorithm takes the normalization state and a reference blank node identifier as inputs.

  1. Initialize nquads to an empty list. It will be used to store quads in N-Quads format.
  2. Get the list of quads quads associated with the reference blank node identifier in the blank node to quads map.
  3. For each RDF quad quad in quads:
    1. Serialize the RDF quad in N-Quads format with the following special rule:
      1. If any component in quad is an RDF blank node, then serialize it using a special identifier as follows:
        1. If the blank node's existing RDF blank node identifier matches the reference blank node identifier then use the RDF blank node identifier _:a, otherwise, use the RDF blank node identifier _:z.
  4. Sort nquads in lexicographical order.
  5. Return the hash that results from passing the sorted, joined nquads through the hash algorithm.

Hash Related Blank Node

Overview

Algorithm

This algorithm creates a hash to identify how one RDF blank node is related to another. It takes the normalization state, a target RDF blank node identifier, a related RDF blank node identifier, a quad, an identifier issuer, issuer, and a string position as inputs.

  1. Set the identifier to use for related, preferring first the canonical identifier for related if issued, second the identifier issued by issuer if issued, and last, if necessary, the result of the Hash First Degree Quads algorithm, passing related.
  2. Initialize a string input to the value of position.
  3. If position is not g, append <, the value of the RDF predicate in quad, and > to input.
  4. Append identifier to input.
  5. Return the hash that results from passing input through the hash algorithm.

Hash N-Degree Quads

The relationship and difference between this algorithm and the hash first degree quads algorithm should be better explained. There may also be better names for the two algorithms.

The 'path' terminology could also be changed to better indicate what a path is (a particular deterministic serialization for a subgraph/subdataset of nodes without globally-unique identifiers).

Overview

Usually, when trying to determine if two nodes in a graph are equivalent, you simply compare their identifiers. However, what if the nodes don't have identifiers? Then you must determine if the two nodes have equivalent connections to equivalent nodes all throughout the whole graph. This is called the graph isomorphism problem. This algorithm approaches this problem by considering how one might draw a graph on paper. You can test to see if two nodes are equivalent by drawing the graph twice. The first time you draw the graph the first node is drawn in the center of the page. If you can draw the graph a second time such that it looks just like the first, except the second node is in the center of the page, then the nodes are equivalent. This algorithm essentially defines a determinitsic way to draw a graph where, if you begin with a particular node, the graph will always be drawn the same way. If two graphs are drawn the same way with two different nodes, then the nodes are equivalent. A hash is used to indicate a particular way that the graph has been drawn and can be used to compare nodes.

This algorithm works in concert with the main normalization algorithm to produce a unique, deterministic identifier for a particular blank node. This hash incorporates all of the information that is connected to the blank node as well as how it is connected. It does this by creating deterministic paths that emanate out from the blank node through any other adjacent blank nodes.

Algorithm

The inputs to this algorithm are the normalization state, the identifier for the RDF blank node to recursively hash quads for, and path identifier issuer which is an identifier issuer that issues temporary RDF blank node identifiers. The output from this algorithm will be a hash and the identifier issuer used to help generate it.

  1. Create an hash to related blank nodes map for storing hashes that identify related RDF blank nodes.
  2. Get a reference, quads, to the list of RDF quads in the blank node to quads map for the key identifier.
  3. For each quad in quads:
    1. For each component in quad, if component is the RDF subject, RDF object, and RDF graph name and it is an RDF blank node that is not identified by identifier:
      1. Set hash to the result of the Hash Related Blank Node algorithm, passing the identifier as target, the RDF blank node identifier for component as related, quad, path identifier issuer as issuer, and position as either s, o, or g based on whether component is an RDF subject, RDF object, RDF graph name, respectively.
      2. Add a mapping of hash to the RDF blank node identifier for component hash to related blank nodes map, adding an entry as necessary.
  4. For each related hash to blank node list mapping in hash to related blank nodes map, sorted lexicographically by related hash:
    1. Create a string chosen path.
    2. Create an unset chosen issuer variable.
    3. For each permutation of blank node list:
      1. Create a copy of issuer, issuer copy.
      2. Create a string path.
      3. Create a recursion list, to store RDF blank node identifiers that must be recursively processed by this algorithm.
      4. For each related in permutation:
        1. If a canonical identifier has been issued for related, append it to path.
        2. Otherwise, if issuer copy has not issued an identifier for related, append related to recursion list. Use the Issue Identifier algorithm, passing issuer copy and related and append the result to path.
        3. If chosen path is not empty and the length of path is greater than or equal to the length of chosen path and path is lexicographically greater than chosen path, then skip to the next permutation.
      5. For each related in recursion list:
        1. Set result to the result of recursively executing the Hash N-Degree Quads algorithm, passing related for identifier and issuer copy for path identifier issuer.
        2. Use the Issue Identifier algorithm, passing issuer copy and related and append the result to path.
        3. Append <, the hash in result, and > to path.
        4. Set issuer copy to the identifier issuer in result.
        5. If chosen path is not empty and the length of path is greater than or equal to the length of chosen path and path is lexicographically greater than chosen path, then skip to the next permutation.
      6. If chosen path is empty or path is lexicographically less than chosen path, set chosen path to path and chosen issuer to issuer copy.
    4. Return the chosen issuer and the hash that results from passing the lexicographically-sorted related hashes in hash to related blank nodes map and the joined chosen path through the hash algorithm.

Acknowledgements

The editors would like to thank Jeremy Carroll for his work on the graph normalization problem, Gavin Carothers for providing valuable feedback and testing input for the algorithm defined in this specification, Sir Tim Berners Lee for his thoughts on graph normalization over the years, Jesús Arias Fisteus for his work on a similar algorithm.