RDF Graph Normalization

A Standard RDF Graph Normalization Algorithm

Unofficial Draft 16 October 2011

Editors:
Manu Sporny, Digital Bazaar
Dave Longley, Digital Bazaar
Authors:
Dave Longley, Digital Bazaar
Manu Sporny, Digital Bazaar

Abstract

RDF [RDF-SYNTAX] describes a graph-based data model for making claims about the world and provides the foundation for reasoning upon that graph of information. At times, it becomes necessary to compare the differences between graphs, digitally sign graphs, or generate short identifiers for graphs via hashing algorithms. This document outlines an algorithm for normalizing RDF graphs such that these operations can be performed on the normalized graphs.

Status of This Document

This document is merely a public working draft of a potential specification. It has no official standing of any kind and does not represent the support or consensus of any standards organisation.

This document is an experimental work in progress.

Table of Contents

1. Introduction

When data scientists discuss graph normalization, they do so in the context of achieving a particular set of goals. Since a directed graph can express the same information in a variety of different ways, it becomes necessary to be able to transform a graph of information into a standard form for the purposes of calculating differences between two graphs, generating a cryptographically-strong hash identifier for a graph, or digitally signing a graph.

Many RDF graphs, containing nodes with unique identifiers, can be normalized quite easily. However, when a node does not have a unique identifier, graph normalization becomes exponentially more difficult. This problem is called the graph isomorphism problem, and it is believed to be a NP-hard problem in the worst cases. This means that the problem is very difficult to solve in a reasonable timeframe for certain inputs. Luckily, these types of inputs are extremely rare in the real world. Graph isomorphisms are most commonly used as a denial of service attack and thus any software system attempting to solve the graph normalization problem should be able to detect graph isomorphisms correctly.

This document outlines an algorithm for generating a normalized RDF graph given an input RDF graph. The algorithm is called the Universal RDF Graph Normalization Algorithm 2011 or URGNA2011.

1.1 How to Read this Document

This document is a detailed specification for an RDF graph normalization algorithm. The document is primarily intended for the following audiences:

To understand the basics in this specification you must be familiar with basic RDF concepts [RDF-CONCEPTS]. A working knowledge of graph theory and graph isomorphisms is also recommended.

1.2 Contributing

There are a number of ways that one may participate in the development of this specification:

2. Normalization

This algorithm is a work in progress, do not implement it.

Normalization is the process of taking input graph and performing a transformation on that input that results in all aspects of the graph being arranged in a deterministic way in the output graph. That is, for two input graphs containing the same information, regardless of their arrangement, the output graphs will be identical. The problem is a fairly difficult technical problem to solve because it requires a directed graph to be ordered into a set of nodes and edges in a deterministic way. This is easy to do when all of the nodes have unique names, but very difficult to do when some of the nodes are not labeled and thus names must be generated for those nodes in a deterministic fashion.

In time, there may be more than one normalization algorithm that will need to be identified. For identification purposes, this algorithm is named the "Universal RDF Graph Normalization Algorithm 2011" (URGNA2011).

2.1 Normalization Algorithm Terms

input graph
The data structure that is provided as input to the algorithm.
output graph
The data structure that is produced as output by the algorithm.
label
The subject IRI associated with a graph node.
list of expanded nodes
A list of all nodes in the input graph.
alpha and beta values
The words alpha and beta refer to the first and second nodes or values being examined in an algorithm. The names are merely used to refer to each input value to a comparison algorithm.
renaming counter
A counter that is used during the Node Relabeling Algorithm. The counter typically starts at one (1) and counts up for every node that is relabeled. There will be two such renaming counters in an implementation of the normalization algorithm. The first is the labeling counter and the second is the deterministic labeling counter.
serialization label
An identifier that is created to aid in the normalization process in the Deep Comparison Algorithm. The value typically takes the form of s<NUMBER> or c<NUMBER>.

2.2 Normalization State

When performing the steps required by the normalization algorithm, it is helpful to track the many pieces of information in a data structure called the normalization state. Many of these pieces simply provide indexes into the graph. The information contained in the normalization state is described below.

node state
Each node in the graph will be assigned a node state. This state contains the information necessary to deterministically label all nodes in the graph. A node state includes:
node reference
A node reference is a reference to a node in the graph. For a given node state, its node reference refers to the node that the state is for. When a node state is created, its node reference should be to the node it is created for.
outgoing list
Lists the labels for all nodes that are properties of the node reference. This list should be initialized by iterating over every object associated with a property in the node reference adding its label if it is another node.
incoming list
Lists the labels for all nodes in the graph for which the node reference is a property. This list is initialized to an empty list.
outgoing serialization map
Maps node labels to serialization labels. This map is initialized to an empty map. When this map is populated, it will be filled with keys that are the labels of every node in the graph with a label that begins with _: and that has a path, via properties, that starts with the node reference.
outgoing serialization
A string that can be lexicographically compared to the outgoing serializations of other node states. It is a representation of the outgoing serialization map and other related information. This string is initialized to an empty string.
incoming serialization map
Maps node labels to serialization labels. This map is initialized to an empty map. When this map is populated, it will be filled with keys that are the labels of every node in the graph with a label that begins with _: and that has a path, via properties, that ends with the node reference.
incoming serialization
A string that can be lexicographically compared to the outgoing serializations of other node states. It is a representation of the incoming serialization map and other related information. This string is initialized to an empty string.
node state map
A mapping from a node's label to a node state. It is initialized to an empty map.
labeling prefix
The labeling prefix is a string that is used as the beginning of a node label. It should be initialized to a random base string that starts with the characters _:, is not used by any other node's label in the input graph, and does not start with the characters _:c14n. The prefix has two uses. First it is used to temporarily name nodes during the normalization algorithm in a way that doesn't collide with the names that already exist as well as the names that will be generated by the normalization algorithm. Second, it will eventually be set to _:c14n to generate the final, deterministic labels for nodes in the graph. This prefix will be concatenated with the labeling counter to produce a node label. For example, _:j8r3k is a proper initial value for the labeling prefix.
labeling counter
A counter that is used to label nodes. It is appended to the labeling prefix to create a node label. It is initialized to 1.
map of flattened nodes
A map containing a representation of all nodes in the graph where the key is a node label and the value is a single JSON object that has no nested sub-objects and has had all properties for the same node merged into a single JSON object.

2.3 Normalization Algorithm

The normalization algorithm expands the input graph, flattens the data structure, and creates an initial set of names for all nodes in the graph. The flattened data structure is then processed by a node labeling algorithm in order to get a fully expanded and named list of nodes which is then sorted. The result is a deterministically named and ordered list of graph nodes.

  1. Expand the input graph according to the steps in the Expansion Algorithm and store the result as the expanded input.
  2. Create a normalization state.
  3. Initialize the map of flattened nodes by recursively processing every expanded node in the expanded input in depth-first order:
    1. If the expanded node is an unlabeled node, add a new key-value pair to the expanded node where the key is @subject and the value is the concatenation of the labeling prefix and the string value of the labeling counter. Increment the labeling counter.
    2. Add the expanded node to the map of flattened nodes:
      1. If the expanded node's label is already in the map of flattened nodes merge all properties from the entry in the map of flattened nodes into the expanded node.
      2. Go through every property associated with an array in the expanded node and remove any duplicate IRI entries from the array. If the resulting array only has one IRI entry, change it from an array to an object.
      3. Set the entry for the expanded node's label in the map of flattened nodes to the expanded node.
    3. After exiting the recursive step, replace the reference to the expanded node with an object containing a single key-value pair where the key is @iri and the value is the value of the @subject key in the node.
  4. For every entry in the map of flattened nodes, insert a key-value pair into the node state map where the key is the key from the map of flattened nodes and the value is a node state where its node reference refers to the value from the map of flattened nodes.
  5. Populate the incoming list for each node state by iterating over every node in the graph and adding its label to the incoming list associated with each node found in its properties.
  6. For every entry in the node state map that has a label that begins with _:c14n, relabel the node using the Node Relabeling Algorithm.
  7. Label all of the nodes that contain a @subject key associated with a value starting with _: according to the steps in the Deterministic Labeling Algorithm.

2.4 Node Relabeling Algorithm

This algorithm renames a node by generating a unique new label and updating all references to that label in the node state map. The old label and the normalization state must be given as an input to the algorithm. The old label is the current label of the node that is to be relabeled.

The node relabeling algorithm is as follows:

  1. If the labeling prefix is _:c14n and the old label begins with _:c14n then return as the node has already been renamed.
  2. Generate the new label by concatenating the labeling prefix with the string value of the labeling counter. Increment the labeling counter.
  3. For the node state associated with the old label, update every node in the incoming list by changing all the properties that reference the old label to the new label.
  4. Change the old label key in the node state map to the new label and set the associated node reference's label to the new label.

2.5 Deterministic Labeling Algorithm

The deterministic labeling algorithm takes the normalization state and produces a list of finished nodes that is sorted and contains deterministically named and expanded nodes from the graph.

  1. Set the labeling prefix to _:c14n, the labeling counter to 1, the list of finished nodes to an empty array, and create an empty array, the list of unfinished nodes.
  2. For each node reference in the node state map:
    1. If the node's label does not start with _: then put the node reference in the list of finished nodes.
    2. If the node's label does start with _: then put the node reference in the list of unfinished nodes.
  3. Append to the list of finished nodes by processing the remainder of the list of unfinished nodes until it is empty:
    1. Sort the list of unfinished nodes in descending order according to the Deep Comparison Algorithm to determine the sort order.
    2. Create a list of labels and initialize it to an empty array.
    3. For the first node from the list of unfinished nodes:
      1. Add its label to the list of labels.
      2. For each key-value pair from its associated outgoing serialization map, add the key to a list and then sort the list according to the lexicographical order of the keys' associated values. Append the list to the list of nodes to label.
      3. For each key-value pair from its associated incoming serialization map, add the key to a list and then sort the list according to the lexicographical order of the keys' associated values. Append the list to the list of nodes to label.
    4. For each label in the list of labels, relabel the associated node according to the Node Relabeling Algorithm. If any outgoing serialization map contains a key that matches the label, clear the map and set the associated outgoing serialization to an empty string. If any incoming serialization map contains a key that matches the label, clear the map and set the associated incoming serialization to an empty string.
    5. Remove each node with a label that starts with _:c14n from the list of unfinished nodes and add it to the list of finished nodes.
  4. Sort the list of finished nodes in descending order according to the Deep Comparison Algorithm to determine the sort order.

2.6 Shallow Comparison Algorithm

The shallow comparison algorithm takes two unlabeled nodes, alpha and beta, as input and determines which one should come first in a sorted list. The following algorithm determines the steps that are executed in order to determine the node that should come first in a list:

  1. Compare the total number of node properties. The node with fewer properties is first.
  2. Lexicographically sort the property IRIs for each node and compare the sorted lists. If an IRI is found to be lexicographically smaller, the node containing that IRI is first.
  3. Compare the values of each property against one another:
    1. The node associated with fewer property values is first.
    2. Create an alpha list by adding all values associated with the alpha property that are not unlabeled nodes.
    3. Create a beta list by adding all values associated with the beta property that is not an unlabeled node.
    4. Compare the length of alpha list and beta list. The node associated with the list containing the fewer number of items is first.
    5. Sort alpha list and beta list according to the Object Comparison Algorithm. For each offset into the alpha list, compare the item at the offset against the item at the same offset in the beta list according to the Object Comparison Algorithm. The node associated with the lesser item is first.
  4. Process the incoming lists associated with each node to determine order:
    1. The node with the shortest incoming list is first.
    2. Sort the incoming lists according to incoming property and then incoming label.
    3. The node associated with the fewest number of incoming nodes is first.
    4. For each offset into the incoming lists, compare the associated properties and labels:
      1. The node associated with a label that does not begin with _: is first.
      2. If the nodes' labels do not begin with _:, then the node associated with the lexicographically lesser label is first.
      3. The node associated with the lexicographically lesser associated property is first.
      4. The node with the label that does not begin with _:c14n is first.
      5. The node with the lexicographically lesser label is first.
  5. Otherwise, the nodes are equivalent.

2.7 Object Comparison Algorithm

The object comparison algorithm is designed to compare two graph node property values, alpha and beta, against the other. The algorithm is useful when sorting two lists of graph node properties.

  1. If one of the values is a string and the other is not, the value that is a string is first.
  2. If both values are strings, the lexicographically lesser string is first.
  3. If one of the values is a literal and the other is not, the value that is a literal is first.
  4. If both values are literals:
    1. The lexicographically lesser string associated with @literal is first.
    2. The lexicographically lesser string associated with @datatype is first.
    3. The lexicographically lesser string associated with @language is first.
  5. If both values are expanded IRIs, the lexicographically lesser string associated with @iri is first.
  6. Otherwise, the two values are equivalent.

2.8 Deep Comparison Algorithm

The deep comparison algorithm is used to compare the difference between two nodes, alpha and beta. A deep comparison takes the incoming and outgoing node edges in a graph into account if the number of properties and value of those properties are identical. The algorithm is helpful when sorting a list of nodes and will return whichever node should be placed first in a list if the two nodes are not truly equivalent.

When performing the steps required by the deep comparison algorithm, it is helpful to track state information about mappings. The information contained in a mapping state is described below.

mapping state
mapping counter
Keeps track of the number of nodes that have been mapped to serialization labels. It is initialized to 1.
processed labels map
Keeps track of the labels of nodes that have already been assigned serialization labels. It is initialized to an empty map.
serialized labels map
Maps a node label to its associated serialization label. It is initialized to an empty map.
adjacent info map
Maps a serialization label to the node label associated with it, the list of sorted serialization labels for adjacent nodes, and the map of adjacent node serialiation labels to their associated node labels. It is initialized to an empty map.
key stack
A stack where each element contains an array of adjacent serialization labels and an index into that array. It is initialized to a stack containing a single element where its array contains a single string element s1 and its index is set to 0.
serialized keys
Keeps track of which serialization labels have already been written at least once to the serialization string. It is initialized to an empty map.
serialization string
A string that is incrementally updated as a serialization is built. It is initialized to an empty string.

The deep comparison algorithm is as follows:

  1. Perform a comparison between alpha and beta according to the Shallow Comparison Algorithm. If the result does not show that the two nodes are equivalent, return the result.
  2. Compare incoming and outgoing edges for each node, updating their associated node state as each node is processed:
    1. If the outgoing serialization map for alpha is empty, generate the serialization according to the Node Serialization Algorithm. Provide alpha's node state, a new mapping state, outgoing direction to the algorithm as inputs.
    2. If the outgoing serialization map for beta is empty, generate the serialization according to the Node Serialization Algorithm. Provide beta's node state, a new mapping state, and outgoing direction to the algorithm as inputs.
    3. If alpha's outgoing serialization is lexicographically less than beta's, then alpha is first. If it is greater, then beta is first.
    4. If the incoming serialization map for alpha is empty, generate the serialization according to the Node Serialization Algorithm. Provide alpha's node state, a new mapping state with its serialized labels map set to a copy of alpha's outgoing serialization map, and incoming direction to the algorithm as inputs.
    5. If the incoming serialization map for beta is empty, generate the serialization according to the Node Serialization Algorithm. Provide beta's node state, a new mapping state with its serialized labels map set to a copy of beta's outgoing serialization map, and incoming direction to the algorithm as inputs.
    6. If alpha's incoming serialization is lexicographically less than beta's, then alpha is first. If it is greater, then beta is first.

2.9 Node Serialization Algorithm

The node serialization algorithm takes a node state, a mapping state, and a direction (either outgoing direction or incoming direction) as inputs and generates a deterministic serialization for the node reference.

  1. If the label exists in the processed labels map, terminate the algorithm as the serialization label has already been created.
  2. Set the value associated with the label in the processed labels map to true.
  3. Generate the next serialization label for the label according to the Serialization Label Generation Algorithm.
  4. Create an empty map called the adjacent serialized labels map that will store mappings from serialized labels to adjacent node labels.
  5. Create an empty array called the adjacent unserialized labels list that will store labels of adjacent nodes that haven't been assigned serialization labels yet.
  6. For every label in a list, where the list the outgoing list if the direction is outgoing direction and the incoming list otherwise, if the label starts with _:, it is the target node label:
    1. Look up the target node label in the processed labels map and if a mapping exists, update the adjacent serialized labels map where the key is the value in the serialization map and the value is the target node label.
    2. Otherwise, add the target node label to the adjacent unserialized labels list.
  7. Set the maximum serialization combinations to 1 or the length of the adjacent unserialized labels list, whichever is greater.
  8. While the maximum serialization combinations is greater than 0, perform the Combinatorial Serialization Algorithm passing the node state, the mapping state for the first iteration and a copy of it for each subsequent iteration, the generated serialization label, the direction, the adjacent serialized labels map, and the adjacent unserialized labels list. Decrement the maximum serialization combinations by 1 for each iteration.

2.10 Serialization Label Generation Algorithm

The algorithm generates a serialization label given a label and a mapping state and returns the serialization label.

  1. If the label is already in the serialization labels map, return its associated value.
  2. If the label starts with the string _:c14n, the serialization label is the letter c followed by the number that follows _:c14n in the label.
  3. Otherwise, the serialization label is the letter s followed by the string value of mapping count. Increment the mapping count by 1.
  4. Create a new key-value pair in the serialization labels map where the key is the label and the value is the generated serialization label.

2.11 Combinatorial Serialization Algorithm

The combinatorial serialization algorithm takes a node state, a mapping state, a serialization label, a direction, a adjacent serialized labels map, and a adjacent unserialized labels list as inputs and generates the lexicographically least serialization of nodes relating to the node reference.

  1. If the adjacent unserialized labels list is not empty:
    1. Copy the adjacent serialized labels map to the adjacent serialized labels map copy.
    2. Remove the first unserialized label from the adjacent unserialized labels list and create a new new serialization label according to the Serialization Label Generation Algorithm.
    3. Create a new key-value mapping in the adjacent serialized labels map copy where the key is the new serialization label and the value is the unserialized label.
    4. Set the maximum serialization rotations to 1 or the length of the adjacent unserialized labels list, whichever is greater.
    5. While the maximum serialization rotations is greater than 0:
      1. Recursively perform the Combinatorial Serialization Algorithm passing the mapping state for the first iteration of the loop, and a copy of it for each subsequent iteration.
      2. Rotate the elements in the adjacent unserialized labels list by shifting each of them once to the right, moving the element at the end of the list to the beginning of the list.
      3. Decrement the maximum serialization rotations by 1 for each iteration.
  2. If the adjacent unserialized labels list is empty:
    1. Create a list of keys from the keys in the adjacent serialized labels map and sort it lexicographically.
    2. Add a key-value pair to the adjacent info map where the key is the serialization label and the value is an object containing the node reference's label, the list of keys and the adjacent serialized labels map.
    3. Update the serialization string according to the Mapping Serialization Algorithm.
    4. If the direction is outgoing direction then directed serialization refers to the outgoing serialization and the directed serialization map refers to the outgoing serialization map, otherwise it refers to the incoming serialization and the directed serialization map refers to the incoming serialization map. Compare the serialization string to the directed serialization according to the Serialization Comparison Algorithm. If the serialization string is less than or equal to the directed serialization:
      1. For each value in the list of keys, run the Node Serialization Algorithm.
      2. Update the serialization string according to the Mapping Serialization Algorithm.
      3. Compare the serialization string to the directed serialization again and if it is less than or equal and the length of the serialization string is greater than or equal to the length of the directed serialization, then set the directed serialization to the serialization string and set the directed serialization map to the serialized labels map.

2.12 Serialization Comparison Algorithm

The serialization comparison algorithm takes two serializations, alpha and beta and returns either which of the two is less than the other or that they are equal.

  1. Whichever serialization is an empty string is greater. If they are both empty strings, they are equal.
  2. Return the result of a lexicographical comparison of alpha and beta up to the number of characters in the shortest of the two serializations.

2.13 Mapping Serialization Algorithm

The mapping serialization algorithm incrementally updates the serialization string in a mapping state.

  1. If the key stack is not empty:
    1. Pop the serialization key info off of the key stack.
    2. For each serialization key in the serialization key info array, starting at the serialization key index from the serialization key info:
      1. If the serialization key is not in the adjacent info map, push the serialization key info onto the key stack and exit from this loop.
      2. If the serialization key is a key in serialized keys, a cycle has been detected. Append the concatenation of the _ character and the serialization key to the serialization string.
      3. Otherwise, serialize all outgoing and incoming edges in the related node by performing the following steps:
        1. Mark the serialization key as having been processed by adding a new key-value pair to serialized keys where the key is the serialization key and the value is true.
        2. Set the serialization fragment to the value of the serialization key.
        3. Set the adjacent info to the value of the serialization key in the adjacent info map.
        4. Set the adjacent node label to the node label from the adjacent info.
        5. If a mapping for the adjacent node label exists in the map of all labels:
          1. Append the result of the Label Serialization Algorithm to the serialization fragment.
        6. Append all of the keys in the adjacent info to the serialization fragment.
        7. Append the serialization fragment to the serialization string.
        8. Push a new key info object containing the keys from the adjacent info and an index of 0 onto the key stack.
        9. Recursively update the serialization string according to the Mapping Serialization Algorithm.

2.14 Label Serialization Algorithm

The label serialization algorithm serializes information about a node that has been assigned a particular serialization label.

  1. Initialize the label serialization to an empty string.
  2. Append the [ character to the label serialization.
  3. Append all properties to the label serialization by processing each key-value pair in the node reference, excluding the @subject property. The keys should be processed in lexicographical order and their associated values should be processed in the order produced by the Object Comparison Algorithm:
    1. Build a string using the pattern <KEY> where KEY is the current key. Append string to the label serialization.
    2. The value may be a single object or an array of objects. Process all of the objects that are associated with the key, building an object string for each item:
      1. If the object contains an @iri key with a value that starts with _:, set the object string to the value _:. If the value does not start with _:, build the object string using the pattern <IRI> where IRI is the value associated with the @iri key.
      2. If the object contains a @literal key and a @datatype key, build the object string using the pattern "LITERAL"^^<DATATYPE> where LITERAL is the value associated with the @literal key and DATATYPE is the value associated with the @datatype key.
      3. If the object contains a @literal key and a @language key, build the object string using the pattern "LITERAL"@LANGUAGE where LITERAL is the value associated with the @literal key and LANGUAGE is the value associated with the @language key.
      4. Otherwise, the value is a string. Build the object string using the pattern "LITERAL" where LITERAL is the value associated with the current key.
      5. If this is the second iteration of the loop, append a | separator character to the label serialization.
      6. Append the object string to the label serialization.
  4. Append the ] character to the label serialization.
  5. Append the [ character to the label serialization.
  6. Append all incoming references for the current label to the label serialization by processing all of the items associated with the incoming list:
    1. Build a reference string using the pattern <PROPERTY><REFERER> where PROPERTY is the property associated with the incoming reference and REFERER is either the subject of the node referring to the label in the incoming reference or _: if REFERER begins with _:.
    2. If this is the second iteration of the loop, append a | separator character to the label serialization.
    3. Append the reference string to the label serialization.
  7. Append the ] character to the label serialization.
  8. Append all adjacent node labels to the label serialization by concatenating the string value for all of them, one after the other, to the label serialization.
  9. Push the adjacent node labels onto the key stack and append the result of the Mapping Serialization Algorithm to the label serialization.

A. Acknowledgements

The editors would like to thank Jeremy Carroll for his work on the graph normalization problem. Gavin Carothers for providing valuable feedback and testing input for the algorithm defined in this specification. Sir Tim Berners Lee for his thoughts on graph normalization over the years. Jesús Arias Fisteus for his work on a similar algorithm. Finally, a huge thank you goes out to Dave Longley who designed and implemented the algorithms used in this specification, which turned out to be a monumentally difficult design challenge.

B. References

B.1 Normative references

[RDF-CONCEPTS]
Graham Klyne; Jeremy J. Carroll. Resource Description Framework (RDF): Concepts and Abstract Syntax. 10 February 2004. W3C Recommendation. URL: http://www.w3.org/TR/2004/REC-rdf-concepts-20040210

B.2 Informative references

[RDF-SYNTAX]
Ora Lassila; Ralph R. Swick. Resource Description Framework (RDF) Model and Syntax Specification. 22 February 1999. W3C Recommendation. URL: http://www.w3.org/TR/1999/REC-rdf-syntax-19990222