This document is licensed under a Creative Commons Attribution 3.0 License.
RDF [RDF-SYNTAX] describes a graph-based data model for making claims about the world and provides the foundation for reasoning upon that graph of information. At times, it becomes necessary to compare the differences between graphs, digitally sign graphs, or generate short identifiers for graphs via hashing algorithms. This document outlines an algorithm for normalizing RDF graphs such that these operations can be performed on the normalized graphs.
This document is merely a public working draft of a potential specification. It has no official standing of any kind and does not represent the support or consensus of any standards organisation.
This document is an experimental work in progress.
When data scientists discuss graph normalization, they do so in the context of achieving a particular set of goals. Since a directed graph can express the same information in a variety of different ways, it becomes necessary to be able to transform a graph of information into a standard form for the purposes of calculating differences between two graphs, generating a cryptographically-strong hash identifier for a graph, or digitally signing a graph.
Many RDF graphs, containing nodes with unique identifiers, can be normalized quite easily. However, when a node does not have a unique identifier, graph normalization becomes exponentially more difficult. This problem is called the graph isomorphism problem, and it is believed to be a NP-hard problem in the worst cases. This means that the problem is very difficult to solve in a reasonable timeframe for certain inputs. Luckily, these types of inputs are extremely rare in the real world. Graph isomorphisms are most commonly used as a denial of service attack and thus any software system attempting to solve the graph normalization problem should be able to detect graph isomorphisms correctly.
This document outlines an algorithm for generating a normalized RDF graph given an input RDF graph. The algorithm is called the Universal RDF Graph Normalization Algorithm 2011 or URGNA2011.
This document is a detailed specification for an RDF graph normalization algorithm. The document is primarily intended for the following audiences:
To understand the basics in this specification you must be familiar with basic RDF concepts [RDF-CONCEPTS]. A working knowledge of graph theory and graph isomorphisms is also recommended.
There are a number of ways that one may participate in the development of this specification:
This algorithm is a work in progress, do not implement it.
Normalization is the process of taking input graph and performing a transformation on that input that results in all aspects of the graph being arranged in a deterministic way in the output graph. That is, for two input graphs containing the same information, regardless of their arrangement, the output graphs will be identical. The problem is a fairly difficult technical problem to solve because it requires a directed graph to be ordered into a set of nodes and edges in a deterministic way. This is easy to do when all of the nodes have unique names, but very difficult to do when some of the nodes are not labeled and thus names must be generated for those nodes in a deterministic fashion.
In time, there may be more than one normalization algorithm that will need to be identified. For identification purposes, this algorithm is named the "Universal RDF Graph Normalization Algorithm 2011" (URGNA2011).
s<NUMBER>
or
c<NUMBER>
.
When performing the steps required by the normalization algorithm, it is helpful to track the many pieces of information in a data structure called the normalization state. Many of these pieces simply provide indexes into the graph. The information contained in the normalization state is described below.
_:
and that has a
path, via properties, that starts with the
node reference.
_:
and that has a path, via properties, that ends with
the node reference.
_:
, is not used by any other
node's label in the input graph, and does not
start with the characters _:c14n
. The prefix has two uses.
First it is used to temporarily name nodes during the normalization
algorithm in a way that doesn't collide with the names that already
exist as well as the names that will be generated by the normalization
algorithm. Second, it will eventually be set to _:c14n
to
generate the final, deterministic labels for nodes in the graph. This
prefix will be concatenated with the labeling counter to
produce a node label. For example, _:j8r3k
is
a proper initial value for the labeling prefix.
1
.
The normalization algorithm expands the input graph, flattens the data structure, and creates an initial set of names for all nodes in the graph. The flattened data structure is then processed by a node labeling algorithm in order to get a fully expanded and named list of nodes which is then sorted. The result is a deterministically named and ordered list of graph nodes.
@subject
and the value is the
concatenation of the labeling prefix
and the string value of the labeling counter.
Increment the labeling counter.@iri
and the value is
the value of the @subject
key in the node._:c14n
, relabel the node
using the Node Relabeling Algorithm.
@subject
key associated
with a value starting with _:
according to the steps in the
Deterministic Labeling Algorithm.
This algorithm renames a node by generating a unique new label and updating all references to that label in the node state map. The old label and the normalization state must be given as an input to the algorithm. The old label is the current label of the node that is to be relabeled.
The node relabeling algorithm is as follows:
_:c14n
and the
old label begins with _:c14n
then return as
the node has already been renamed.
The deterministic labeling algorithm takes the normalization state and produces a list of finished nodes that is sorted and contains deterministically named and expanded nodes from the graph.
_:c14n
, the
labeling counter to 1
,
the list of finished nodes to an empty array, and create
an empty array, the list of unfinished nodes._:
then put the node reference in the
list of finished nodes.
_:
then put the node reference in the
list of unfinished nodes.
_:c14n
from the list of unfinished nodes and
add it to the list of finished nodes.
The shallow comparison algorithm takes two unlabeled nodes, alpha and beta, as input and determines which one should come first in a sorted list. The following algorithm determines the steps that are executed in order to determine the node that should come first in a list:
_:
is first.
_:
, then the node associated with the
lexicographically lesser label is first._:c14n
is first.
The object comparison algorithm is designed to compare two graph node property values, alpha and beta, against the other. The algorithm is useful when sorting two lists of graph node properties.
@literal
is first.
@datatype
is first.
@language
is first.
@iri
is first.The deep comparison algorithm is used to compare the difference between two nodes, alpha and beta. A deep comparison takes the incoming and outgoing node edges in a graph into account if the number of properties and value of those properties are identical. The algorithm is helpful when sorting a list of nodes and will return whichever node should be placed first in a list if the two nodes are not truly equivalent.
When performing the steps required by the deep comparison algorithm, it is helpful to track state information about mappings. The information contained in a mapping state is described below.
1
.
s1
and its
index is set to 0
.
The deep comparison algorithm is as follows:
outgoing direction
to the algorithm as inputs.
outgoing direction
to the algorithm as inputs.
incoming direction
to the algorithm as inputs.
incoming direction
to the algorithm as inputs.
The node serialization algorithm takes a node state, a
mapping state, and a direction (either
outgoing direction
or incoming direction
) as
inputs and generates a deterministic serialization for the
node reference.
true
.
outgoing direction
and the
incoming list otherwise, if the label starts with
_:
, it is the target node label:
1
or the length of the
adjacent unserialized labels list, whichever is greater.0
, perform the
Combinatorial Serialization Algorithm
passing the node state, the mapping state for the
first iteration and a copy of it for each subsequent iteration, the
generated serialization label, the direction,
the adjacent serialized labels map, and the
adjacent unserialized labels list.
Decrement the maximum serialization combinations by
1
for each iteration.
The algorithm generates a serialization label given a label and a mapping state and returns the serialization label.
_:c14n
,
the serialization label is the letter c
followed by the number that follows _:c14n
in the
label.
s
followed by the string value of
mapping count. Increment the mapping count by
1
.
The combinatorial serialization algorithm takes a node state, a mapping state, a serialization label, a direction, a adjacent serialized labels map, and a adjacent unserialized labels list as inputs and generates the lexicographically least serialization of nodes relating to the node reference.
1
or the length of the
adjacent unserialized labels list, whichever is greater.
0
:
1
for each iteration.
outgoing direction
then directed serialization refers to the
outgoing serialization and the
directed serialization map refers to the
outgoing serialization map, otherwise it refers to the
incoming serialization and the
directed serialization map refers to the
incoming serialization map. Compare the
serialization string to the
directed serialization according to the
Serialization Comparison Algorithm.
If the serialization string is less than or equal to
the directed serialization:
The serialization comparison algorithm takes two serializations, alpha and beta and returns either which of the two is less than the other or that they are equal.
The mapping serialization algorithm incrementally updates the serialization string in a mapping state.
_
character and the
serialization key to the
serialization string.
true
.
0
onto the key stack.
The label serialization algorithm serializes information about a node that has been assigned a particular serialization label.
[
character to the
label serialization.@subject
property. The keys should be processed in
lexicographical order and their associated values should be processed
in the order produced by the
Object Comparison Algorithm:
<
KEY>
where KEY is the current key. Append string to the
label serialization.@iri
key with a
value that starts
with _:
, set the object string to
the value _:
. If the value does not
start with _:
, build the object string
using the pattern
<
IRI>
where IRI is the value associated with the
@iri
key.@literal
key and a
@datatype
key, build the object string
using the pattern
"
LITERAL"^^<
DATATYPE>
where LITERAL is the value associated with the
@literal
key and DATATYPE is the
value associated with the @datatype
key.@literal
key and a
@language
key, build the object string
using the pattern
"
LITERAL"@
LANGUAGE
where LITERAL is the value associated with the
@literal
key and LANGUAGE is the
value associated with the @language
key."
LITERAL"
where LITERAL is the value associated with the
current key.|
separator character to the
label serialization.]
character to the
label serialization.[
character to the
label serialization.<
PROPERTY>
<
REFERER>
where PROPERTY is the property associated with the
incoming reference and REFERER is either the subject of
the node referring to the label in the incoming reference
or _:
if REFERER begins with
_:
.
|
separator character to the
label serialization.]
character to the
label serialization.The editors would like to thank Jeremy Carroll for his work on the graph normalization problem. Gavin Carothers for providing valuable feedback and testing input for the algorithm defined in this specification. Sir Tim Berners Lee for his thoughts on graph normalization over the years. Jesús Arias Fisteus for his work on a similar algorithm. Finally, a huge thank you goes out to Dave Longley who designed and implemented the algorithms used in this specification, which turned out to be a monumentally difficult design challenge.