Commit 558e1a9

added REST API and binary RDF docs
Signed-off-by: Jeen Broekstra <jeen.broekstra@gmail.com>
1 parent 00b39d8 commit 558e1a9

4 files changed

Lines changed: 1261 additions & 0 deletions

build-html

Lines changed: 2 additions & 0 deletions
@@ -4,5 +4,7 @@ asciidoctor -n -D html/server-workbench-console doc/server-workbench-console/ind
 asciidoctor -n -D html/migration doc/migration/index.adoc
 asciidoctor -n -D html/programming doc/programming/index.adoc
 asciidoctor -n -D html/getting-started doc/getting-started/index.adoc
+asciidoctor -n -D html/rest-api doc/rest-api/index.adoc
+asciidoctor -n -D html/rdf4j-binary doc/rdf4j-binary/index.adoc
 cp -r doc/images html/
 cp -r doc/css html/

doc/index.adoc

Lines changed: 2 additions & 0 deletions
@@ -17,6 +17,8 @@ include::header.adoc[]
 - link:migration[Sesame to RDF4J Migration Guide]
 - link:programming[Programming with RDF4J]
 - link:server-workbench-console[RDF4J Server, Workbench, and Console]
+- link:rest-api[RDF4J Server REST API]
+- link:rdf4j-binary[RDF4J Binary RDF Format]

 === Javadoc

doc/rdf4j-binary/index.adoc

Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
1+
include::../shared-settings.adoc[]
2+
:toclevels: 3
3+
include::../header.adoc[]
4+
5+
:numbered!:
6+
== RDF4J Binary RDF Format
7+
8+
RDF4J supports reading and writing a custom binary RDF serialization format. Its main features are reduced parsing overhead and minimal memory requirements (for handling really long literals, amongst other things).
9+
10+
:numbered:
11+
== MIME Content Type
12+
13+
RDF4J assigns the content type `application/x-binary-rdf` to its format.
14+
15+
== Overall design
16+
17+
A result encoded in the RDF4J Binary RDF format consists of a header followed by zero or more records, and closes with an `END_OF_DATA` marker (see below). Values are stored in network order (Big-Endian).
18+
19+
All string values use UTF-16 encoding. Reference ids are assigned to recurring values to avoid having to repeat long strings.
20+
21+
== Header
22+
23+
The header is 8 bytes long:
24+
25+
- Bytes 0-3 contain a magic number, namely the ASCII codes for the string “BRDF”, which stands for Binary RDF.
26+
- Bytes 4-7 specify the format version (a 4-byte signed integer).
27+
28+
For example, a header for a result in format version 1 will look like this:
29+
30+
....
31+
byte: 0 1 2 3 | 4 5 6 7 |
32+
-------------------+-------------+
33+
value: B R D F | 0 0 0 1 |
34+
....
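
The header layout above can be sketched in a few lines of Python. This is only an illustration, not RDF4J code: the helper names `write_header` and `read_header` are hypothetical, and the sketch relies on the big-endian byte order stated in the overall design.

```python
import struct

MAGIC = b"BRDF"                 # ASCII magic number, bytes 0-3
HEADER = struct.Struct(">4si")  # big-endian: 4 raw bytes + 4-byte signed int


def write_header(version: int) -> bytes:
    """Return the 8-byte header for the given format version."""
    return HEADER.pack(MAGIC, version)


def read_header(data: bytes) -> int:
    """Check the magic number and return the format version."""
    magic, version = HEADER.unpack(data[:HEADER.size])
    if magic != MAGIC:
        raise ValueError(f"not a Binary RDF stream: magic={magic!r}")
    return version


header = write_header(1)   # the version 1 header from the example above
version = read_header(header)
```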
35+
36+
== Content records
37+
38+
Zero or more records follow after the header. Each record can be a namespace declaration, a comment, a value reference declaration, or a statement.
39+
40+
Each record starts with a record type marker (a single byte). The following record types are defined in the current format:
41+
42+
- `NAMESPACE_DECL` (byte value: 0):
43+
This indicates a namespace declaration record.
44+
- `STATEMENT` (byte value: 1):
45+
This indicates an RDF statement record.
46+
- `COMMENT` (byte value: 2):
47+
This indicates a comment record.
48+
- `VALUE_DECL` (byte value: 3):
49+
This indicates a value declaration.
50+
- `END_OF_DATA` (byte value: 127):
51+
This indicates the end of the data stream has been reached.
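
As a rough sketch, the record type markers listed above map naturally onto an enumeration that a reader can dispatch on. The names `RecordType` and `read_record_type` are illustrative assumptions, not part of any RDF4J API.

```python
import io
from enum import IntEnum
from typing import BinaryIO


class RecordType(IntEnum):
    """Record type markers as listed in this section."""
    NAMESPACE_DECL = 0
    STATEMENT = 1
    COMMENT = 2
    VALUE_DECL = 3
    END_OF_DATA = 127


def read_record_type(stream: BinaryIO) -> RecordType:
    """Read the single marker byte that starts every record."""
    marker = stream.read(1)
    if len(marker) != 1:
        raise EOFError("stream ended before an END_OF_DATA marker was seen")
    return RecordType(marker[0])  # raises ValueError for an unknown marker


# A stream holding nothing but the end marker:
kind = read_record_type(io.BytesIO(b"\x7f"))
```

A reader would call this in a loop after the header, handling each record according to its type and stopping at `END_OF_DATA`.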
52+
53+
=== Strings
54+
55+
All strings are encoded as UTF-16 byte arrays. A string is preceded by a 4-byte signed integer that encodes the length of the string (specifically, it records the number of UTF-16 code units, not the number of bytes). For example, the string ‘foo’ will be encoded as follows:
56+
57+
....
58+
byte: 0 1 2 3 | 4 6 8 |
59+
---------------+-------+
60+
value: 0 0 0 3 | f o o |
61+
....
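
The string layout can be illustrated with the following sketch. It assumes that the UTF-16 bytes are big-endian and written without a byte-order mark, which is an inference from the network byte order stated earlier and from the example above (three characters occupying bytes 4 to 9). The function names are hypothetical.

```python
import struct
from typing import Tuple


def encode_string(text: str) -> bytes:
    """Length prefix (code units, 4-byte signed int) followed by UTF-16 bytes."""
    payload = text.encode("utf-16-be")  # assumption: big-endian, no BOM
    return struct.pack(">i", len(payload) // 2) + payload


def decode_string(data: bytes, offset: int = 0) -> Tuple[str, int]:
    """Decode one string at `offset`; return it with the offset just past it."""
    (units,) = struct.unpack_from(">i", data, offset)
    start = offset + 4
    end = start + 2 * units             # two bytes per code unit
    return data[start:end].decode("utf-16-be"), end


encoded = encode_string("foo")          # 4 length bytes + 6 text bytes
text, next_offset = decode_string(encoded)
```

Because the prefix counts code units, a character outside the Basic Multilingual Plane contributes 2 to the length and occupies 4 bytes.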
62+
63+
=== RDF Values
64+
65+
Each RDF value type has its own specific 1-byte record type marker:
66+
67+
- `NULL_VALUE` (byte value: 0)
68+
marks an empty RDF value (this is used, for example, in encoding of context in statements)
69+
- `URI_VALUE` (byte value: 1)
70+
marks a URI value
71+
- `BNODE_VALUE` (byte value: 2)
72+
marks a blank node value
73+
- `PLAIN_LITERAL_VALUE` (byte value: 3)
74+
marks a plain literal value
75+
- `LANG_LITERAL_VALUE` (byte value: 4)
76+
marks a language-tagged literal value
77+
- `DATATYPE_LITERAL_VALUE` (byte value: 5)
78+
marks a datatyped literal value
79+
80+
==== URIs
81+
82+
URIs are recorded by the `URI_VALUE` marker followed by the URI encoded as a string.
83+
84+
==== Blank nodes
85+
86+
Blank nodes are recorded by the `BNODE_VALUE` marker followed by the id of the blank node encoded as a string.
87+
88+
==== Literals
89+
90+
Depending on the specific literal type (plain, language-tagged, datatyped), a literal is recorded by one of the markers `PLAIN_LITERAL_VALUE`, `LANG_LITERAL_VALUE` or `DATATYPE_LITERAL_VALUE`. This is followed by the lexical label of the literal as a string, optionally followed by either a language tag encoded as a string value or a datatype encoded as a string.
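
A minimal sketch of the value encodings described above, under the same assumption about strings (big-endian UTF-16 without a byte-order mark). All function names are illustrative, and the sketch treats the language tag and the datatype as mutually exclusive, following the "either ... or" wording above.

```python
import struct
from typing import Optional

NULL_VALUE, URI_VALUE, BNODE_VALUE = 0, 1, 2
PLAIN_LITERAL_VALUE, LANG_LITERAL_VALUE, DATATYPE_LITERAL_VALUE = 3, 4, 5


def _string(text: str) -> bytes:
    payload = text.encode("utf-16-be")  # assumption: big-endian, no BOM
    return struct.pack(">i", len(payload) // 2) + payload


def encode_uri(uri: str) -> bytes:
    return bytes([URI_VALUE]) + _string(uri)


def encode_bnode(node_id: str) -> bytes:
    return bytes([BNODE_VALUE]) + _string(node_id)


def encode_literal(label: str, lang: Optional[str] = None,
                   datatype: Optional[str] = None) -> bytes:
    """Marker, lexical label, then an optional language tag or datatype."""
    if lang is not None and datatype is not None:
        raise ValueError("a literal carries a language tag or a datatype, not both")
    if lang is not None:
        return bytes([LANG_LITERAL_VALUE]) + _string(label) + _string(lang)
    if datatype is not None:
        return bytes([DATATYPE_LITERAL_VALUE]) + _string(label) + _string(datatype)
    return bytes([PLAIN_LITERAL_VALUE]) + _string(label)


plain = encode_literal("George")
tagged = encode_literal("chat", lang="fr")
```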
91+
92+
=== Value reference declaration records
93+
94+
To enable further compression of the byte stream, the Binary RDF format enables encoding of reference-identifiers for often-repeated RDF values. A value reference declaration starts with a `VALUE_DECL` record marker (1 byte, value 3), followed by a 4-byte signed integer that encodes the reference id. This is followed by the actual value, encoded as an RDF value (see above).
95+
96+
For example, a declaration that assigns id 42 to the URI ‘http://example.org/HHGTTG’ will look like this:
97+
98+
....
99+
byte: 0 | 1 2 3 4 | 5 | 6 7 8 9 | 10 12 14 16 18 (etc) |
100+
----------+---------+---+---------+----------------------+
101+
value: 3 | 0 0 0 42| 1 | 0 0 0 25| h t t p : (etc) |
102+
....
103+
104+
Explanation: byte 0 marks the record as a `VALUE_DECL`, bytes 1-4 encode the reference id, byte 5 encodes the value type (`URI_VALUE`), bytes 6-9 encode the length of the string value, and bytes 10 onwards encode the actual string value as a UTF-16 encoded byte array.
105+
106+
Note that the format allows the same reference id to be assigned more than once. When a second value declaration occurs, it effectively overwrites a previous declaration, reassigning the id to a new value for all following statements.
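
The declaration record and the overwrite rule can be sketched as follows. The encoder name is hypothetical, strings are assumed to be big-endian UTF-16 without a byte-order mark, and the second URI is made up purely to show reassignment. On the reading side, a plain dictionary is enough to model the rule that a later declaration replaces an earlier one.

```python
import struct

VALUE_DECL = 3
URI_VALUE = 1


def _string(text: str) -> bytes:
    payload = text.encode("utf-16-be")  # assumption: big-endian, no BOM
    return struct.pack(">i", len(payload) // 2) + payload


def encode_value_decl(ref_id: int, encoded_value: bytes) -> bytes:
    """VALUE_DECL marker, 4-byte signed reference id, then the encoded value."""
    return bytes([VALUE_DECL]) + struct.pack(">i", ref_id) + encoded_value


# The example above: id 42 for the URI http://example.org/HHGTTG
uri_value = bytes([URI_VALUE]) + _string("http://example.org/HHGTTG")
record = encode_value_decl(42, uri_value)

# Reader side: a later declaration for the same id simply replaces the old one.
declared = {}
declared[42] = "http://example.org/HHGTTG"
declared[42] = "http://example.org/other"   # hypothetical second declaration
```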
107+
108+
=== Namespace records
109+
110+
A namespace declaration is recorded by the `NAMESPACE_DECL` marker. Next follows the namespace prefix, as a string, followed by the namespace URI, as a string.
111+
112+
For example, a namespace declaration record for prefix ‘ex’ and namespace uri ‘http://example.org/’ will look like this:
113+
114+
....
115+
byte: 0 | 1 2 3 4 | 5 7 | 9 10 11 12 | 13 15 17 19 21 (etc) |
116+
----------+---------+-----+----------+----------------------+
117+
value: 0 | 0 0 0 2 | e x | 0 0 0 19 | h t t p : (etc) |
118+
....
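
A namespace record could be produced along these lines. This is a sketch with hypothetical function names, again assuming big-endian UTF-16 without a byte-order mark. Since every character takes two bytes, the prefix ‘ex’ occupies four bytes after its length field, and the whole record for this example is 51 bytes long.

```python
import struct

NAMESPACE_DECL = 0


def _string(text: str) -> bytes:
    payload = text.encode("utf-16-be")  # assumption: big-endian, no BOM
    return struct.pack(">i", len(payload) // 2) + payload


def encode_namespace(prefix: str, namespace_uri: str) -> bytes:
    """NAMESPACE_DECL marker, prefix string, then namespace URI string."""
    return bytes([NAMESPACE_DECL]) + _string(prefix) + _string(namespace_uri)


record = encode_namespace("ex", "http://example.org/")
```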
119+
120+
=== Comment records
121+
122+
A comment is recorded by the `COMMENT` marker, followed by the comment text encoded as a string.
123+
124+
For example, a record for the comment ‘example’ will look like this:
125+
126+
....
127+
byte: 0 | 1 2 3 4 | 5 7 9 11 13 15 17 |
128+
----------+---------+-------------------+
129+
value: 2 | 0 0 0 7 | e x a m p l e |
130+
....
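
Reading a comment record back is the mirror image of writing it. The sketch below uses a hypothetical `decode_comment` helper and assumes big-endian UTF-16 without a byte-order mark. It decodes the ‘example’ record shown above.

```python
import struct

COMMENT = 2


def decode_comment(record: bytes) -> str:
    """Return the text of a comment record that starts at byte 0."""
    if record[0] != COMMENT:
        raise ValueError(f"not a comment record: marker={record[0]}")
    (units,) = struct.unpack_from(">i", record, 1)   # bytes 1-4: length
    # The text starts at byte 5 and takes two bytes per code unit.
    return record[5:5 + 2 * units].decode("utf-16-be")


sample = bytes([COMMENT]) + struct.pack(">i", 7) + "example".encode("utf-16-be")
text = decode_comment(sample)
```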
131+
132+
=== Statement records
133+
134+
Each statement record starts with a `STATEMENT` marker (1 byte, value 1). For the encoding of the statement’s subject, predicate, object and context, either the RDF value is encoded directly, or a previously assigned value reference (see the section on value reference declaration records) is reused. A value reference is recorded with the `VALUE_REF` marker (1 byte, value 6), followed by the reference id as a 4-byte signed integer.
135+
136+
==== An example statement
137+
138+
Consider the following RDF statement:
139+
140+
<http://example.org/George> <http://example.org/name> "George" .
141+
142+
Assume that the subject and predicate have previously been assigned reference ids
143+
(42 and 43 respectively). The object value has not been assigned a reference id.
144+
145+
This statement would then be recorded as follows:
146+
147+
....
148+
byte: 0 | 1 | 2 3 4 5 | 6 | 7 8 9 10| 11 | 12 13 14 15 | 16 18 20 22 24 26 | 28 |
149+
---------+---+---------+---+---------+----+-------------+-------------------+----+
150+
value: 1 | 6 | 0 0 0 42| 6 | 0 0 0 43| 3 | 0 0 0 6 | G e o r g e | 0 |
151+
....
152+
153+
Explanation: byte 0 marks the record as a `STATEMENT`. Byte 1 marks the subject of the statement as a `VALUE_REF`. Bytes 2-5 encode the reference id of the subject. Byte 6 marks the predicate of the statement as a `VALUE_REF`. Bytes 7-10 encode the reference id of the predicate. Byte 11 marks the object of the statement as a `PLAIN_LITERAL_VALUE`, bytes 12-15 encode the length of the lexical value of the literal, and bytes 16-27 encode the literal’s lexical value as a UTF-16 encoded byte array. Finally, byte 28 marks the context field of the statement as a `NULL_VALUE`.
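
The example statement can be reproduced with a short sketch. The helper names are hypothetical and strings are assumed to be big-endian UTF-16 without a byte-order mark. The literal ‘George’ has six code units, so its length field holds 6 and its text takes 12 bytes, which gives a record of 29 bytes (offsets 0 to 28).

```python
import struct

STATEMENT = 1
NULL_VALUE = 0
PLAIN_LITERAL_VALUE = 3
VALUE_REF = 6


def value_ref(ref_id: int) -> bytes:
    """VALUE_REF marker followed by the 4-byte signed reference id."""
    return bytes([VALUE_REF]) + struct.pack(">i", ref_id)


def plain_literal(label: str) -> bytes:
    payload = label.encode("utf-16-be")  # assumption: big-endian, no BOM
    return bytes([PLAIN_LITERAL_VALUE]) + struct.pack(">i", len(payload) // 2) + payload


def encode_statement(subject: bytes, predicate: bytes, obj: bytes,
                     context: bytes = bytes([NULL_VALUE])) -> bytes:
    """STATEMENT marker followed by the four already-encoded fields."""
    return bytes([STATEMENT]) + subject + predicate + obj + context


# Subject and predicate reuse ids 42 and 43; the object is encoded directly.
record = encode_statement(value_ref(42), value_ref(43), plain_literal("George"))
```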
154+
155+
== Buffering and value reference handling
156+
157+
The binary RDF format enables declaration of value references for more compressed representation of often-repeated values.
158+
159+
A binary RDF producer may choose to introduce a reference for every RDF value. This is a simple approach, but it produces suboptimal compression: for a value that occurs only once, encoding the value directly uses fewer bytes than introducing a reference for it.
160+
161+
Another approach is to introduce a buffered writing strategy: statements to be serialized are put on a queue with a certain capacity, and for each RDF value in these queued statements the number of occurrences in the queue is determined. As the queue is emptied and each statement is serialized, all values that occur more than once in the queue are assigned a reference id. This is, in fact, the strategy employed by the Rio Writer.
162+
163+
It is also important to note that reference ids are not necessarily global over the entire document: ids are assigned on the basis of number of occurrences of a value in the current statement queue. If that number drops to zero, the reference id for that value can be ‘recycled’, that is, reassigned to another value. This ensures that we never run out of reference ids, even for very large datasets.
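
The buffered strategy and the recycling of reference ids can be illustrated with a simplified sketch. This is not the Rio writer's actual code: the class `ReferencePlanner`, its methods, and the event tuples it returns are all hypothetical, and statements are modelled as plain tuples of strings rather than encoded bytes. The sketch only decides which values get a declaration and which ids are reused.

```python
from collections import Counter, deque


class ReferencePlanner:
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.counts = Counter()  # occurrences of each value in the queue
        self.ids = {}            # value -> currently assigned reference id
        self.free_ids = []       # recycled ids, ready for reuse
        self.next_id = 0

    def add(self, statement):
        """Queue a statement; return events for a statement pushed out, if any."""
        self.queue.append(statement)
        self.counts.update(statement)
        return self._flush_one() if len(self.queue) > self.capacity else []

    def close(self):
        """Flush everything that is still queued."""
        events = []
        while self.queue:
            events.extend(self._flush_one())
        return events

    def _flush_one(self):
        statement = self.queue.popleft()
        events, fields = [], []
        for value in statement:
            # Declare a reference only for values seen more than once.
            if value not in self.ids and self.counts[value] > 1:
                if self.free_ids:
                    ref_id = self.free_ids.pop()
                else:
                    ref_id, self.next_id = self.next_id, self.next_id + 1
                self.ids[value] = ref_id
                events.append(("decl", ref_id, value))
            if value in self.ids:
                fields.append(("ref", self.ids[value]))
            else:
                fields.append(("value", value))
        events.append(("stmt", tuple(fields)))
        for value in statement:
            self.counts[value] -= 1
            if self.counts[value] == 0:
                # Value left the queue: its id can be recycled.
                del self.counts[value]
                recycled = self.ids.pop(value, None)
                if recycled is not None:
                    self.free_ids.append(recycled)
        return events


planner = ReferencePlanner(capacity=2)
events = []
for stmt in [("a", "p", "x"), ("a", "p", "y"), ("b", "q", "z"), ("b", "q", "w")]:
    events.extend(planner.add(stmt))
events.extend(planner.close())
```

In this run, `a` and `p` receive ids while they recur in the queue. Once they have left it, the same two ids are handed to `b` and `q`, so no third id is ever allocated.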
