Nessie Specification¶
This page documents the complete Nessie specification. This includes:
- API and its constraints
- Contract for value objects
API contract¶
The Nessie API is used by Nessie integrations, for example within Apache Iceberg, and by user-facing applications like Web UIs.
Nessie defines a REST API (OpenAPI) and implementations for Java and Python.
Content managed by Nessie¶
General Contract¶
Content Objects describe the state of a data lake object like a table or view. Nessie currently provides types for Iceberg tables and views. Nessie uses two identifiers for a single Content object:
- The Content Id is used to identify a content object across all branches even if the content object is being referred to using different table or view names.
- The Content Key is used to look up a content object by name, like a table name or view name. The Content Key changes when the associated table or view is renamed.
Content Key¶
The Content Key consists of multiple strings and is used to resolve a symbolic name, like a table name or a view name used in SQL statements, to a Content object.
When a table or view is renamed, for example using an SQL ALTER TABLE RENAME operation, Nessie records this as a Delete operation on the old key plus a Put operation on the new key (see below).
On Reference State vs Global State¶
Nessie is designed to support multiple table formats like Apache Iceberg. Different Nessie commits, for example on different branches, can refer to the same physical table but with a different state of the data and potentially a different schema. For this reason, some table formats require Nessie to refer to a single Global State.
IDs of the Iceberg snapshot, Iceberg schema, Iceberg partition spec, Iceberg sort order within the Iceberg table metadata are also stored per Nessie named reference (branch or tag), as the so-called on-reference-state.
Note
The term “all information in all Nessie commits” used above precisely means all information in all Nessie commits that are considered “live”, i.e. that have not been garbage-collected by Nessie. See also Management Services.
Content Id¶
All content objects must have an id field. This field is unique to the object and immutable once created. By convention, it is a UUID, though this is not enforced by this specification. There are several expectations on this field:
- Content Ids are immutable. Once created, the object will keep the same id for its entire lifetime.
- If the object is moved (e.g. stored under a different Key), it will keep the id.
- The same content object, i.e. the same content-id, can be referred to using different keys on different branches.
There is no API to look up an object by id, and the intention of an id is not to serve in that capacity. An example usage of the id field might be storing auxiliary data on an object in a local cache and using id to look up that auxiliary data.
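The auxiliary-data use case can be sketched as follows. This is a minimal illustration, not Nessie client code: the Content class, the aux_cache dictionary, the branch_state mapping and all values are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Content:
    """Hypothetical stand-in for a Nessie content object."""
    id: str                 # Content Id, immutable for the object's lifetime
    metadata_location: str  # illustrative payload, value is made up


# Auxiliary data kept by the client, keyed by Content Id (not by Content Key).
aux_cache: Dict[str, dict] = {}

table = Content(id="11111111-2222-3333-4444-555555555555",
                metadata_location="path/to/orders/metadata.json")
aux_cache[table.id] = {"owner": "analytics-team"}

# Simplified view of one branch: Content Key (tuple of strings) -> content.
branch_state: Dict[Tuple[str, ...], Content] = {("db", "orders"): table}

# The table is renamed: the Content Key changes, the Content Id does not.
branch_state[("db", "orders_v2")] = branch_state.pop(("db", "orders"))

# The auxiliary data is still reachable through the unchanged id.
renamed = branch_state[("db", "orders_v2")]
owner = aux_cache[renamed.id]["owner"]
```

Because the lookup goes through the id, the rename does not invalidate the locally stored auxiliary data.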
Note
A note about caching: The Content objects or the values of the referred information (e.g. schema, partitions etc.) might be cached locally by services using Nessie.
For content types that do not track Global State, the hash of the content object uniquely references an object in the Nessie history and is a suitable key to identify an object at a particular point in its history.
Evolution of the Global State is performed in a way that keeps old contents, as well as contents on different branches (and tags), available. This is the case for Apache Iceberg.
For content types that do track Global State, the Content Id must be included in the cache key.
For simplicity, it is recommended to always include the Content Id.
Since the Content object is immutable, the hash is stable. Since it is disconnected from Nessie’s version store properties, it exists across commits and branches and survives GC and other table maintenance operations.
The commit hash on the other hand makes a poor cache key because multiple commits can refer to the same state of a Content object, e.g. a merge or transplant will change the commit hash but not the state of the Content object.
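One way to build such a cache key is sketched below. The helper content_cache_key, the use of SHA-256 over canonical JSON, and the example values are assumptions made for illustration; Nessie does not prescribe this format, and the hash Nessie itself computes for a content object may differ.

```python
import hashlib
import json


def content_cache_key(content_id: str, content: dict) -> str:
    """Build a cache key from the Content Id plus a hash of the content state.

    Hypothetical helper, not part of any Nessie client.
    """
    # Canonical JSON so that equal states always hash to the same value.
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{content_id}:{digest}"


state = {"metadataLocation": "path/to/v3.metadata.json", "snapshotId": 42}

# The same content state seen through two different commits (e.g. before and
# after a merge) yields the same key, because no commit hash is involved.
key_before_merge = content_cache_key("content-id-1", state)
key_after_merge = content_cache_key("content-id-1", dict(state))

# A different content object with identical state gets a different key,
# because the Content Id is always part of the key.
key_other_object = content_cache_key("content-id-2", state)
```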
Content Types¶
Nessie is designed to support various table formats, and currently supports the following types. See also Tables & Views.
Iceberg Table¶
Apache Iceberg describes any table using the so-called table metadata (see Iceberg Table Spec). Each Iceberg operation that modifies data, for example an append or rewrite operation or more generally each Iceberg transaction, creates a new Iceberg snapshot. Any Nessie commit refers to a particular Iceberg snapshot for an Iceberg table, which translates to the state of an Iceberg table for a particular Nessie commit.
The Nessie IcebergTable object passed to Nessie in a Put operation therefore consists of
- the pointer to the Iceberg table metadata and
- the IDs of the Iceberg snapshot, Iceberg schema, Iceberg partition spec, and Iceberg sort order within the Iceberg table metadata (the so-called On Reference State).
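The two parts listed above can be pictured as a small value object. The following is an illustrative model only: the class and its field names are assumptions chosen for this sketch and are not taken from the Nessie client libraries.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class IcebergTable:
    """Illustrative model of an IcebergTable content object (names assumed)."""
    id: str                 # Content Id
    metadata_location: str  # pointer to the Iceberg table metadata
    snapshot_id: int        # on-reference state: Iceberg snapshot
    schema_id: int          # on-reference state: Iceberg schema
    spec_id: int            # on-reference state: Iceberg partition spec
    sort_order_id: int      # on-reference state: Iceberg sort order


on_main = IcebergTable(
    id="content-id-1",
    metadata_location="path/to/table/metadata.json",
    snapshot_id=100,
    schema_id=0,
    spec_id=0,
    sort_order_id=0,
)

# On another branch the same table (same Content Id) can point at a
# different snapshot and schema.
on_dev = replace(on_main, snapshot_id=101, schema_id=1)
```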
Note
This model puts a strong restriction on the Iceberg table. All metadata JSON documents must be stored and none of the built-in Iceberg maintenance procedures can be used. There are potentially serious issues regarding schema migrations in this model as well. Therefore, the Iceberg table spec should be considered subject to change in the near future.
Iceberg View¶
Note
Iceberg Views are experimental and subject to change!
The state of an Iceberg view is represented using the attributes versionId, schemaId, sqlText and dialect.
Iceberg views are handled similarly to Iceberg tables.
Operations in a Nessie commit¶
Each Nessie commit carries one or more operations. Each operation contains the Content Key and is either a Put, Delete or Unmodified operation.
A Content Key must only occur once in a Nessie commit.
Operations present in a commit are passed into Nessie as a list of operations.
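The constraints on the operation list can be sketched as a small validation routine. The Operation class, the string constants for the operation kinds and the function validate_operations are hypothetical names for this example; the actual Nessie API models operations differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ContentKey = Tuple[str, ...]  # a Content Key consists of multiple strings

VALID_KINDS = {"PUT", "DELETE", "UNMODIFIED"}


@dataclass(frozen=True)
class Operation:
    """Hypothetical model of one operation in a Nessie commit."""
    kind: str                       # "PUT", "DELETE" or "UNMODIFIED"
    key: ContentKey
    content: Optional[dict] = None  # only Put operations carry content


def validate_operations(operations: List[Operation]) -> None:
    """Raise ValueError if the list violates the rules described above."""
    if not operations:
        raise ValueError("a commit must carry at least one operation")
    seen = set()
    for op in operations:
        if op.kind not in VALID_KINDS:
            raise ValueError(f"unknown operation kind: {op.kind}")
        if op.key in seen:
            raise ValueError(f"content key {op.key} occurs more than once")
        seen.add(op.key)


operations = [
    Operation("PUT", ("db", "orders"), {"id": "content-id-1"}),
    Operation("DELETE", ("db", "old_orders")),
]
validate_operations(operations)  # passes: two distinct keys
```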
Mapping SQL DDL to Nessie commit operations¶
A CREATE TABLE is mapped to one Put operation.
An ALTER TABLE RENAME is mapped to a Delete operation using the Content Key of the table being renamed plus at least one Put operation using the Content Key of the table’s new name and the Content Id of the table being renamed.
A DROP TABLE is represented as a Nessie Delete operation (without a Put operation for the same Content Id).
A DROP TABLE + CREATE TABLE using the same table name (Content Key) in a single commit is mapped to one Put operation with a different Content Id.
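The mapping can be sketched with a few helper functions that return operation lists. All names here (Operation, create_table, rename_table, drop_table, drop_and_create_table) are hypothetical, and the Put operations carry only a Content Id instead of a full content object to keep the example short.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional, Tuple

ContentKey = Tuple[str, ...]


@dataclass(frozen=True)
class Operation:
    """Hypothetical, simplified operation: Put carries only the Content Id."""
    kind: str                         # "PUT" or "DELETE"
    key: ContentKey
    content_id: Optional[str] = None  # set for Put operations only


def create_table(key: ContentKey) -> List[Operation]:
    # CREATE TABLE: one Put with a new Content Id (a UUID by convention).
    return [Operation("PUT", key, str(uuid.uuid4()))]


def rename_table(old_key: ContentKey, new_key: ContentKey,
                 content_id: str) -> List[Operation]:
    # ALTER TABLE RENAME: Delete on the old key, Put on the new key,
    # keeping the Content Id of the renamed table.
    return [Operation("DELETE", old_key), Operation("PUT", new_key, content_id)]


def drop_table(key: ContentKey) -> List[Operation]:
    # DROP TABLE: a Delete without a Put for the same Content Id.
    return [Operation("DELETE", key)]


def drop_and_create_table(key: ContentKey) -> List[Operation]:
    # DROP TABLE + CREATE TABLE under the same key in one commit:
    # a single Put that carries a different (new) Content Id.
    return [Operation("PUT", key, str(uuid.uuid4()))]


created = create_table(("db", "orders"))
table_id = created[0].content_id
renamed = rename_table(("db", "orders"), ("db", "orders_v2"), table_id)
recreated = drop_and_create_table(("db", "orders_v2"))
```

Note that the rename and the drop-and-create cases both respect the rule that a Content Key occurs only once per commit.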
Put operation¶
A Put operation modifies the state of the included Content object. It must contain the Content object and, if the Put operation modifies an existing content object, also the expected contents. The expected contents attribute can be omitted if the Content object refers to a new Content Id, e.g. a newly created table or view. See also Conflict Resolution.
A Nessie Put operation is created for everything that modifies a table or a view, either its definition (think: SQL DDL) or data (think: SQL DML).
Delete operation¶
A Delete operation does not carry any Content object and is used to indicate that a Content object is no longer referenced using the Content Key of the Delete operation.
Unmodified operation¶
An Unmodified operation does not represent any change of the data, but can be included in a Nessie commit operation to enforce strict serializable transactions. The presence of an Unmodified operation means that the Content object referred to via the operation’s Content Key must not have been modified since the Nessie commit’s expectedHash.
The Unmodified operation is not persisted.
Version Store¶
See Commit Kernel for details.
Conflict Resolution¶
The API passes an expectedHash parameter with a Nessie commit operation. This is the commit that the client thinks is the most up to date (its HEAD). The Nessie backend checks whether the key has been modified since that expectedHash and, if so, rejects the requested modification with a NessieConflictException. This is basically an optimistic lock that accounts for the fact that the commit hash is global and the Nessie branch could have moved on from expectedHash without modifying the key in question.
A Nessie Put operation that updates an existing content object must pass the so-called expected state, which might be used to compare the currently recorded state of a content object with the expected state in the Put operation. If both values differ, Nessie will reject the operation with a NessieConflictException.
The reason for these conditions is to behave like a ‘real’ database. You shouldn’t have to update your reference before transacting on table A just because someone else happened to update table B whilst you were preparing your transaction.
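The per-key check against expectedHash can be sketched as follows. This is a simplified model under stated assumptions: the commit log is a plain list ordered oldest first, each entry records the keys its commit changed, and ConflictError stands in for NessieConflictException. The function names are hypothetical, and the expected-state comparison for Put operations is not modelled.

```python
from typing import List, Set, Tuple

ContentKey = Tuple[str, ...]
Commit = Tuple[str, Set[ContentKey]]  # (commit hash, keys changed by it)


class ConflictError(Exception):
    """Stand-in for NessieConflictException in this sketch."""


def keys_changed_since(log: List[Commit], expected_hash: str) -> Set[ContentKey]:
    """Collect keys changed by commits after expected_hash (log: oldest first)."""
    hashes = [commit_hash for commit_hash, _ in log]
    if expected_hash not in hashes:
        raise ValueError(f"unknown expected hash: {expected_hash}")
    changed: Set[ContentKey] = set()
    for _, keys in log[hashes.index(expected_hash) + 1:]:
        changed |= keys
    return changed


def check_commit(log: List[Commit], expected_hash: str,
                 commit_keys: Set[ContentKey]) -> None:
    """Reject the commit only if one of its own keys changed in the meantime.

    commit_keys holds the keys of all operations in the commit, including
    Unmodified operations.
    """
    conflicts = keys_changed_since(log, expected_hash) & commit_keys
    if conflicts:
        raise ConflictError(f"keys modified since {expected_hash}: {sorted(conflicts)}")


TABLE_A = ("db", "table_a")
TABLE_B = ("db", "table_b")

# The client read the branch at c1; meanwhile c2 updated table B.
log = [("c1", {TABLE_A}), ("c2", {TABLE_B})]

check_commit(log, "c1", {TABLE_A})  # accepted: table A is untouched since c1

try:
    check_commit(log, "c1", {TABLE_B})
    conflict_detected = False
except ConflictError:
    conflict_detected = True        # table B changed after c1
```

The branch head moved from c1 to c2, yet the commit on table A is accepted because only table B changed in between.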