Nessie Specification

This page documents the complete Nessie specification. This includes:

  • API and its constraints
  • Contract for value objects

API contract

The Nessie API is used by Nessie integrations, for example within Apache Iceberg, and by user-facing applications such as web UIs.

Nessie defines a REST API (OpenAPI) and implementations for Java and Python.

Content managed by Nessie

General Contract

Content objects describe the state of a data lake object like a table or view. Nessie currently provides types for Iceberg tables and views. Nessie uses two identifiers for a single Content object:

  1. The Content Id is used to identify a content object across all branches even if the content object is being referred to using different table or view names.
  2. The Content Key is used to look up a content object by name, like a table name or view name. The Content Key changes when the associated table or view is renamed.

Content Key

The Content Key consists of multiple strings and is used to resolve a symbolic name, like a table name or a view name used in SQL statements, to a Content object.

When a table or view is renamed, for example using an SQL ALTER TABLE RENAME operation, Nessie records this as a Delete operation on the old key plus a Put operation on the new key (see below).

On Reference State vs Global State

Nessie is designed to support multiple table formats like Apache Iceberg. Different Nessie commits, for example on different branches, can refer to the same physical table but with a different state of the data and potentially a different schema, so some table formats require Nessie to refer to a single Global State.

IDs of the Iceberg snapshot, Iceberg schema, Iceberg partition spec, Iceberg sort order within the Iceberg table metadata are also stored per Nessie named reference (branch or tag), as the so-called on-reference-state.

Note

The term "all information in all Nessie commits" precisely means all information in all Nessie commits that are considered "live", i.e. that have not been garbage-collected by Nessie. See also Management Services.

Content Id

All content objects must have an id field. This field is unique to the object and immutable once created. By convention, it is a UUID, though this is not enforced by this specification. There are several expectations on this field:

  1. Content Ids are immutable. Once created the object will keep the same id for its entire lifetime.
  2. If the object is moved (e.g. stored under a different Key) it will keep the id.
  3. The same content object, i.e. the same content-id, can be referred to using different keys on different branches.

There is no API to look up an object by id and the intention of an id is not to serve in that capacity. An example usage of the id field might be storing auxiliary data on an object in a local cache and using id to look up that auxiliary data.
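The auxiliary-data usage described above can be sketched as follows. This is a minimal illustration, not Nessie client code: the `branches` mapping, the `aux_cache` dictionary, the example id and the `auxiliary_data` helper are all hypothetical names introduced here.

```python
# Hypothetical view of two branches: Content Key -> Content Id.
# The same content object is reachable under different keys on each branch.
branches = {
    "main": {("db", "orders"): "4f1c-example-id"},
    "dev": {("db", "orders_v2"): "4f1c-example-id"},  # renamed on dev
}

# Local auxiliary data, keyed by Content Id rather than by Content Key.
aux_cache = {"4f1c-example-id": {"owner": "analytics-team"}}


def auxiliary_data(branch, key):
    """Resolve a Content Key on a branch, then look up auxiliary data by Content Id."""
    content_id = branches[branch].get(tuple(key))
    if content_id is None:
        return None
    return aux_cache.get(content_id)
```

Because the lookup goes through the Content Id, the auxiliary data stays attached to the table even though its name differs between the two branches.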

Note

A note about caching: The Content objects or the values of the referred information (e.g. schema, partitions etc.) might be cached locally by services using Nessie.

For content types that do not track Global State, the hash of the Content object uniquely references an object in the Nessie history and is a suitable key to identify an object at a particular point in its history.

Evolution of the Global State is performed in a way that keeps old contents, and contents on different branches (and tags), available. This is the case for Apache Iceberg.

For content types that do track Global State, the Content Id must be included in the cache key.

For simplicity, it is recommended to always include the Content Id.

Since the Content object is immutable, the hash is stable. Since the hash is disconnected from Nessie's version store properties, it exists across commits and branches and survives GC and other table maintenance operations.

The commit hash on the other hand makes a poor cache key because multiple commits can refer to the same state of a Content object, e.g. a merge or transplant will change the commit hash but not the state of the Content object.
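A minimal sketch of a cache key built along these lines is shown below. It assumes a hypothetical `IcebergTableContent` stand-in for a Content object and uses a SHA-256 digest over its JSON-serialized fields as the content hash; how Nessie itself hashes Content objects is not specified here.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IcebergTableContent:
    """Hypothetical stand-in for a Nessie Content object."""
    content_id: str
    metadata_location: str
    snapshot_id: int
    schema_id: int


def cache_key(content):
    """Combine the Content Id with a hash of the content state (assumed: SHA-256 of JSON)."""
    payload = json.dumps(asdict(content), sort_keys=True).encode("utf-8")
    return (content.content_id, hashlib.sha256(payload).hexdigest())


cache = {}
table = IcebergTableContent("c0ffee", "s3://bucket/meta/v1.json", 1, 0)
cache[cache_key(table)] = {"columns": ["id", "name"]}

# The same state seen through another commit (e.g. after a merge) hits the cache;
# a new snapshot of the same table does not.
same_state = IcebergTableContent("c0ffee", "s3://bucket/meta/v1.json", 1, 0)
new_state = IcebergTableContent("c0ffee", "s3://bucket/meta/v2.json", 2, 0)
```

The commit hash does not appear in the key, so a merge or transplant that leaves the content state unchanged does not invalidate the cached entry.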

Content Types

Nessie is designed to support various table formats, and currently supports the following types. See also Tables & Views.

Iceberg Table

Apache Iceberg describes any table using the so-called table metadata, see Iceberg Table Spec. Each Iceberg operation that modifies data, for example an append or rewrite operation, or more generally each Iceberg transaction, creates a new Iceberg snapshot. Any Nessie commit refers to a particular Iceberg snapshot for an Iceberg table, which translates to the state of an Iceberg table for a particular Nessie commit.

The Nessie IcebergTable object passed to Nessie in a Put operation therefore consists of

  1. the pointer to the Iceberg table metadata and
  2. the IDs of the Iceberg snapshot, Iceberg schema, Iceberg partition spec and Iceberg sort order within the Iceberg table metadata (the so-called on-reference-state).

Note

This model puts a strong restriction on the Iceberg table. All metadata JSON documents must be stored and none of the built-in Iceberg maintenance procedures can be used. There are potentially serious issues regarding schema migrations in this model as well. Therefore, the Iceberg table spec should be considered subject to change in the near future.

Iceberg View

Note

Iceberg Views are experimental and subject to change!

The state of an Iceberg view is represented using the attributes versionId, schemaId, sqlText and dialect.

Iceberg views are handled similarly to Iceberg tables.

Operations in a Nessie commit

Each Nessie commit carries one or more operations. Each operation contains the Content Key and is either a Put, Delete or Unmodified operation.

A Content Key must only occur once in a Nessie commit.

Operations present in a commit are passed into Nessie as a list of operations.
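The rule that a Content Key occurs only once per commit can be checked on the client side before the list of operations is sent. The sketch below assumes a simplified, hypothetical representation of operations as dictionaries with `type` and `key` entries; it does not mirror the actual REST payload.

```python
from collections import Counter


def duplicate_keys(operations):
    """Return the Content Keys that appear more than once in a list of operations."""
    counts = Counter(tuple(op["key"]) for op in operations)
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical operation list: the key ("db", "orders") is used twice, which is not allowed.
ops = [
    {"type": "PUT", "key": ["db", "orders"]},
    {"type": "DELETE", "key": ["db", "orders"]},
    {"type": "UNMODIFIED", "key": ["db", "customers"]},
]
```

An empty result from `duplicate_keys` means the list satisfies the uniqueness rule.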

Mapping SQL DDL to Nessie commit operations

A CREATE TABLE is mapped to one Put operation.

An ALTER TABLE RENAME is mapped to a Delete operation using the Content Key for the table being renamed plus at least one Put operation using the Content Key of the table’s new name, using the Content Id of the table being renamed.

A DROP TABLE is represented as a Nessie Delete operation (without a Put operation for the same Content Id).

A DROP TABLE + CREATE TABLE using the same table name (Content Key) in a single commit are mapped to one Put operation with a different Content Id.
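The mappings above can be sketched as small helper functions. The `Put` and `Delete` classes and the helper names are hypothetical and carry only the fields needed for the illustration; a real Put operation also carries the Content object itself.

```python
import uuid
from dataclasses import dataclass
from typing import Tuple

Key = Tuple[str, ...]


@dataclass(frozen=True)
class Put:
    key: Key
    content_id: str  # simplified: a real Put carries the full Content object


@dataclass(frozen=True)
class Delete:
    key: Key


def create_table(key):
    """CREATE TABLE: one Put with a new Content Id (a UUID by convention)."""
    return [Put(key, str(uuid.uuid4()))]


def rename_table(old_key, new_key, content_id):
    """ALTER TABLE RENAME: Delete on the old key, Put on the new key, same Content Id."""
    return [Delete(old_key), Put(new_key, content_id)]


def drop_table(key):
    """DROP TABLE: a Delete without a Put for the same Content Id."""
    return [Delete(key)]


def drop_and_recreate(key):
    """DROP TABLE + CREATE TABLE under the same name: one Put with a different Content Id."""
    return [Put(key, str(uuid.uuid4()))]
```

Note that the rename keeps the Content Id, while drop-and-recreate deliberately produces a new one, so the two cases remain distinguishable even though both end with a Put.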

Put operation

A Put operation modifies the state of the included Content object. It must contain the Content object and, if the Put operation modifies an existing content object, also the expected contents. The expected contents attribute can be omitted if the Content object refers to a new Content Id, e.g. a newly created table or view. See also Conflict Resolution.

A Nessie Put operation is created for everything that modifies a table or a view, either its definition (think: SQL DDL) or data (think: SQL DML).

Delete operation

A Delete operation does not carry any Content object and is used to indicate that a Content object is no longer referenced using the Content Key of the Delete operation.

Unmodified operation

An Unmodified operation does not represent any change of the data, but can be included in a Nessie commit operation to enforce strict serializable transactions. The presence of an Unmodified operation means that the Content object referred to via the operation’s Content Key must not have been modified since the Nessie commit’s expectedHash.

The Unmodified operation is not persisted.

Version Store

See Commit Kernel for details.

Conflict Resolution

The API passes an expectedHash parameter with a Nessie commit operation. This is the commit that the client thinks is the most up to date (its HEAD). The Nessie backend checks whether the key has been modified since that expectedHash and, if so, rejects the requested modification with a NessieConflictException. This is basically an optimistic lock that accounts for the fact that the commit hash is global and the Nessie branch could have moved on from expectedHash without modifying the key in question.

A Nessie Put operation that updates an existing content object must pass the so-called expected state, which might be used to compare the currently recorded state of a content object with the expected state given in the Put operation. If the two differ, Nessie will reject the operation with a NessieConflictException.

The reason for these conditions is to behave like a ‘real’ database. You shouldn’t have to update your reference before transacting on table A because it just happened to update table B whilst you were preparing your transaction.
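The per-key check against expectedHash can be illustrated with a toy in-memory branch. This is a sketch under simplifying assumptions, not Nessie's version store: `Branch`, `ConflictError` (a stand-in for NessieConflictException), the `h0`, `h1`, ... hashes and the tuple form of operations are all hypothetical, and the expected-state comparison for Put operations is left out.

```python
class ConflictError(Exception):
    """Hypothetical stand-in for NessieConflictException."""


class Branch:
    """Toy in-memory branch; not Nessie's actual version store."""

    def __init__(self):
        # Each log entry: (commit hash, set of Content Keys changed by that commit).
        self.log = [("h0", frozenset())]
        self.state = {}

    def head(self):
        return self.log[-1][0]

    def commit(self, expected_hash, operations):
        hashes = [h for h, _ in self.log]
        if expected_hash not in hashes:
            raise ValueError(f"unknown expectedHash {expected_hash}")
        # Keys touched by commits made after expected_hash.
        later = self.log[hashes.index(expected_hash) + 1:]
        changed_since = set().union(*(keys for _, keys in later))
        for _, key, _ in operations:
            if key in changed_since:
                raise ConflictError(f"{key} was modified since {expected_hash}")
        written = set()
        for kind, key, value in operations:
            if kind == "PUT":
                self.state[key] = value
                written.add(key)
            elif kind == "DELETE":
                self.state.pop(key, None)
                written.add(key)
            # UNMODIFIED is only checked above and never persisted.
        new_hash = f"h{len(self.log)}"
        self.log.append((new_hash, frozenset(written)))
        return new_hash


branch = Branch()
base = branch.head()
branch.commit(base, [("PUT", "table_b", "b1")])           # another client updates table B
second = branch.commit(base, [("PUT", "table_a", "a1")])  # accepted: table A untouched since base

try:
    # Rejected: the Unmodified operation asserts table A has not changed since base.
    branch.commit(base, [("PUT", "table_c", "c1"), ("UNMODIFIED", "table_a", None)])
    rejected = False
except ConflictError:
    rejected = True
```

The second commit succeeds even though the branch has moved on from `base`, because only table B changed in between. The third is rejected as a whole, so table C is not written either.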