Nessie GC

Nessie GC is a tool to clean up orphaned files in a Nessie repository. It is designed to be run periodically to keep the repository clean and to avoid unnecessary storage costs.

Requirements

The Nessie GC tool is distributed as an uber-jar and requires Java 11 or later to be available on the host where it is running.

It is also available as a Docker image, see below for more information.

The Nessie GC tool requires a running Nessie server and a JDBC-compliant database, both reachable from the host where the GC tool runs. The database stores the live content sets and the deferred deletes.

Nessie GC has built-in support for PostgreSQL, MariaDB, MySQL (using the MariaDB driver), and H2 databases.

Note

Although the GC tool can run in in-memory mode, a persistent database is recommended for production use. Any JDBC-compliant database can be used, but the database must be created and its schema initialized before running the Nessie GC tool.

Running locally

The Nessie GC tool can be downloaded from the GitHub Releases page, for example:

curl -L -o nessie-gc.jar https://github.com/projectnessie/nessie/releases/download/nessie-::NESSIE_VERSION::/nessie-gc-::NESSIE_VERSION::.jar

To see the available commands and options, run:

java -jar nessie-gc.jar --help

You should see the following output:

Usage: nessie-gc.jar [-hV] [COMMAND]
  -h, --help      Show this help message and exit.
  -V, --version   Print version information and exit.
Commands:
  help                           Display help information about the specified command.
  mark-live, identify, mark      Run identify-live-content phase of Nessie GC, must not be used
                                   with the in-memory contents-storage.
  sweep, expire                  Run expire-files + delete-orphan-files phase of Nessie GC using a
                                   live-contents-set stored by a previous run of the mark-live
                                   command, must not be used with the in-memory contents-storage.
  gc                             Run identify-live-content and expire-files + delete-orphan-files.
  list                           List existing live-sets, must not be used with the in-memory
                                   contents-storage.
  delete                         Delete a live-set, must not be used with the in-memory
                                   contents-storage.
  list-deferred                  List files collected as deferred deletes, must not be used with
                                   the in-memory contents-storage.
  deferred-deletes               Delete files collected as deferred deletes, must not be used with
                                   the in-memory contents-storage.
  show                           Show information of a live-content-set, must not be used with the
                                   in-memory contents-storage.
  show-sql-create-schema-script  Print DDL statements to create the schema.
  create-sql-schema              JDBC schema creation.
  completion-script              Extracts the command-line completion script.
  show-licenses                  Show 3rd party license information.

Info

Help for all Nessie GC tool commands is available further down this page.

The following example assumes that you have a Nessie server running at http://localhost:19120 and a PostgreSQL instance running at jdbc:postgresql://localhost:5432/nessie_gc with user pguser and password mysecretpassword.

Create the database schema if required:

java -jar nessie-gc.jar create-sql-schema \
  --jdbc-url jdbc:postgresql://localhost:5432/nessie_gc \
  --jdbc-user pguser \
  --jdbc-password mysecretpassword
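
When the schema has to be reviewed or applied by a database administrator instead of by the GC tool itself, the show-sql-create-schema-script command from the list above prints the DDL. The following is a minimal sketch, assuming the command writes the DDL to standard output when called without further options and that the psql client is installed. The function name and the file name nessie-gc-schema.sql are made up for this example.

```shell
# Hypothetical helper: export the DDL to a file, then apply it with psql.
# Assumption: the DDL is printed to standard output; psql is on the PATH.
export_and_apply_schema() {
  # Write the DDL to a file so it can be reviewed before it is applied.
  java -jar nessie-gc.jar show-sql-create-schema-script > nessie-gc-schema.sql || return 1

  # Apply the script to the nessie_gc database used throughout this page.
  PGPASSWORD=mysecretpassword psql \
    --host localhost --port 5432 \
    --username pguser --dbname nessie_gc \
    --file nessie-gc-schema.sql
}
```

Call export_and_apply_schema once before the first GC run. For databases other than PostgreSQL, replace the psql call with the matching client.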

Now we can run the Nessie GC tool:

java -jar nessie-gc.jar gc \
  --uri http://localhost:19120/api/v2 \
  --jdbc \
  --jdbc-url jdbc:postgresql://localhost:5432/nessie_gc \
  --jdbc-user pguser \
  --jdbc-password mysecretpassword
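
The GC tool typically also needs credentials for the object store that holds the table files. The sketch below passes them through the -I/--iceberg properties listed in the gc help on this page, for an S3-compatible store such as MinIO. The endpoint http://localhost:9000, the function name, and the environment variable names S3_ACCESS_KEY and S3_SECRET_KEY are placeholders, not values from this guide.

```shell
# Hypothetical helper: gc run against an S3-compatible object store.
# Assumption: S3_ACCESS_KEY and S3_SECRET_KEY are set in the environment.
run_gc_with_s3() {
  java -jar nessie-gc.jar gc \
    --uri http://localhost:19120/api/v2 \
    --jdbc \
    --jdbc-url jdbc:postgresql://localhost:5432/nessie_gc \
    --jdbc-user pguser \
    --jdbc-password mysecretpassword \
    --iceberg "s3.access-key-id=${S3_ACCESS_KEY}" \
    --iceberg "s3.secret-access-key=${S3_SECRET_KEY}" \
    --iceberg "s3.endpoint=http://localhost:9000"
}
```

The s3.endpoint property is only needed for S3-compatible stores. Against AWS S3 itself it can usually be left out.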

Running with Docker

The tool is also available as a Docker image, hosted on GitHub Container Registry. Images are also mirrored to Quay.io.

See Docker for more information.

For testing purposes, let’s create a JDBC datastore as follows:

docker run --rm -e POSTGRES_USER=pguser -e POSTGRES_PASSWORD=mysecretpassword -e POSTGRES_DB=nessie_gc -p 5432:5432 postgres:16.2

Create the database schema if required:

docker run --rm ghcr.io/projectnessie/nessie-gc:::NESSIE_VERSION:: create-sql-schema \
  --jdbc-url jdbc:postgresql://127.0.0.1:5432/nessie_gc \
  --jdbc-user pguser \
  --jdbc-password mysecretpassword

Now we can run the Nessie GC tool:

docker run --rm ghcr.io/projectnessie/nessie-gc:::NESSIE_VERSION:: gc \
  --jdbc-url jdbc:postgresql://127.0.0.1:5432/nessie_gc \
  --jdbc-user pguser \
  --jdbc-password mysecretpassword

The GC tool has many options, which can be listed by running docker run --rm ghcr.io/projectnessie/nessie-gc:::NESSIE_VERSION:: --help. The main command is gc; check its options by running docker run --rm ghcr.io/projectnessie/nessie-gc:::NESSIE_VERSION:: gc --help.
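
Inside a container, 127.0.0.1 is the container's own loopback interface. When PostgreSQL and Nessie listen on ports of the Docker host, one option is to let the GC container share the host network. The sketch below assumes a Linux host, where Docker's --network host option has this effect, and a Nessie server on port 19120 of the host. The function name is made up for this example.

```shell
# Hypothetical helper: run gc in a container that shares the host network,
# so that 127.0.0.1 reaches services listening on the Docker host (Linux).
run_gc_in_docker() {
  docker run --rm --network host \
    ghcr.io/projectnessie/nessie-gc:::NESSIE_VERSION:: gc \
    --uri http://127.0.0.1:19120/api/v2 \
    --jdbc \
    --jdbc-url jdbc:postgresql://127.0.0.1:5432/nessie_gc \
    --jdbc-user pguser \
    --jdbc-password mysecretpassword
}
```

An alternative is to attach all containers to a user-defined Docker network and address PostgreSQL and Nessie by container name.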

Running with Kubernetes

The Nessie GC tool can be executed as a Job or a CronJob in a Kubernetes cluster.

The following example assumes that you have a Nessie deployment and a PostgreSQL instance, all running in the same cluster and in the same namespace.

Create a secret for the database credentials:

kubectl create secret generic nessie-gc-credentials \
  --from-literal=JDBC_URL=jdbc:postgresql://postgresql:5432/nessie_gc \
  --from-literal=JDBC_USER=pguser \
  --from-literal=JDBC_PASSWORD=mysecretpassword

Assuming that the Nessie service is reachable at nessie:19120, create the following Kubernetes job to run the GC tool:

kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: nessie-gc-job
spec:
  template:
    spec:
      containers:
      - name: nessie-gc
        image: ghcr.io/projectnessie/nessie-gc:::NESSIE_VERSION::
        args: 
          - gc
          - --uri
          - http://nessie:19120/api/v2
          - --jdbc
          - --jdbc-url
          - "\$(JDBC_URL)"
          - --jdbc-user
          - "\$(JDBC_USER)"
          - --jdbc-password
          - "\$(JDBC_PASSWORD)"
        envFrom:
        - secretRef:
            name: nessie-gc-credentials
      restartPolicy: Never
EOF
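
For periodic runs, the same container spec can be wrapped in a CronJob. The sketch below writes such a manifest to a file. The schedule (daily at 03:00), the concurrencyPolicy setting, and the names nessie-gc-cronjob and nessie-gc-cronjob.yaml are assumptions for this example.

```shell
# Write a CronJob manifest that reuses the container spec of the Job above.
# The quoted 'EOF' delimiter keeps the shell from expanding $(JDBC_URL) etc.;
# Kubernetes substitutes these from the secret when the container starts.
cat > nessie-gc-cronjob.yaml <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nessie-gc-cronjob
spec:
  schedule: "0 3 * * *"        # assumption: once a day at 03:00
  concurrencyPolicy: Forbid    # assumption: never start overlapping GC runs
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: nessie-gc
            image: ghcr.io/projectnessie/nessie-gc:::NESSIE_VERSION::
            args:
              - gc
              - --uri
              - http://nessie:19120/api/v2
              - --jdbc
              - --jdbc-url
              - "$(JDBC_URL)"
              - --jdbc-user
              - "$(JDBC_USER)"
              - --jdbc-password
              - "$(JDBC_PASSWORD)"
            envFrom:
            - secretRef:
                name: nessie-gc-credentials
          restartPolicy: Never
EOF
```

Apply the manifest with kubectl apply -f nessie-gc-cronjob.yaml. It reads the database credentials from the nessie-gc-credentials secret created earlier.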

Nessie GC Tool commands

Usage: nessie-gc.jar [-hV] [COMMAND]
  -h, --help      Show this help message and exit.
  -V, --version   Print version information and exit.
Commands:
  help                           Display help information about the specified command.
  mark-live, identify, mark      Run identify-live-content phase of Nessie GC, must not be used
                                   with the in-memory contents-storage.
  sweep, expire                  Run expire-files + delete-orphan-files phase of Nessie GC using a
                                   live-contents-set stored by a previous run of the mark-live
                                   command, must not be used with the in-memory contents-storage.
  gc                             Run identify-live-content and expire-files + delete-orphan-files.
  list                           List existing live-sets, must not be used with the in-memory
                                   contents-storage.
  delete                         Delete a live-set, must not be used with the in-memory
                                   contents-storage.
  list-deferred                  List files collected as deferred deletes, must not be used with
                                   the in-memory contents-storage.
  deferred-deletes               Delete files collected as deferred deletes, must not be used with
                                   the in-memory contents-storage.
  show                           Show information of a live-content-set, must not be used with the
                                   in-memory contents-storage.
  show-sql-create-schema-script  Print DDL statements to create the schema.
  create-sql-schema              JDBC schema creation.
  completion-script              Extracts the command-line completion script.
  show-licenses                  Show 3rd party license information.

Below is the output of the Nessie GC tool help for all commands.

mark-live, identify, mark

Usage: nessie-gc.jar mark-live [-hV] [-c=<defaultCutoffPolicy>]
                               [--identify-parallelism=<parallelism>] [--nessie-api=<nessieApi>]
                               [--nessie-client=<nessieClientName>] [-R=<cutoffPolicyRefTime>]
                               [--time-zone=<zoneId>] [-u=<nessieUri>]
                               [--write-live-set-id-to=<liveSetIdFile>] [-C[=<String=String>[,
                               <String=String>...]...]]... [-o[=<String=String>[,
                               <String=String>...]...]]... ([--inmemory] | [[--jdbc]
                               --jdbc-url=<url> [--jdbc-properties[=<String=String>[,
                               <String=String>...]...]]... [--jdbc-user=<user>]
                               [--jdbc-password=<password>] [--jdbc-schema=<schemaCreateStrategy>]])
Run identify-live-content phase of Nessie GC, must not be used with the in-memory contents-storage.
  -c, --default-cutoff=<defaultCutoffPolicy>
                             Default cutoff policy. Policies can be one of:
                             - number of commits as an integer value
                             - a duration (see java.time.Duration)
                             - an ISO instant
                             - 'NONE', means everything's considered as live
  -C, --cutoff[=<String=String>[,<String=String>...]...]
                             Cutoff policies per reference names. Supplied as a
                               ref-name-pattern=policy tuple.
                             Reference name patterns are regular expressions.
                             Policies can be one of:
                             - number of commits as an integer value
                             - a duration (see java.time.Duration)
                             - an ISO instant
                             - 'NONE', means everything's considered as live
  -h, --help                 Show this help message and exit.
      --identify-parallelism=<parallelism>
                             Number of Nessie references that can be walked in parallel.
      --inmemory             Flag whether to use the in-memory contents storage. Prefer a JDBC
                               storage.
      --jdbc                 Flag whether to use the JDBC contents storage.
      --jdbc-password=<password>
                             JDBC password used to authenticate the database access.
      --jdbc-properties[=<String=String>[,<String=String>...]...]
                             JDBC parameters.
      --jdbc-schema=<schemaCreateStrategy>
                             How to create the database schema. Possible values: CREATE,
                               DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
      --jdbc-url=<url>       JDBC URL of the database to connect to.
      --jdbc-user=<user>     JDBC user name used to authenticate the database access.
      --nessie-api=<nessieApi>
                             Class name of the NessieClientBuilder implementation to use, defaults
                               to HttpClientBuilder suitable for REST. Using this parameter is not
                               recommended. Prefer the --nessie-client parameter instead.
      --nessie-client=<nessieClientName>
                             Name of the Nessie client to use, defaults to HTTP suitable for REST.
  -o, --nessie-option[=<String=String>[,<String=String>...]...]
                             Parameters to configure the NessieClientBuilder.
  -R, --cutoff-ref-time=<cutoffPolicyRefTime>
                             Reference timestamp for durations specified for --cutoff. Defaults to
                               'now'.
      --time-zone=<zoneId>   Time zone ID used to show timestamps.
                             Defaults to system time zone.
  -u, --uri=<nessieUri>      Nessie API endpoint URI, defaults to http://localhost:19120/api/v2.
  -V, --version              Print version information and exit.
      --write-live-set-id-to=<liveSetIdFile>
                             Optional, the file name to persist the created live-set-id to.
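
The cutoff options can be combined: --default-cutoff sets the policy for all references, and each --cutoff entry sets a policy for the references whose names match its pattern. The sketch below assumes that durations are written in the ISO-8601 form parsed by java.time.Duration (P30D stands for 30 days). The reference name patterns, the chosen policies, and the function name are examples only.

```shell
# Hypothetical helper: mark-live with a default and two per-reference policies.
#   P30D            default policy, a duration of 30 days
#   main=NONE       everything on "main" is considered live
#   feature/.*=10   policy of 10 commits for references matching feature/.*
mark_live_with_cutoffs() {
  java -jar nessie-gc.jar mark-live \
    --uri http://localhost:19120/api/v2 \
    --jdbc \
    --jdbc-url jdbc:postgresql://localhost:5432/nessie_gc \
    --jdbc-user pguser \
    --jdbc-password mysecretpassword \
    --default-cutoff P30D \
    --cutoff 'main=NONE' \
    --cutoff 'feature/.*=10'
}
```

The single quotes keep the shell from interpreting the regular expression characters in the reference name patterns.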

sweep, expire

Usage: nessie-gc.jar sweep [-hV] [--[no-]defer-deletes]
                           [--allowed-fpp=<allowedFalsePositiveProbability>]
                           [--expected-file-count=<expectedFileCount>]
                           [--expiry-parallelism=<parallelism>] [--fpp=<falsePositiveProbability>]
                           [--max-file-modification=<maxFileModificationTime>]
                           [--time-zone=<zoneId>] [-H=<String=String>[,<String=String>...]]...
                           [-I=<String=String>[,<String=String>...]]... ([--inmemory] | [[--jdbc]
                           --jdbc-url=<url> [--jdbc-properties[=<String=String>[,
                           <String=String>...]...]]... [--jdbc-user=<user>]
                           [--jdbc-password=<password>] [--jdbc-schema=<schemaCreateStrategy>]])
                           (-l=<liveSetId> | -L=<liveSetIdFile>)
Run expire-files + delete-orphan-files phase of Nessie GC using a live-contents-set stored by a
previous run of the mark-live command, must not be used with the in-memory contents-storage.
      --allowed-fpp=<allowedFalsePositiveProbability>
                             The worst allowed effective false-positive-probability checked after
                               the files for a single content have been checked, defaults to 1.0E-4.
      --[no-]defer-deletes   Identified unused/orphan files are by default immediately deleted.
                               Using deferred deletion stores the files to be deleted, so they can
                               be inspected and deleted later. This option is incompatible with
                               --inmemory.
      --expected-file-count=<expectedFileCount>
                             The total number of expected live files for a single content, defaults
                               to 1000000.
      --expiry-parallelism=<parallelism>
                             Number of contents that are checked in parallel.
      --fpp=<falsePositiveProbability>
                             The false-positive-probability used to construct the bloom-filter
                               identifying whether a file is live, defaults to 1.0E-5.
  -h, --help                 Show this help message and exit.
  -H, --hadoop=<String=String>[,<String=String>...]
                             Hadoop configuration option, required when using an Iceberg FileIO
                               that is not S3FileIO.
                             The following configuration settings might be required.

                             For S3:
                             - fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
                             - fs.s3a.access.key
                             - fs.s3a.secret.key
                             - fs.s3a.endpoint, if you use an S3 compatible object store like MinIO

                             For GCS:
                             - fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
                             - fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.
                               GoogleHadoopFS
                             - fs.gs.project.id
                             - fs.gs.auth.type=USER_CREDENTIALS
                             - fs.gs.auth.client.id
                             - fs.gs.auth.client.secret
                             - fs.gs.auth.refresh.token

                             For ADLS:
                             - fs.azure.impl=org.apache.hadoop.fs.azure.AzureNativeFileSystemStore
                             - fs.AbstractFileSystem.azure.impl=org.apache.hadoop.fs.azurebfs.Abfs
                             - fs.azure.storage.emulator.account.name
                             - fs.azure.account.auth.type=SharedKey
                             - fs.azure.account.key.<account>=<base-64-encoded-secret>
  -I, --iceberg=<String=String>[,<String=String>...]
                             Iceberg properties used to configure the FileIO.
                             The following properties are almost always required.

                             For S3:
                             - s3.access-key-id
                             - s3.secret-access-key
                             - s3.endpoint, if you use an S3 compatible object store like MinIO

                             For GCS:
                             - io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO
                             - gcs.project-id
                             - gcs.oauth2.token

                             For ADLS:
                             - io-impl=org.apache.iceberg.azure.adlsv2.ADLSFileIO
                             - adls.auth.shared-key.account.name
                             - adls.auth.shared-key.account.key
      --inmemory             Flag whether to use the in-memory contents storage. Prefer a JDBC
                               storage.
      --jdbc                 Flag whether to use the JDBC contents storage.
      --jdbc-password=<password>
                             JDBC password used to authenticate the database access.
      --jdbc-properties[=<String=String>[,<String=String>...]...]
                             JDBC parameters.
      --jdbc-schema=<schemaCreateStrategy>
                             How to create the database schema. Possible values: CREATE,
                               DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
      --jdbc-url=<url>       JDBC URL of the database to connect to.
      --jdbc-user=<user>     JDBC user name used to authenticate the database access.
  -l, --live-set-id=<liveSetId>
                             ID of the live content set.
  -L, --read-live-set-id-from=<liveSetIdFile>
                             The file to read the live-set-id from.
      --max-file-modification=<maxFileModificationTime>
                             The maximum allowed file modification time. Files newer than this
                               timestamp will not be deleted. Defaults to the created timestamp of
                               the live-content-set.
      --time-zone=<zoneId>   Time zone ID used to show timestamps.
                             Defaults to system time zone.
  -V, --version              Print version information and exit.
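
The mark-live and sweep commands can be chained through a file that carries the live-set ID, using the --write-live-set-id-to and --read-live-set-id-from options shown in their help output. The following is a minimal sketch that reuses the PostgreSQL example from earlier on this page. The file name live-set-id.txt and the function name are made up for this example.

```shell
# Hypothetical helper: run the two GC phases as separate commands and hand
# the live-set ID from mark-live to sweep through a file.
run_gc_in_two_phases() {
  # Phase 1: identify live content and store the ID of the new live set.
  java -jar nessie-gc.jar mark-live \
    --uri http://localhost:19120/api/v2 \
    --jdbc \
    --jdbc-url jdbc:postgresql://localhost:5432/nessie_gc \
    --jdbc-user pguser \
    --jdbc-password mysecretpassword \
    --write-live-set-id-to live-set-id.txt || return 1

  # Phase 2: expire files and delete orphans, based on the stored live set.
  java -jar nessie-gc.jar sweep \
    --jdbc \
    --jdbc-url jdbc:postgresql://localhost:5432/nessie_gc \
    --jdbc-user pguser \
    --jdbc-password mysecretpassword \
    --read-live-set-id-from live-set-id.txt
}
```

Both phases must point at the same JDBC database, because sweep reads the live set that mark-live stored there. Splitting the phases allows the live set to be inspected, for example with the show command, before any file is deleted.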

gc

Usage: nessie-gc.jar gc [-hV] [--[no-]defer-deletes]
                        [--allowed-fpp=<allowedFalsePositiveProbability>]
                        [-c=<defaultCutoffPolicy>] [--expected-file-count=<expectedFileCount>]
                        [--expiry-parallelism=<parallelism>] [--fpp=<falsePositiveProbability>]
                        [--identify-parallelism=<parallelism>]
                        [--max-file-modification=<maxFileModificationTime>]
                        [--nessie-api=<nessieApi>] [--nessie-client=<nessieClientName>]
                        [-R=<cutoffPolicyRefTime>] [--time-zone=<zoneId>] [-u=<nessieUri>]
                        [--write-live-set-id-to=<liveSetIdFile>] [-H=<String=String>[,
                        <String=String>...]]... [-I=<String=String>[,<String=String>...]]... [-C
                        [=<String=String>[,<String=String>...]...]]... [-o[=<String=String>[,
                        <String=String>...]...]]... ([--inmemory] | [[--jdbc] --jdbc-url=<url>
                        [--jdbc-properties[=<String=String>[,<String=String>...]...]]...
                        [--jdbc-user=<user>] [--jdbc-password=<password>]
                        [--jdbc-schema=<schemaCreateStrategy>]])
Run identify-live-content and expire-files + delete-orphan-files.
This is the same as running a 'mark-live' + a 'sweep' command, but this variant works with the
in-memory contents storage.
      --allowed-fpp=<allowedFalsePositiveProbability>
                             The worst allowed effective false-positive-probability checked after
                               the files for a single content have been checked, defaults to 1.0E-4.
  -c, --default-cutoff=<defaultCutoffPolicy>
                             Default cutoff policy. Policies can be one of:
                             - number of commits as an integer value
                             - a duration (see java.time.Duration)
                             - an ISO instant
                             - 'NONE', means everything's considered as live
  -C, --cutoff[=<String=String>[,<String=String>...]...]
                             Cutoff policies per reference names. Supplied as a
                               ref-name-pattern=policy tuple.
                             Reference name patterns are regular expressions.
                             Policies can be one of:
                             - number of commits as an integer value
                             - a duration (see java.time.Duration)
                             - an ISO instant
                             - 'NONE', means everything's considered as live
      --[no-]defer-deletes   Identified unused/orphan files are by default immediately deleted.
                               Using deferred deletion stores the files to be deleted, so they can
                               be inspected and deleted later. This option is incompatible with
                               --inmemory.
      --expected-file-count=<expectedFileCount>
                             The total number of expected live files for a single content, defaults
                               to 1000000.
      --expiry-parallelism=<parallelism>
                             Number of contents that are checked in parallel.
      --fpp=<falsePositiveProbability>
                             The false-positive-probability used to construct the bloom-filter
                               identifying whether a file is live, defaults to 1.0E-5.
  -h, --help                 Show this help message and exit.
  -H, --hadoop=<String=String>[,<String=String>...]
                             Hadoop configuration option, required when using an Iceberg FileIO
                               that is not S3FileIO.
                             The following configuration settings might be required.

                             For S3:
                             - fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
                             - fs.s3a.access.key
                             - fs.s3a.secret.key
                             - fs.s3a.endpoint, if you use an S3 compatible object store like MinIO

                             For GCS:
                             - fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
                             - fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.
                               GoogleHadoopFS
                             - fs.gs.project.id
                             - fs.gs.auth.type=USER_CREDENTIALS
                             - fs.gs.auth.client.id
                             - fs.gs.auth.client.secret
                             - fs.gs.auth.refresh.token

                             For ADLS:
                             - fs.azure.impl=org.apache.hadoop.fs.azure.AzureNativeFileSystemStore
                             - fs.AbstractFileSystem.azure.impl=org.apache.hadoop.fs.azurebfs.Abfs
                             - fs.azure.storage.emulator.account.name
                             - fs.azure.account.auth.type=SharedKey
                             - fs.azure.account.key.<account>=<base-64-encoded-secret>
  -I, --iceberg=<String=String>[,<String=String>...]
                             Iceberg properties used to configure the FileIO.
                             The following properties are almost always required.

                             For S3:
                             - s3.access-key-id
                             - s3.secret-access-key
                             - s3.endpoint, if you use an S3 compatible object store like MinIO

                             For GCS:
                             - io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO
                             - gcs.project-id
                             - gcs.oauth2.token

                             For ADLS:
                             - io-impl=org.apache.iceberg.azure.adlsv2.ADLSFileIO
                             - adls.auth.shared-key.account.name
                             - adls.auth.shared-key.account.key
      --identify-parallelism=<parallelism>
                             Number of Nessie references that can be walked in parallel.
      --inmemory             Flag whether to use the in-memory contents storage. Prefer a JDBC
                               storage.
      --jdbc                 Flag whether to use the JDBC contents storage.
      --jdbc-password=<password>
                             JDBC password used to authenticate the database access.
      --jdbc-properties[=<String=String>[,<String=String>...]...]
                             JDBC parameters.
      --jdbc-schema=<schemaCreateStrategy>
                             How to create the database schema. Possible values: CREATE,
                               DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
      --jdbc-url=<url>       JDBC URL of the database to connect to.
      --jdbc-user=<user>     JDBC user name used to authenticate the database access.
      --max-file-modification=<maxFileModificationTime>
                             The maximum allowed file modification time. Files newer than this
                               timestamp will not be deleted. Defaults to the created timestamp of
                               the live-content-set.
      --nessie-api=<nessieApi>
                             Class name of the NessieClientBuilder implementation to use, defaults
                               to HttpClientBuilder suitable for REST. Using this parameter is not
                               recommended. Prefer the --nessie-client parameter instead.
      --nessie-client=<nessieClientName>
                             Name of the Nessie client to use, defaults to HTTP suitable for REST.
  -o, --nessie-option[=<String=String>[,<String=String>...]...]
                             Parameters to configure the NessieClientBuilder.
  -R, --cutoff-ref-time=<cutoffPolicyRefTime>
                             Reference timestamp for durations specified for --cutoff. Defaults to
                               'now'.
      --time-zone=<zoneId>   Time zone ID used to show timestamps.
                             Defaults to system time zone.
  -u, --uri=<nessieUri>      Nessie API endpoint URI, defaults to http://localhost:19120/api/v2.
  -V, --version              Print version information and exit.
      --write-live-set-id-to=<liveSetIdFile>
                             Optional, the file name to persist the created live-set-id to.
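
With --defer-deletes, a gc run only records the files it would delete, which leaves room for a review before anything is removed. The sketch below strings this together with the list-deferred and deferred-deletes commands from the command list. It assumes that both commands accept --read-live-set-id-from in the same way as sweep and delete, so check their --help output first. All function names and the file name live-set-id.txt are made up for this example.

```shell
# Hypothetical wrapper: run one GC tool command against the example database.
nessie_gc() {
  command_name="$1"
  shift
  java -jar nessie-gc.jar "$command_name" \
    --jdbc \
    --jdbc-url jdbc:postgresql://localhost:5432/nessie_gc \
    --jdbc-user pguser \
    --jdbc-password mysecretpassword \
    "$@"
}

# Step 1: identify orphan files and record them instead of deleting them.
collect_deferred_deletes() {
  nessie_gc gc \
    --uri http://localhost:19120/api/v2 \
    --defer-deletes \
    --write-live-set-id-to live-set-id.txt
}

# Step 2: print the recorded files for review.
# Assumption: list-deferred accepts --read-live-set-id-from.
review_deferred_deletes() {
  nessie_gc list-deferred --read-live-set-id-from live-set-id.txt
}

# Step 3: delete the recorded files once the review is done.
# Assumption: deferred-deletes accepts --read-live-set-id-from.
apply_deferred_deletes() {
  nessie_gc deferred-deletes --read-live-set-id-from live-set-id.txt
}
```

Run the three functions in order. The steps that touch the object store will likely also need the -H or -I options described above.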

list

Usage: nessie-gc.jar list [-hV] [--time-zone=<zoneId>] ([--inmemory] | [[--jdbc] --jdbc-url=<url>
                          [--jdbc-properties[=<String=String>[,<String=String>...]...]]...
                          [--jdbc-user=<user>] [--jdbc-password=<password>]
                          [--jdbc-schema=<schemaCreateStrategy>]])
List existing live-sets, must not be used with the in-memory contents-storage.
  -h, --help                 Show this help message and exit.
      --inmemory             Flag whether to use the in-memory contents storage. Prefer a JDBC
                               storage.
      --jdbc                 Flag whether to use the JDBC contents storage.
      --jdbc-password=<password>
                             JDBC password used to authenticate the database access.
      --jdbc-properties[=<String=String>[,<String=String>...]...]
                             JDBC parameters.
      --jdbc-schema=<schemaCreateStrategy>
                             How to create the database schema. Possible values: CREATE,
                               DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
      --jdbc-url=<url>       JDBC URL of the database to connect to.
      --jdbc-user=<user>     JDBC user name used to authenticate the database access.
      --time-zone=<zoneId>   Time zone ID used to show timestamps.
                             Defaults to system time zone.
  -V, --version              Print version information and exit.

delete

Usage: nessie-gc.jar delete [-hV] [--time-zone=<zoneId>] ([--inmemory] | [[--jdbc] --jdbc-url=<url>
                            [--jdbc-properties[=<String=String>[,<String=String>...]...]]...
                            [--jdbc-user=<user>] [--jdbc-password=<password>]
                            [--jdbc-schema=<schemaCreateStrategy>]]) (-l=<liveSetId> |
                            -L=<liveSetIdFile>)
Delete a live-set, must not be used with the in-memory contents-storage.
  -h, --help                 Show this help message and exit.
      --inmemory             Flag whether to use the in-memory contents storage. Prefer a JDBC
                               storage.
      --jdbc                 Flag whether to use the JDBC contents storage.
      --jdbc-password=<password>
                             JDBC password used to authenticate the database access.
      --jdbc-properties[=<String=String>[,<String=String>...]...]
                             JDBC parameters.
      --jdbc-schema=<schemaCreateStrategy>
                             How to create the database schema. Possible values: CREATE,
                               DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
      --jdbc-url=<url>       JDBC URL of the database to connect to.
      --jdbc-user=<user>     JDBC user name used to authenticate the database access.
  -l, --live-set-id=<liveSetId>
                             ID of the live content set.
  -L, --read-live-set-id-from=<liveSetIdFile>
                             The file to read the live-set-id from.
      --time-zone=<zoneId>   Time zone ID used to show timestamps.
                             Defaults to system time zone.
  -V, --version              Print version information and exit.

list-deferred

Usage: nessie-gc.jar list-deferred [-hV] [--time-zone=<zoneId>] ([--inmemory] | [[--jdbc]
                                   --jdbc-url=<url> [--jdbc-properties[=<String=String>[,
                                   <String=String>...]...]]... [--jdbc-user=<user>]
                                   [--jdbc-password=<password>]
                                   [--jdbc-schema=<schemaCreateStrategy>]]) (-l=<liveSetId> |
                                   -L=<liveSetIdFile>)
List files collected as deferred deletes, must not be used with the in-memory contents-storage.
  -h, --help                 Show this help message and exit.
      --inmemory             Flag whether to use the in-memory contents storage. Prefer a JDBC
                               storage.
      --jdbc                 Flag whether to use the JDBC contents storage.
      --jdbc-password=<password>
                             JDBC password used to authenticate the database access.
      --jdbc-properties[=<String=String>[,<String=String>...]...]
                             JDBC parameters.
      --jdbc-schema=<schemaCreateStrategy>
                             How to create the database schema. Possible values: CREATE,
                               DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
      --jdbc-url=<url>       JDBC URL of the database to connect to.
      --jdbc-user=<user>     JDBC user name used to authenticate the database access.
  -l, --live-set-id=<liveSetId>
                             ID of the live content set.
  -L, --read-live-set-id-from=<liveSetIdFile>
                             The file to read the live-set-id from.
      --time-zone=<zoneId>   Time zone ID used to show timestamps.
                             Defaults to system time zone.
  -V, --version              Print version information and exit.

deferred-deletes

Usage: nessie-gc.jar deferred-deletes [-hV] [--time-zone=<zoneId>] [-H=<String=String>[,
                                      <String=String>...]]... [-I=<String=String>[,
                                      <String=String>...]]... ([--inmemory] | [[--jdbc]
                                      --jdbc-url=<url> [--jdbc-properties[=<String=String>[,
                                      <String=String>...]...]]... [--jdbc-user=<user>]
                                      [--jdbc-password=<password>]
                                      [--jdbc-schema=<schemaCreateStrategy>]]) (-l=<liveSetId> |
                                      -L=<liveSetIdFile>)
Delete files collected as deferred deletes, must not be used with the in-memory contents-storage.
  -h, --help                 Show this help message and exit.
  -H, --hadoop=<String=String>[,<String=String>...]
                             Hadoop configuration option, required when using an Iceberg FileIO
                               that is not S3FileIO.
                             The following configuration settings might be required.

                             For S3:
                             - fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
                             - fs.s3a.access.key
                             - fs.s3a.secret.key
                             - fs.s3a.endpoint, if you use an S3 compatible object store like MinIO

                             For GCS:
                             - fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
                             - fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.
                               GoogleHadoopFS
                             - fs.gs.project.id
                             - fs.gs.auth.type=USER_CREDENTIALS
                             - fs.gs.auth.client.id
                             - fs.gs.auth.client.secret
                             - fs.gs.auth.refresh.token

                             For ADLS:
                             - fs.azure.impl=org.apache.hadoop.fs.azure.AzureNativeFileSystemStore
                             - fs.AbstractFileSystem.azure.impl=org.apache.hadoop.fs.azurebfs.Abfs
                             - fs.azure.storage.emulator.account.name
                             - fs.azure.account.auth.type=SharedKey
                             - fs.azure.account.key.<account>=<base-64-encoded-secret>
  -I, --iceberg=<String=String>[,<String=String>...]
                             Iceberg properties used to configure the FileIO.
                             The following properties are almost always required.

                             For S3:
                             - s3.access-key-id
                             - s3.secret-access-key
                             - s3.endpoint, if you use an S3 compatible object store like MinIO

                             For GCS:
                             - io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO
                             - gcs.project-id
                             - gcs.oauth2.token

                             For ADLS:
                             - io-impl=org.apache.iceberg.azure.adlsv2.ADLSFileIO
                             - adls.auth.shared-key.account.name
                             - adls.auth.shared-key.account.key
      --inmemory             Flag whether to use the in-memory contents storage. Prefer a JDBC
                               storage.
      --jdbc                 Flag whether to use the JDBC contents storage.
      --jdbc-password=<password>
                             JDBC password used to authenticate the database access.
      --jdbc-properties[=<String=String>[,<String=String>...]...]
                             JDBC parameters.
      --jdbc-schema=<schemaCreateStrategy>
                             How to create the database schema. Possible values: CREATE,
                               DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
      --jdbc-url=<url>       JDBC URL of the database to connect to.
      --jdbc-user=<user>     JDBC user name used to authenticate the database access.
  -l, --live-set-id=<liveSetId>
                             ID of the live content set.
  -L, --read-live-set-id-from=<liveSetIdFile>
                             The file to read the live-set-id from.
      --time-zone=<zoneId>   Time zone ID used to show timestamps.
                             Defaults to system time zone.
  -V, --version              Print version information and exit.

show

Usage: nessie-gc.jar show [-BCDhV] [--time-zone=<zoneId>] ([--inmemory] | [[--jdbc]
                          --jdbc-url=<url> [--jdbc-properties[=<String=String>[,
                          <String=String>...]...]]... [--jdbc-user=<user>]
                          [--jdbc-password=<password>] [--jdbc-schema=<schemaCreateStrategy>]])
                          (-l=<liveSetId> | -L=<liveSetIdFile>)
Show information of a live-content-set, must not be used with the in-memory contents-storage.
  -B, --with-base-locations  Show base locations.
  -C, --with-content-references
                             Show content references.
  -D, --with-deferred-deletes
                             Show deferred deletes.
  -h, --help                 Show this help message and exit.
      --inmemory             Flag whether to use the in-memory contents storage. Prefer a JDBC
                               storage.
      --jdbc                 Flag whether to use the JDBC contents storage.
      --jdbc-password=<password>
                             JDBC password used to authenticate the database access.
      --jdbc-properties[=<String=String>[,<String=String>...]...]
                             JDBC parameters.
      --jdbc-schema=<schemaCreateStrategy>
                             How to create the database schema. Possible values: CREATE,
                               DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
      --jdbc-url=<url>       JDBC URL of the database to connect to.
      --jdbc-user=<user>     JDBC user name used to authenticate the database access.
  -l, --live-set-id=<liveSetId>
                             ID of the live content set.
  -L, --read-live-set-id-from=<liveSetIdFile>
                             The file to read the live-set-id from.
      --time-zone=<zoneId>   Time zone ID used to show timestamps.
                             Defaults to system time zone.
  -V, --version              Print version information and exit.

show-sql-create-schema-script

Usage: nessie-gc.jar show-sql-create-schema-script [-hV] [--output-file=<outputFile>]
Print DDL statements to create the schema.
  -h, --help      Show this help message and exit.
      --output-file=<outputFile>

  -V, --version   Print version information and exit.

create-sql-schema

Usage: nessie-gc.jar create-sql-schema [-hV] ([--jdbc] --jdbc-url=<url> [--jdbc-properties
                                       [=<String=String>[,<String=String>...]...]]...
                                       [--jdbc-user=<user>] [--jdbc-password=<password>]
                                       [--jdbc-schema=<schemaCreateStrategy>])
JDBC schema creation.
  -h, --help               Show this help message and exit.
      --jdbc               Flag whether to use the JDBC contents storage.
      --jdbc-password=<password>
                           JDBC password used to authenticate the database access.
      --jdbc-properties[=<String=String>[,<String=String>...]...]
                           JDBC parameters.
      --jdbc-schema=<schemaCreateStrategy>
                           How to create the database schema. Possible values: CREATE,
                             DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
      --jdbc-url=<url>     JDBC URL of the database to connect to.
      --jdbc-user=<user>   JDBC user name used to authenticate the database access.
  -V, --version            Print version information and exit.

completion-script

Usage: nessie-gc.jar completion-script [-hV] -O=<outputFile>
Extracts the command-line completion script.
  -h, --help      Show this help message and exit.
  -O, --output-file=<outputFile>
                  Completion script file name.
  -V, --version   Print version information and exit.

Nessie GC for Nessie Administrators

Please refer to the Garbage Collection documentation for information on how to run the Nessie GC on a regular basis in production.

Nessie GC Internals

The rest of this document describes the internals of the Nessie GC tool and is intended for developers who want to understand how the tool works.

The GC tool consists of a gc-base module, which contains the general base functionality to access a repository to identify the live contents, to identify the live files, to list the existing files and to purge orphan files.

Modules that supplement the gc-base module:

  • gc-iceberg implements the Iceberg table-format specific functionality.
  • gc-iceberg-files implements file listing + deletion using Iceberg’s FileIO.
  • gc-iceberg-mock is a testing-only module to generate mock metadata, manifest-lists, manifests and (empty) data files.
  • gc-repository-jdbc implements the live-content-sets-store using JDBC (PostgreSQL, MariaDB, MySQL and any other compatible database).
  • s3mock is a testing-only module containing an S3 mock backend that allows listing objects and getting objects programmatically.
  • s3mino is a JUnit 5 test extension providing a MinIO-based S3 backend.

The gc-tool module is the command-line interface, a standalone tool provided as an executable. It is an uber-jar prefixed with a shell script, and it can still be executed with java -jar ...

Basic Nessie-GC functionality

Nessie-GC implements a mark-and-sweep approach, a two-phase process:

The “mark phase”, or “live content identification”, walks all named references and collects references to all Content objects that are considered as live. Those references are stored in a repository as a “live contents set”. The “mark phase” is implemented in IdentifyLiveContents.

The “sweep phase”, or “delete orphan files”, operates per content-id. For each content, all live versions of a Content are scanned to identify the set of live data files. After that, the base-location(s) are scanned and all files that are not in the set of live data files are deleted. The “sweep phase” is implemented by DefaultLocalExpire.

Inner workings

To minimize the amount of data needed to match against the set of live data files for a Content, the implementation does not keep every individual data file, for example in a java.util.Set, but records all data files in a bloom filter.
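
The idea can be illustrated with a small, self-contained sketch. This is not the Nessie GC implementation: the class name LiveFilesBloomSketch, the hash functions and the sizing are assumptions chosen only for illustration. It shows the property the sweep relies on: a file that was put into the filter is never reported as an orphan, while a file that is not live is reported as an orphan unless it happens to be a false positive.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Hypothetical, minimal bloom filter over file paths; illustration only. */
class LiveFilesBloomSketch {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  LiveFilesBloomSketch(int numBits, int numHashes) {
    this.bits = new BitSet(numBits);
    this.numBits = numBits;
    this.numHashes = numHashes;
  }

  // Derives the i-th bit index from two base hashes (double hashing).
  private int index(String value, int i) {
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    int h1 = Arrays.hashCode(bytes);
    int h2 = fnv1a(bytes);
    return Math.floorMod(h1 + i * h2, numBits);
  }

  // 32-bit FNV-1a hash, used as the second base hash.
  private static int fnv1a(byte[] bytes) {
    int hash = 0x811c9dc5;
    for (byte b : bytes) {
      hash ^= (b & 0xff);
      hash *= 0x01000193;
    }
    return hash;
  }

  /** Records a live file. */
  void put(String path) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(index(path, i));
    }
  }

  /** Returns false only if the file was definitely never put. */
  boolean mightContain(String path) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(index(path, i))) {
        return false;
      }
    }
    return true;
  }

  /** Returns the listed files that are definitely not in the live set. */
  static List<String> orphans(LiveFilesBloomSketch live, List<String> listed) {
    List<String> result = new ArrayList<>();
    for (String file : listed) {
      if (!live.mightContain(file)) {
        result.add(file);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    LiveFilesBloomSketch live = new LiveFilesBloomSketch(1 << 16, 3);
    live.put("metadata/snap-1.avro");
    live.put("data/file-1.parquet");
    List<String> listed =
        List.of("metadata/snap-1.avro", "data/file-1.parquet", "data/file-0.parquet");
    System.out.println("Orphan candidates: " + orphans(live, listed));
  }
}
```

A false positive only means that an orphan file survives a GC run; a live file is never deleted because of the filter.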

Both the “mark” (identify live contents) and “sweep” (identify and delete expired contents) phases provide configurable parallelism: the number of concurrently scanned named references and the number of concurrently processed tables can each be configured.
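
As a rough illustration of bounded parallelism, the following sketch processes a list of work items (think: named references or tables) with a fixed-size thread pool. The class ParallelSweepSketch and its processAll method are hypothetical names for this example and do not mirror the actual Nessie GC classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper that runs per-item work with a configurable parallelism. */
class ParallelSweepSketch {

  /** Applies work to every item, using at most parallelism threads; results keep input order. */
  static <T, R> List<R> processAll(List<T> items, int parallelism, Function<T, R> work) {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    try {
      List<Future<R>> futures = new ArrayList<>();
      for (T item : items) {
        Callable<R> task = () -> work.apply(item);
        futures.add(pool.submit(task));
      }
      List<R> results = new ArrayList<>();
      for (Future<R> future : futures) {
        results.add(future.get());
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for a task", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Task failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    List<String> contentIds = List.of("table-a", "table-b", "table-c");
    List<String> done = processAll(contentIds, 2, id -> "swept " + id);
    System.out.println(done);
  }
}
```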

Mark phase optimization

The implementation that walks the commit logs can be configured with a VisitedDeduplicator, which is meant to reduce the work required during the “mark” phase, if the commit to be examined has already been processed.

There is a DefaultVisitedDeduplicator implementation, but it likely requires too much memory at runtime, especially when the identify-run is configured with multiple GC policies and/or has to walk many commits. Because of these concerns, DefaultVisitedDeduplicator is not available in the Nessie GC tool, and its use is neither supported nor recommended.
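
To make the memory concern concrete, here is a minimal sketch of what such a deduplication could look like. It is an assumption for illustration, not the actual VisitedDeduplicator interface or the DefaultVisitedDeduplicator code: the class VisitedTrackerSketch simply remembers every commit ID seen per cut-off time, so its memory use grows with the number of distinct cut-off times multiplied by the number of commits walked.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical tracker of already-walked commits, keyed by cut-off time. Not thread-safe. */
class VisitedTrackerSketch {
  private final Map<Instant, Set<String>> visitedByCutOff = new HashMap<>();

  /**
   * Returns true if the commit was already seen for this cut-off time.
   * Otherwise records the commit and returns false.
   */
  boolean alreadyVisited(Instant cutOff, String commitId) {
    Set<String> visited = visitedByCutOff.computeIfAbsent(cutOff, k -> new HashSet<>());
    return !visited.add(commitId);
  }

  /** Total number of remembered (cut-off, commit) pairs, a proxy for memory use. */
  int rememberedEntries() {
    int total = 0;
    for (Set<String> visited : visitedByCutOff.values()) {
      total += visited.size();
    }
    return total;
  }

  public static void main(String[] args) {
    VisitedTrackerSketch tracker = new VisitedTrackerSketch();
    Instant cutOff = Instant.parse("2024-01-01T00:00:00Z");
    System.out.println(tracker.alreadyVisited(cutOff, "commit-1"));
    System.out.println(tracker.alreadyVisited(cutOff, "commit-1"));
    System.out.println(tracker.rememberedEntries());
  }
}
```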

Identified live contents repository

It is recommended to use an external database for the Nessie GC repository. This is especially recommended for big Nessie repositories.

Nessie GC runs against smaller repositories do technically work with an in-memory repository. However, as the term “in memory” suggests, the identified live-contents-set, its state, duration, etc. cannot be inspected afterwards.

Pluggable code

Different parts / functionalities are quite isolated and abstracted to allow proper unit-testability and also allow reuse of similar functionality.

Examples of abstracted/isolated functionality:

  • Functionality to recursively walk a base location
  • Functionality to delete files
  • Nessie GC repository
  • Getting all data files for a specific content reference (think: Iceberg table snapshot)
  • Commit-log-scanning duplicate work elimination

File references

All files (or objects, in case of an object store like S3) are described using a FileReference, which consists of a base URI plus a URI relative to the base URI. Noteworthy: the “sweep phase”, which “collects” all live files in a bloom filter and after that lists files in all base URIs, always uses only the relative URI, never the full URI, to check whether a file is orphan or probably not (a bloom filter is a probabilistic data structure).
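
The split into a base URI and a relative URI can be sketched with java.net.URI alone. The class FileReferenceSketch below is a hypothetical stand-in and not the real FileReference type; the bucket and paths are made-up example values.

```java
import java.net.URI;

/** Hypothetical stand-in for a file reference: a base URI plus a path relative to it. */
class FileReferenceSketch {
  final URI base;
  final URI path; // relative to base; this is the part checked against the bloom filter

  private FileReferenceSketch(URI base, URI path) {
    this.base = base;
    this.path = path;
  }

  /** Builds a reference from a base location and an absolute file URI below that base. */
  static FileReferenceSketch of(URI base, URI absolute) {
    URI relative = base.relativize(absolute);
    // relativize() returns the given URI unchanged when it is not located below the base
    if (relative.isAbsolute()) {
      throw new IllegalArgumentException(absolute + " is not located below " + base);
    }
    return new FileReferenceSketch(base, relative);
  }

  /** Reconstructs the full URI of the file. */
  URI absolutePath() {
    return base.resolve(path);
  }

  public static void main(String[] args) {
    URI base = URI.create("s3://bucket/warehouse/table/");
    URI file = URI.create("s3://bucket/warehouse/table/data/file-1.parquet");
    FileReferenceSketch ref = FileReferenceSketch.of(base, file);
    System.out.println("relative: " + ref.path);
    System.out.println("absolute: " + ref.absolutePath());
  }
}
```

Note that the base URI in this sketch ends with a slash, so that resolving the relative path appends to the base location instead of replacing its last path segment.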

Since object stores are the primary target, only files but not directories are supported. Object stores do not know about directories, and Iceberg’s FileIO does not know about directories either. For file systems that do support directories, this means that empty directories will not be deleted; prematurely deleting directories could break concurrent operations.

Runtime requirements

Nessie GC work is dominated by network and/or disk I/O, less by CPU and heap pressure.

Memory requirements (rough estimates):

  • Number of concurrent content-scans (“sweep phase”) times the bloom-filter on-heap size (assume that this can be a couple of MB, depending on the expected number of files and the allowed false-positive ratio).
  • Duplicate-commit-log-walk elimination requires some amount of memory for each distinct cut-off time times the (possible) number of commits over the matching references.
  • Additional memory is required for the currently processed chunks of metadata, for example Iceberg’s table-metadata and manifests, to identify the live data files. (The raw metadata is only read and processed, but not memoized.)
  • An in-memory live-contents-repository (not recommended for production workloads) requires memory for all content-references.
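
The first estimate above can be made concrete with the standard bloom-filter sizing formula, m = -n · ln(p) / (ln 2)², where n is the expected number of entries and p the allowed false-positive ratio. The sketch below is a back-of-the-envelope calculation under that textbook formula; the actual on-heap size of the filter used by Nessie GC may differ, and the class and method names are made up for this example.

```java
/** Back-of-the-envelope bloom filter sizing; names and numbers are illustrative assumptions. */
class BloomSizingSketch {

  /** Optimal number of bits for the expected number of entries and false-positive ratio. */
  static long requiredBits(long expectedEntries, double falsePositiveRatio) {
    double ln2 = Math.log(2);
    return (long) Math.ceil(-expectedEntries * Math.log(falsePositiveRatio) / (ln2 * ln2));
  }

  /** Rough heap estimate: one filter per concurrently scanned content. */
  static long heapBytes(int concurrentScans, long expectedEntries, double falsePositiveRatio) {
    long bytesPerFilter = (requiredBits(expectedEntries, falsePositiveRatio) + 7) / 8;
    return concurrentScans * bytesPerFilter;
  }

  public static void main(String[] args) {
    long perFilter = heapBytes(1, 1_000_000L, 0.00001);
    long total = heapBytes(4, 1_000_000L, 0.00001);
    System.out.println("bytes per filter: " + perFilter);
    System.out.println("bytes for 4 concurrent scans: " + total);
  }
}
```

For one million files and a false-positive ratio of 0.00001, this yields roughly 3 MB per filter, which is in line with the “couple of MB” assumption above.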

CPU & heap pressure testing

Special “tests” (this and this) have been used to verify that even a huge number of objects neither lets a tiny Java heap “explode” nor uses excessive CPU resources. Each “test” simulates a Nessie repository with many references, commits, contents and files per content version. Runs of these tests under a profiler showed that the implementation requires little memory and little CPU; runtime is largely dominated by bloom-filter put and maybe-contains operations for the per-content-expire runs. Both tests proved the concept.

Deferred deletion

The default behavior is to immediately delete orphan files. It is also possible to record the files to be deleted and delete them later. The nessie-gc.jar tool supports deferred deletion.
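
Conceptually, deferred deletion separates “record the orphan file” from “actually delete it”. The following sketch shows that separation with a hypothetical in-memory DeferredDeleteSketch class; in the tool itself the deferred deletes are stored in the database and handled by the list-deferred and deferred-deletes commands, so the names and structure here are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical in-memory recorder for files whose deletion is deferred. */
class DeferredDeleteSketch {
  private final List<String> deferred = new ArrayList<>();

  /** Sweep phase: remember the orphan file instead of deleting it right away. */
  void record(String orphanFile) {
    deferred.add(orphanFile);
  }

  /** Files recorded so far that have not been deleted yet. */
  List<String> pending() {
    return List.copyOf(deferred);
  }

  /** Later step: hand every recorded file to the actual deleter and return the count. */
  int deleteDeferred(Consumer<String> deleter) {
    int count = 0;
    for (String file : deferred) {
      deleter.accept(file);
      count++;
    }
    deferred.clear();
    return count;
  }

  public static void main(String[] args) {
    DeferredDeleteSketch sketch = new DeferredDeleteSketch();
    sketch.record("data/file-0.parquet");
    sketch.record("metadata/snap-0.avro");
    System.out.println("pending: " + sketch.pending());
    int deleted = sketch.deleteDeferred(file -> System.out.println("deleting " + file));
    System.out.println("deleted: " + deleted);
  }
}
```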

Non-Nessie use cases

Although all the above is designed for Nessie, it is possible to reuse the core implementation with “plain” Iceberg, effectively a complete replacement of Iceberg’s expire snapshots and delete orphan files, but without Iceberg’s implicit requirement of using Spark. Things needed for this:

  • A “pure Iceberg” implementation of org.projectnessie.gc.repository.RepositoryConnector:
      • Return one reference per Iceberg table, derived from the underlying Iceberg catalog.
      • Provide a commit log with one Put operation for each Iceberg snapshot.
      • (The allContents function can return an empty Stream for the “pure Iceberg” use case.)
  • Existing functionality, the mark-and-sweep logic and the code in nessie-gc-iceberg and nessie-gc-iceberg-files, can be reused without any changes.

Potential future enhancements

Since Nessie GC keeps track of every content-reference that was ever live and every base content location that was ever known, it is possible to identify …

  • … the base content locations that are no longer used. In other words: storage of e.g. Iceberg tables that have been deleted and are no longer referenced in any live Nessie commit.
  • … the content references (aka Iceberg snapshots) that are no longer used. This information can be used to stop exposing the affected snapshots (for example Iceberg snapshots) in any table metadata.

Completely unreferenced contents

Files of contents that are not visible from any live Nessie commit can be completely removed. Detecting this situation is not directly supported by the above approach.

The live-contents-set generated by Nessie GC’s identify phase contains all content IDs that are “live”. Nessie (the server) could help here (and this approach is really just a thought): the “live” content IDs would be sent to Nessie, and Nessie would return one content object for each content ID that is not contained in the set of “live” content IDs. Another implementation would then be responsible for inspecting the returned contents and purging the base locations in the data lake where the data files, manifests, etc. were stored.

The above must not purge files for content IDs that have just been recently created.

Potential Iceberg specific enhancements

Nessie GC can easily identify the Iceberg snapshots, as each Nessie commit references exactly one Iceberg table snapshot. Nessie (the runtime/server) has no knowledge of whether a particular Iceberg partition-spec, sort-order or schema is used via any live Iceberg snapshot or not, because partition-specs, sort-orders and schemas are referenced via Iceberg manifests. Although Nessie GC can identify these three kinds of structures during the identification of live contents, the expire phase, which maps content references to files, does not respect Nessie commit order.

So it is necessary to fully think through a potential “expire specs/sort-orders/schemas” feature, keeping in mind:

  • Commit-chains are important, because IDs of partition-specs, sort-orders and schemas are assigned sequentially (as 32-bit ints).
  • The logically same partition-specs, sort-orders and schemas may exist with different IDs on different Nessie references.
  • Partition-specs, sort-orders and schemas are only maintained in table-metadata, but referenced from “deeper” structures (manifest list, manifest files, data files).
  • Is it really worth having an “expire specs/sort-orders/schemas” feature at all?