Nessie GC¶
Nessie GC is a tool to clean up orphaned files in a Nessie repository. It is designed to be run periodically to keep the repository clean and to avoid unnecessary storage costs.
Requirements¶
The Nessie GC tool is distributed as an uber-jar and requires Java 11 or later to be available on the host where it is running.
It is also available as a Docker image, see below for more information.
The Nessie GC tool requires a running Nessie server and a JDBC-compliant database; both must be reachable from the host where the GC tool runs. The database is used to store the live content sets and the deferred deletes.
Nessie GC has built-in support for PostgreSQL, MariaDB, MySQL (using the MariaDB driver), and H2 databases.
Note
Although the GC tool can run in in-memory mode, a persistent database is recommended for production use. Any JDBC-compliant database can be used, but it must be created and its schema initialized before running the Nessie GC tool.
Running the standalone uber jar¶
Check the download options on the Nessie download page.
To see the available commands and options, run:
java -jar nessie-gc.jar --help
You should see the following output:
Usage: nessie-gc.jar [-hV] [COMMAND]
-h, --help Show this help message and exit.
-V, --version Print version information and exit.
Commands:
help Display help information about the specified command.
mark-live, identify, mark Run identify-live-content phase of Nessie GC, must not be used
with the in-memory contents-storage.
sweep, expire Run expire-files + delete-orphan-files phase of Nessie GC using a
live-contents-set stored by a previous run of the mark-live
command, must not be used with the in-memory contents-storage.
gc Run identify-live-content and expire-files + delete-orphan-files.
list List existing live-sets, must not be used with the in-memory
contents-storage.
delete Delete a live-set, must not be used with the in-memory
contents-storage.
list-deferred List files collected as deferred deletes, must not be used with
the in-memory contents-storage.
deferred-deletes Delete files collected as deferred deletes, must not be used with
the in-memory contents-storage.
show Show information of a live-content-set, must not be used with the
in-memory contents-storage.
show-sql-create-schema-script Print DDL statements to create the schema.
create-sql-schema JDBC schema creation.
completion-script Extracts the command-line completion script.
show-licenses Show 3rd party license information.
Info
Help for all Nessie GC tool commands is available below on this page.
The following example assumes that you have a Nessie server running at http://localhost:19120 and a PostgreSQL instance running at jdbc:postgresql://localhost:5432/nessie_gc with user pguser and password mysecretpassword.
Create the database schema if required:
java -jar nessie-gc.jar create-sql-schema \
--jdbc-url jdbc:postgresql://localhost:5432/nessie_gc \
--jdbc-user pguser \
--jdbc-password mysecretpassword
Now we can run the Nessie GC tool:
java -jar nessie-gc.jar gc \
--uri http://localhost:19120/api/v2 \
--jdbc \
--jdbc-url jdbc:postgresql://localhost:5432/nessie_gc \
--jdbc-user pguser \
--jdbc-password mysecretpassword
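The gc command runs both phases in one invocation. The mark-live and sweep commands run the two phases separately, handing the live-set ID over through a file, so that identification and expiry can run at different times or on different hosts. A sketch using the same Nessie and PostgreSQL endpoints as above; the file name /tmp/nessie-gc-live-set-id is illustrative:

```shell
# Phase 1: identify live content and persist the live-set ID to a file
java -jar nessie-gc.jar mark-live \
  --uri http://localhost:19120/api/v2 \
  --jdbc \
  --jdbc-url jdbc:postgresql://localhost:5432/nessie_gc \
  --jdbc-user pguser \
  --jdbc-password mysecretpassword \
  --write-live-set-id-to /tmp/nessie-gc-live-set-id

# Phase 2: expire files and delete orphan files for that live-set
java -jar nessie-gc.jar sweep \
  --jdbc \
  --jdbc-url jdbc:postgresql://localhost:5432/nessie_gc \
  --jdbc-user pguser \
  --jdbc-password mysecretpassword \
  --read-live-set-id-from /tmp/nessie-gc-live-set-id
```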
Running with Docker¶
The tool is also available as a Docker image, hosted on GitHub Container Registry. Images are also mirrored to Quay.io.
See Docker for more information.
For testing purposes, let’s start a PostgreSQL database with Docker:
docker run --rm -e POSTGRES_USER=pguser -e POSTGRES_PASSWORD=mysecretpassword -e POSTGRES_DB=nessie_gc -p 5432:5432 postgres:16.2
Create the database schema if required:
docker run --rm ghcr.io/projectnessie/nessie-gc:0.94.2 create-sql-schema \
--jdbc-url jdbc:postgresql://127.0.0.1:5432/nessie_gc \
--jdbc-user pguser \
--jdbc-password mysecretpassword
Now we can run the Nessie GC tool:
docker run --rm ghcr.io/projectnessie/nessie-gc:0.94.2 gc \
--jdbc-url jdbc:postgresql://127.0.0.1:5432/nessie_gc \
--jdbc-user pguser \
--jdbc-password mysecretpassword
The GC tool has a great number of options, which can be seen by running docker run --rm ghcr.io/projectnessie/nessie-gc:0.94.2 --help. The main command is gc, which is followed by further options. Check the available subcommands and options by running docker run --rm ghcr.io/projectnessie/nessie-gc:0.94.2 gc --help.
Running with Kubernetes¶
The Nessie GC tool can be executed as a Job or a CronJob in a Kubernetes cluster.
The following example assumes that you have a Nessie deployment and a PostgreSQL instance, all running in the same cluster and in the same namespace.
Create a secret for the database credentials:
kubectl create secret generic nessie-gc-credentials \
--from-literal=JDBC_URL=jdbc:postgresql://postgresql:5432/nessie_gc \
--from-literal=JDBC_USER=pguser \
--from-literal=JDBC_PASSWORD=mysecretpassword
Assuming that the Nessie service is reachable at nessie:19120, create the following Kubernetes job to run the GC tool:
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
name: nessie-gc-job
spec:
template:
spec:
containers:
- name: nessie-gc
image: ghcr.io/projectnessie/nessie-gc:0.94.2
args:
- gc
- --uri
- http://nessie:19120/api/v2
- --jdbc
- --jdbc-url
- "\$(JDBC_URL)"
- --jdbc-user
- "\$(JDBC_USER)"
- --jdbc-password
- "\$(JDBC_PASSWORD)"
envFrom:
- secretRef:
name: nessie-gc-credentials
restartPolicy: Never
EOF
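To run the GC tool on a schedule, the same container spec can be wrapped in a CronJob instead of a Job. A sketch assuming the same secret and service names as above; the schedule and concurrencyPolicy values are illustrative:

```shell
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nessie-gc-cronjob
spec:
  schedule: "0 3 * * 0"        # illustrative: every Sunday at 03:00
  concurrencyPolicy: Forbid    # never start a run while one is in flight
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: nessie-gc
            image: ghcr.io/projectnessie/nessie-gc:0.94.2
            args:
            - gc
            - --uri
            - http://nessie:19120/api/v2
            - --jdbc
            - --jdbc-url
            - "\$(JDBC_URL)"
            - --jdbc-user
            - "\$(JDBC_USER)"
            - --jdbc-password
            - "\$(JDBC_PASSWORD)"
            envFrom:
            - secretRef:
                name: nessie-gc-credentials
          restartPolicy: Never
EOF
```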
Nessie GC Tool commands¶
Usage: nessie-gc.jar [-hV] [COMMAND]
-h, --help Show this help message and exit.
-V, --version Print version information and exit.
Commands:
help Display help information about the specified command.
mark-live, identify, mark Run identify-live-content phase of Nessie GC, must not be used
with the in-memory contents-storage.
sweep, expire Run expire-files + delete-orphan-files phase of Nessie GC using a
live-contents-set stored by a previous run of the mark-live
command, must not be used with the in-memory contents-storage.
gc Run identify-live-content and expire-files + delete-orphan-files.
list List existing live-sets, must not be used with the in-memory
contents-storage.
delete Delete a live-set, must not be used with the in-memory
contents-storage.
list-deferred List files collected as deferred deletes, must not be used with
the in-memory contents-storage.
deferred-deletes Delete files collected as deferred deletes, must not be used with
the in-memory contents-storage.
show Show information of a live-content-set, must not be used with the
in-memory contents-storage.
show-sql-create-schema-script Print DDL statements to create the schema.
create-sql-schema JDBC schema creation.
completion-script Extracts the command-line completion script.
show-licenses Show 3rd party license information.
Below is the output of the Nessie GC tool help for all commands.
mark-live, identify, mark¶
Usage: nessie-gc.jar mark-live [-hV] [-c=<defaultCutoffPolicy>]
[--identify-parallelism=<parallelism>] [--nessie-api=<nessieApi>]
[--nessie-client=<nessieClientName>] [-R=<cutoffPolicyRefTime>]
[--time-zone=<zoneId>] [-u=<nessieUri>]
[--write-live-set-id-to=<liveSetIdFile>] [-C[=<String=String>[,
<String=String>...]...]]... [-o[=<String=String>[,
<String=String>...]...]]... ([--inmemory] | [[--jdbc]
--jdbc-url=<url> [--jdbc-properties[=<String=String>[,
<String=String>...]...]]... [--jdbc-user=<user>]
[--jdbc-password=<password>] [--jdbc-schema=<schemaCreateStrategy>]])
Run identify-live-content phase of Nessie GC, must not be used with the in-memory contents-storage.
-c, --default-cutoff=<defaultCutoffPolicy>
Default cutoff policy. Policies can be one of:
- number of commits as an integer value
- a duration (see java.time.Duration)
- an ISO instant
- 'NONE', means everything's considered as live
-C, --cutoff[=<String=String>[,<String=String>...]...]
Cutoff policies per reference names. Supplied as a
ref-name-pattern=policy tuple.
Reference name patterns are regular expressions.
Policies can be one of:
- number of commits as an integer value
- a duration (see java.time.Duration)
- an ISO instant
- 'NONE', means everything's considered as live
-h, --help Show this help message and exit.
--identify-parallelism=<parallelism>
Number of Nessie references that can be walked in parallel.
--inmemory Flag whether to use the in-memory contents storage. Prefer a JDBC
storage.
--jdbc Flag whether to use the JDBC contents storage.
--jdbc-password=<password>
JDBC password used to authenticate the database access.
--jdbc-properties[=<String=String>[,<String=String>...]...]
JDBC parameters.
--jdbc-schema=<schemaCreateStrategy>
How to create the database schema. Possible values: CREATE,
DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
--jdbc-url=<url> JDBC URL of the database to connect to.
--jdbc-user=<user> JDBC user name used to authenticate the database access.
--nessie-api=<nessieApi>
Class name of the NessieClientBuilder implementation to use, defaults
to HttpClientBuilder suitable for REST. Using this parameter is not
recommended. Prefer the --nessie-client parameter instead.
--nessie-client=<nessieClientName>
Name of the Nessie client to use, defaults to HTTP suitable for REST.
-o, --nessie-option[=<String=String>[,<String=String>...]...]
Parameters to configure the NessieClientBuilder.
-R, --cutoff-ref-time=<cutoffPolicyRefTime>
Reference timestamp for durations specified for --cutoff. Defaults to
'now'.
--time-zone=<zoneId> Time zone ID used to show timestamps.
Defaults to system time zone.
-u, --uri=<nessieUri> Nessie API endpoint URI, defaults to http://localhost:19120/api/v2.
-V, --version Print version information and exit.
--write-live-set-id-to=<liveSetIdFile>
Optional, the file name to persist the created live-set-id to.
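Cutoff policies can be combined: a default policy plus per-reference overrides, where the reference name patterns are regular expressions. A sketch with illustrative values, keeping everything on release branches, the last 10 commits on main, and a 30-day duration everywhere else:

```shell
java -jar nessie-gc.jar mark-live \
  --uri http://localhost:19120/api/v2 \
  --jdbc \
  --jdbc-url jdbc:postgresql://localhost:5432/nessie_gc \
  --jdbc-user pguser \
  --jdbc-password mysecretpassword \
  --default-cutoff P30D \
  --cutoff "release/.*=NONE" \
  --cutoff "main=10"
```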
sweep, expire¶
Usage: nessie-gc.jar sweep [-hV] [--[no-]defer-deletes]
[--allowed-fpp=<allowedFalsePositiveProbability>]
[--expected-file-count=<expectedFileCount>]
[--expiry-parallelism=<parallelism>] [--fpp=<falsePositiveProbability>]
[--max-file-modification=<maxFileModificationTime>]
[--time-zone=<zoneId>] [-H=<String=String>[,<String=String>...]]...
[-I=<String=String>[,<String=String>...]]... ([--inmemory] | [[--jdbc]
--jdbc-url=<url> [--jdbc-properties[=<String=String>[,
<String=String>...]...]]... [--jdbc-user=<user>]
[--jdbc-password=<password>] [--jdbc-schema=<schemaCreateStrategy>]])
(-l=<liveSetId> | -L=<liveSetIdFile>)
Run expire-files + delete-orphan-files phase of Nessie GC using a live-contents-set stored by a
previous run of the mark-live command, must not be used with the in-memory contents-storage.
--allowed-fpp=<allowedFalsePositiveProbability>
The worst allowed effective false-positive-probability checked after
the files for a single content have been checked, defaults to 1.0E-4.
--[no-]defer-deletes Identified unused/orphan files are by default immediately deleted.
Using deferred deletion stores the files to be deleted, so they can
be inspected and deleted later. This option is incompatible with
--inmemory.
--expected-file-count=<expectedFileCount>
The total number of expected live files for a single content, defaults
to 1000000.
--expiry-parallelism=<parallelism>
Number of contents that are checked in parallel.
--fpp=<falsePositiveProbability>
The false-positive-probability used to construct the bloom-filter
identifying whether a file is live, defaults to 1.0E-5.
-h, --help Show this help message and exit.
-H, --hadoop=<String=String>[,<String=String>...]
Hadoop configuration option, required when using an Iceberg FileIO
that is not S3FileIO.
The following configuration settings might be required.
For S3:
- fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
- fs.s3a.access.key
- fs.s3a.secret.key
- fs.s3a.endpoint, if you use an S3 compatible object store like MinIO
For GCS:
- fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
- fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.
GoogleHadoopFS
- fs.gs.project.id
- fs.gs.auth.type=USER_CREDENTIALS
- fs.gs.auth.client.id
- fs.gs.auth.client.secret
- fs.gs.auth.refresh.token
For ADLS:
- fs.azure.impl=org.apache.hadoop.fs.azure.AzureNativeFileSystemStore
- fs.AbstractFileSystem.azure.impl=org.apache.hadoop.fs.azurebfs.Abfs
- fs.azure.storage.emulator.account.name
- fs.azure.account.auth.type=SharedKey
- fs.azure.account.key.<account>=<base-64-encoded-secret>
-I, --iceberg=<String=String>[,<String=String>...]
Iceberg properties used to configure the FileIO.
The following properties are almost always required.
For S3:
- s3.access-key-id
- s3.secret-access-key
- s3.endpoint, if you use an S3 compatible object store like MinIO
For GCS:
- io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO
- gcs.project-id
- gcs.oauth2.token
For ADLS:
- io-impl=org.apache.iceberg.azure.adlsv2.ADLSFileIO
- adls.auth.shared-key.account.name
- adls.auth.shared-key.account.key
--inmemory Flag whether to use the in-memory contents storage. Prefer a JDBC
storage.
--jdbc Flag whether to use the JDBC contents storage.
--jdbc-password=<password>
JDBC password used to authenticate the database access.
--jdbc-properties[=<String=String>[,<String=String>...]...]
JDBC parameters.
--jdbc-schema=<schemaCreateStrategy>
How to create the database schema. Possible values: CREATE,
DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
--jdbc-url=<url> JDBC URL of the database to connect to.
--jdbc-user=<user> JDBC user name used to authenticate the database access.
-l, --live-set-id=<liveSetId>
ID of the live content set.
-L, --read-live-set-id-from=<liveSetIdFile>
The file to read the live-set-id from.
--max-file-modification=<maxFileModificationTime>
The maximum allowed file modification time. Files newer than this
timestamp will not be deleted. Defaults to the created timestamp of
the live-content-set.
--time-zone=<zoneId> Time zone ID used to show timestamps.
Defaults to system time zone.
-V, --version Print version information and exit.
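When the data files live in an S3-compatible object store, sweep needs FileIO credentials supplied via --iceberg. A sketch assuming a MinIO endpoint and a live-set ID persisted by a previous mark-live run; all endpoint, credential, and file-name values are illustrative:

```shell
java -jar nessie-gc.jar sweep \
  --jdbc \
  --jdbc-url jdbc:postgresql://localhost:5432/nessie_gc \
  --jdbc-user pguser \
  --jdbc-password mysecretpassword \
  --read-live-set-id-from /tmp/nessie-gc-live-set-id \
  --iceberg s3.access-key-id=minioadmin,s3.secret-access-key=minioadmin,s3.endpoint=http://localhost:9000
```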
gc¶
Usage: nessie-gc.jar gc [-hV] [--[no-]defer-deletes]
[--allowed-fpp=<allowedFalsePositiveProbability>]
[-c=<defaultCutoffPolicy>] [--expected-file-count=<expectedFileCount>]
[--expiry-parallelism=<parallelism>] [--fpp=<falsePositiveProbability>]
[--identify-parallelism=<parallelism>]
[--max-file-modification=<maxFileModificationTime>]
[--nessie-api=<nessieApi>] [--nessie-client=<nessieClientName>]
[-R=<cutoffPolicyRefTime>] [--time-zone=<zoneId>] [-u=<nessieUri>]
[--write-live-set-id-to=<liveSetIdFile>] [-H=<String=String>[,
<String=String>...]]... [-I=<String=String>[,<String=String>...]]... [-C
[=<String=String>[,<String=String>...]...]]... [-o[=<String=String>[,
<String=String>...]...]]... ([--inmemory] | [[--jdbc] --jdbc-url=<url>
[--jdbc-properties[=<String=String>[,<String=String>...]...]]...
[--jdbc-user=<user>] [--jdbc-password=<password>]
[--jdbc-schema=<schemaCreateStrategy>]])
Run identify-live-content and expire-files + delete-orphan-files.
This is the same as running a 'mark-live' + a 'sweep' command, but this variant works with the
in-memory contents storage.
--allowed-fpp=<allowedFalsePositiveProbability>
The worst allowed effective false-positive-probability checked after
the files for a single content have been checked, defaults to 1.0E-4.
-c, --default-cutoff=<defaultCutoffPolicy>
Default cutoff policy. Policies can be one of:
- number of commits as an integer value
- a duration (see java.time.Duration)
- an ISO instant
- 'NONE', means everything's considered as live
-C, --cutoff[=<String=String>[,<String=String>...]...]
Cutoff policies per reference names. Supplied as a
ref-name-pattern=policy tuple.
Reference name patterns are regular expressions.
Policies can be one of:
- number of commits as an integer value
- a duration (see java.time.Duration)
- an ISO instant
- 'NONE', means everything's considered as live
--[no-]defer-deletes Identified unused/orphan files are by default immediately deleted.
Using deferred deletion stores the files to be deleted, so they can
be inspected and deleted later. This option is incompatible with
--inmemory.
--expected-file-count=<expectedFileCount>
The total number of expected live files for a single content, defaults
to 1000000.
--expiry-parallelism=<parallelism>
Number of contents that are checked in parallel.
--fpp=<falsePositiveProbability>
The false-positive-probability used to construct the bloom-filter
identifying whether a file is live, defaults to 1.0E-5.
-h, --help Show this help message and exit.
-H, --hadoop=<String=String>[,<String=String>...]
Hadoop configuration option, required when using an Iceberg FileIO
that is not S3FileIO.
The following configuration settings might be required.
For S3:
- fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
- fs.s3a.access.key
- fs.s3a.secret.key
- fs.s3a.endpoint, if you use an S3 compatible object store like MinIO
For GCS:
- fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
- fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.
GoogleHadoopFS
- fs.gs.project.id
- fs.gs.auth.type=USER_CREDENTIALS
- fs.gs.auth.client.id
- fs.gs.auth.client.secret
- fs.gs.auth.refresh.token
For ADLS:
- fs.azure.impl=org.apache.hadoop.fs.azure.AzureNativeFileSystemStore
- fs.AbstractFileSystem.azure.impl=org.apache.hadoop.fs.azurebfs.Abfs
- fs.azure.storage.emulator.account.name
- fs.azure.account.auth.type=SharedKey
- fs.azure.account.key.<account>=<base-64-encoded-secret>
-I, --iceberg=<String=String>[,<String=String>...]
Iceberg properties used to configure the FileIO.
The following properties are almost always required.
For S3:
- s3.access-key-id
- s3.secret-access-key
- s3.endpoint, if you use an S3 compatible object store like MinIO
For GCS:
- io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO
- gcs.project-id
- gcs.oauth2.token
For ADLS:
- io-impl=org.apache.iceberg.azure.adlsv2.ADLSFileIO
- adls.auth.shared-key.account.name
- adls.auth.shared-key.account.key
--identify-parallelism=<parallelism>
Number of Nessie references that can be walked in parallel.
--inmemory Flag whether to use the in-memory contents storage. Prefer a JDBC
storage.
--jdbc Flag whether to use the JDBC contents storage.
--jdbc-password=<password>
JDBC password used to authenticate the database access.
--jdbc-properties[=<String=String>[,<String=String>...]...]
JDBC parameters.
--jdbc-schema=<schemaCreateStrategy>
How to create the database schema. Possible values: CREATE,
DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
--jdbc-url=<url> JDBC URL of the database to connect to.
--jdbc-user=<user> JDBC user name used to authenticate the database access.
--max-file-modification=<maxFileModificationTime>
The maximum allowed file modification time. Files newer than this
timestamp will not be deleted. Defaults to the created timestamp of
the live-content-set.
--nessie-api=<nessieApi>
Class name of the NessieClientBuilder implementation to use, defaults
to HttpClientBuilder suitable for REST. Using this parameter is not
recommended. Prefer the --nessie-client parameter instead.
--nessie-client=<nessieClientName>
Name of the Nessie client to use, defaults to HTTP suitable for REST.
-o, --nessie-option[=<String=String>[,<String=String>...]...]
Parameters to configure the NessieClientBuilder.
-R, --cutoff-ref-time=<cutoffPolicyRefTime>
Reference timestamp for durations specified for --cutoff. Defaults to
'now'.
--time-zone=<zoneId> Time zone ID used to show timestamps.
Defaults to system time zone.
-u, --uri=<nessieUri> Nessie API endpoint URI, defaults to http://localhost:19120/api/v2.
-V, --version Print version information and exit.
--write-live-set-id-to=<liveSetIdFile>
Optional, the file name to persist the created live-set-id to.
list¶
Usage: nessie-gc.jar list [-hV] [--time-zone=<zoneId>] ([--inmemory] | [[--jdbc] --jdbc-url=<url>
[--jdbc-properties[=<String=String>[,<String=String>...]...]]...
[--jdbc-user=<user>] [--jdbc-password=<password>]
[--jdbc-schema=<schemaCreateStrategy>]])
List existing live-sets, must not be used with the in-memory contents-storage.
-h, --help Show this help message and exit.
--inmemory Flag whether to use the in-memory contents storage. Prefer a JDBC
storage.
--jdbc Flag whether to use the JDBC contents storage.
--jdbc-password=<password>
JDBC password used to authenticate the database access.
--jdbc-properties[=<String=String>[,<String=String>...]...]
JDBC parameters.
--jdbc-schema=<schemaCreateStrategy>
How to create the database schema. Possible values: CREATE,
DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
--jdbc-url=<url> JDBC URL of the database to connect to.
--jdbc-user=<user> JDBC user name used to authenticate the database access.
--time-zone=<zoneId> Time zone ID used to show timestamps.
Defaults to system time zone.
-V, --version Print version information and exit.
delete¶
Usage: nessie-gc.jar delete [-hV] [--time-zone=<zoneId>] ([--inmemory] | [[--jdbc] --jdbc-url=<url>
[--jdbc-properties[=<String=String>[,<String=String>...]...]]...
[--jdbc-user=<user>] [--jdbc-password=<password>]
[--jdbc-schema=<schemaCreateStrategy>]]) (-l=<liveSetId> |
-L=<liveSetIdFile>)
Delete a live-set, must not be used with the in-memory contents-storage.
-h, --help Show this help message and exit.
--inmemory Flag whether to use the in-memory contents storage. Prefer a JDBC
storage.
--jdbc Flag whether to use the JDBC contents storage.
--jdbc-password=<password>
JDBC password used to authenticate the database access.
--jdbc-properties[=<String=String>[,<String=String>...]...]
JDBC parameters.
--jdbc-schema=<schemaCreateStrategy>
How to create the database schema. Possible values: CREATE,
DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
--jdbc-url=<url> JDBC URL of the database to connect to.
--jdbc-user=<user> JDBC user name used to authenticate the database access.
-l, --live-set-id=<liveSetId>
ID of the live content set.
-L, --read-live-set-id-from=<liveSetIdFile>
The file to read the live-set-id from.
--time-zone=<zoneId> Time zone ID used to show timestamps.
Defaults to system time zone.
-V, --version Print version information and exit.
list-deferred¶
Usage: nessie-gc.jar list-deferred [-hV] [--time-zone=<zoneId>] ([--inmemory] | [[--jdbc]
--jdbc-url=<url> [--jdbc-properties[=<String=String>[,
<String=String>...]...]]... [--jdbc-user=<user>]
[--jdbc-password=<password>]
[--jdbc-schema=<schemaCreateStrategy>]]) (-l=<liveSetId> |
-L=<liveSetIdFile>)
List files collected as deferred deletes, must not be used with the in-memory contents-storage.
-h, --help Show this help message and exit.
--inmemory Flag whether to use the in-memory contents storage. Prefer a JDBC
storage.
--jdbc Flag whether to use the JDBC contents storage.
--jdbc-password=<password>
JDBC password used to authenticate the database access.
--jdbc-properties[=<String=String>[,<String=String>...]...]
JDBC parameters.
--jdbc-schema=<schemaCreateStrategy>
How to create the database schema. Possible values: CREATE,
DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
--jdbc-url=<url> JDBC URL of the database to connect to.
--jdbc-user=<user> JDBC user name used to authenticate the database access.
-l, --live-set-id=<liveSetId>
ID of the live content set.
-L, --read-live-set-id-from=<liveSetIdFile>
The file to read the live-set-id from.
--time-zone=<zoneId> Time zone ID used to show timestamps.
Defaults to system time zone.
-V, --version Print version information and exit.
deferred-deletes¶
Usage: nessie-gc.jar deferred-deletes [-hV] [--time-zone=<zoneId>] [-H=<String=String>[,
<String=String>...]]... [-I=<String=String>[,
<String=String>...]]... ([--inmemory] | [[--jdbc]
--jdbc-url=<url> [--jdbc-properties[=<String=String>[,
<String=String>...]...]]... [--jdbc-user=<user>]
[--jdbc-password=<password>]
[--jdbc-schema=<schemaCreateStrategy>]]) (-l=<liveSetId> |
-L=<liveSetIdFile>)
Delete files collected as deferred deletes, must not be used with the in-memory contents-storage.
-h, --help Show this help message and exit.
-H, --hadoop=<String=String>[,<String=String>...]
Hadoop configuration option, required when using an Iceberg FileIO
that is not S3FileIO.
The following configuration settings might be required.
For S3:
- fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
- fs.s3a.access.key
- fs.s3a.secret.key
- fs.s3a.endpoint, if you use an S3 compatible object store like MinIO
For GCS:
- fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
- fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.
GoogleHadoopFS
- fs.gs.project.id
- fs.gs.auth.type=USER_CREDENTIALS
- fs.gs.auth.client.id
- fs.gs.auth.client.secret
- fs.gs.auth.refresh.token
For ADLS:
- fs.azure.impl=org.apache.hadoop.fs.azure.AzureNativeFileSystemStore
- fs.AbstractFileSystem.azure.impl=org.apache.hadoop.fs.azurebfs.Abfs
- fs.azure.storage.emulator.account.name
- fs.azure.account.auth.type=SharedKey
- fs.azure.account.key.<account>=<base-64-encoded-secret>
-I, --iceberg=<String=String>[,<String=String>...]
Iceberg properties used to configure the FileIO.
The following properties are almost always required.
For S3:
- s3.access-key-id
- s3.secret-access-key
- s3.endpoint, if you use an S3 compatible object store like MinIO
For GCS:
- io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO
- gcs.project-id
- gcs.oauth2.token
For ADLS:
- io-impl=org.apache.iceberg.azure.adlsv2.ADLSFileIO
- adls.auth.shared-key.account.name
- adls.auth.shared-key.account.key
--inmemory Flag whether to use the in-memory contents storage. Prefer a JDBC
storage.
--jdbc Flag whether to use the JDBC contents storage.
--jdbc-password=<password>
JDBC password used to authenticate the database access.
--jdbc-properties[=<String=String>[,<String=String>...]...]
JDBC parameters.
--jdbc-schema=<schemaCreateStrategy>
How to create the database schema. Possible values: CREATE,
DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
--jdbc-url=<url> JDBC URL of the database to connect to.
--jdbc-user=<user> JDBC user name used to authenticate the database access.
-l, --live-set-id=<liveSetId>
ID of the live content set.
-L, --read-live-set-id-from=<liveSetIdFile>
The file to read the live-set-id from.
--time-zone=<zoneId> Time zone ID used to show timestamps.
Defaults to system time zone.
-V, --version Print version information and exit.
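Together with sweep --defer-deletes, the list-deferred and deferred-deletes commands form a two-step review flow: inspect the collected files first, then delete them. A sketch with an illustrative live-set ID:

```shell
# Inspect the files that sweep collected for deletion
java -jar nessie-gc.jar list-deferred \
  --jdbc \
  --jdbc-url jdbc:postgresql://localhost:5432/nessie_gc \
  --jdbc-user pguser \
  --jdbc-password mysecretpassword \
  --live-set-id 11111111-2222-3333-4444-555555555555

# Delete them once the list has been reviewed
java -jar nessie-gc.jar deferred-deletes \
  --jdbc \
  --jdbc-url jdbc:postgresql://localhost:5432/nessie_gc \
  --jdbc-user pguser \
  --jdbc-password mysecretpassword \
  --live-set-id 11111111-2222-3333-4444-555555555555
```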
show¶
Usage: nessie-gc.jar show [-BCDhV] [--time-zone=<zoneId>] ([--inmemory] | [[--jdbc]
--jdbc-url=<url> [--jdbc-properties[=<String=String>[,
<String=String>...]...]]... [--jdbc-user=<user>]
[--jdbc-password=<password>] [--jdbc-schema=<schemaCreateStrategy>]])
(-l=<liveSetId> | -L=<liveSetIdFile>)
Show information of a live-content-set, must not be used with the in-memory contents-storage.
-B, --with-base-locations Show base locations.
-C, --with-content-references
Show content references.
-D, --with-deferred-deletes
Show deferred deletes.
-h, --help Show this help message and exit.
--inmemory Flag whether to use the in-memory contents storage. Prefer a JDBC
storage.
--jdbc Flag whether to use the JDBC contents storage.
--jdbc-password=<password>
JDBC password used to authenticate the database access.
--jdbc-properties[=<String=String>[,<String=String>...]...]
JDBC parameters.
--jdbc-schema=<schemaCreateStrategy>
How to create the database schema. Possible values: CREATE,
DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
--jdbc-url=<url> JDBC URL of the database to connect to.
--jdbc-user=<user> JDBC user name used to authenticate the database access.
-l, --live-set-id=<liveSetId>
ID of the live content set.
-L, --read-live-set-id-from=<liveSetIdFile>
The file to read the live-set-id from.
--time-zone=<zoneId> Time zone ID used to show timestamps.
Defaults to system time zone.
-V, --version Print version information and exit.
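For example, to inspect a live-contents-set stored in a PostgreSQL repository, using only the options listed above (the JDBC URL, credentials and live-set ID below are placeholders):

```shell
java -jar nessie-gc.jar show \
  --jdbc --jdbc-url jdbc:postgresql://db.example.com:5432/nessie_gc \
  --jdbc-user gc --jdbc-password "$GC_DB_PASSWORD" \
  -l 5891f4a6-1b26-4a43-9c43-8b1e705e25b4 \
  --with-base-locations --with-content-references
```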
show-sql-create-schema-script¶
Usage: nessie-gc.jar show-sql-create-schema-script [-hV] [--output-file=<outputFile>]
Print DDL statements to create the schema.
-h, --help Show this help message and exit.
--output-file=<outputFile>
-V, --version Print version information and exit.
create-sql-schema¶
Usage: nessie-gc.jar create-sql-schema [-hV] ([--jdbc] --jdbc-url=<url> [--jdbc-properties
[=<String=String>[,<String=String>...]...]]...
[--jdbc-user=<user>] [--jdbc-password=<password>]
[--jdbc-schema=<schemaCreateStrategy>])
JDBC schema creation.
-h, --help Show this help message and exit.
--jdbc Flag whether to use the JDBC contents storage.
--jdbc-password=<password>
JDBC password used to authenticate the database access.
--jdbc-properties[=<String=String>[,<String=String>...]...]
JDBC parameters.
--jdbc-schema=<schemaCreateStrategy>
How to create the database schema. Possible values: CREATE,
DROP_AND_CREATE, CREATE_IF_NOT_EXISTS.
--jdbc-url=<url> JDBC URL of the database to connect to.
--jdbc-user=<user> JDBC user name used to authenticate the database access.
-V, --version Print version information and exit.
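For example, to initialize the schema in a PostgreSQL database before the first GC run (the JDBC URL and credentials below are placeholders):

```shell
java -jar nessie-gc.jar create-sql-schema \
  --jdbc --jdbc-url jdbc:postgresql://db.example.com:5432/nessie_gc \
  --jdbc-user gc --jdbc-password "$GC_DB_PASSWORD" \
  --jdbc-schema CREATE_IF_NOT_EXISTS
```

With `CREATE_IF_NOT_EXISTS`, the command should be safe to re-run against an already initialized database.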
completion-script¶
Usage: nessie-gc.jar completion-script [-hV] -O=<outputFile>
Extracts the command-line completion script.
-h, --help Show this help message and exit.
-O, --output-file=<outputFile>
Completion script file name.
-V, --version Print version information and exit.
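For example, to extract the script and enable command-line completion in the current Bash session:

```shell
java -jar nessie-gc.jar completion-script -O nessie-gc-completion.sh
source nessie-gc-completion.sh
```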
Nessie GC for Nessie Administrators¶
Please refer to the Garbage Collection documentation for information on how to run the Nessie GC on a regular basis in production.
Nessie GC Internals¶
The rest of this document describes the internals of the Nessie GC tool and is intended for developers who want to understand how the tool works.
The GC tool consists of a gc-base module, which contains the general base functionality to access a repository, identify the live contents, identify the live files, list the existing files and purge orphan files.
Modules that supplement the gc-base module:
- gc-iceberg implements the Iceberg table-format specific functionality.
- gc-iceberg-files implements file listing + deletion using Iceberg’s FileIO.
- gc-iceberg-mock is a testing-only module to generate mock metadata, manifest-lists, manifests and (empty) data files.
- gc-repository-jdbc implements the live-content-sets store using JDBC (PostgreSQL, MariaDB, MySQL and any other compatible database).
- s3mock is a testing-only module containing an S3 mock backend that allows listing and getting objects programmatically.
- s3minio is a JUnit 5 test extension providing a Minio-based S3 backend.
The gc-tool module is a command-line interface provided as a standalone executable: an uber jar prefixed with a shell script, which can also still be executed with java -jar ....
Basic Nessie-GC functionality¶
Nessie-GC implements a mark-and-sweep approach, a two-phase process:
The “mark phase”, or “live content identification”, walks all named references and collects references to all Content objects that are considered live. These references are stored in a repository as a “live contents set”. The “mark phase” is implemented in IdentifyLiveContents.
The “sweep phase”, or “delete orphan files”, operates per content ID. For each content, all live versions of a Content are scanned to identify the set of live data files. After that, the base location(s) are scanned and all files that are not in the set of live data files are deleted. The “sweep phase” is implemented by DefaultLocalExpire.
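Using the nessie-gc.jar tool, the two phases map to the mark-live and sweep commands. The following sketch assumes a JDBC live-contents repository; the --write-live-set-id-to option name is an assumption, mirroring the -L/--read-live-set-id-from option shown for other commands:

```shell
# “Mark” phase: identify live contents and store the live-contents-set;
# the generated live-set ID is written to a file (option name assumed).
java -jar nessie-gc.jar mark-live --jdbc --jdbc-url "$JDBC_URL" \
  --write-live-set-id-to live-set-id.txt

# “Sweep” phase: expire files + delete orphan files for that live-contents-set.
java -jar nessie-gc.jar sweep --jdbc --jdbc-url "$JDBC_URL" \
  --read-live-set-id-from live-set-id.txt
```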
Inner workings¶
To minimize the amount of data needed to match against the set of live data files for a Content, the implementation does not actually remember each individual data file, like maintaining a java.util.Set of all those data files, but records all data files in a bloom filter.
Both the “mark” (identify live contents) and “sweep” (identify and delete expired contents) phases provide configurable parallelism: both the number of concurrently scanned named references and the number of concurrently processed tables can be configured.
Mark phase optimization¶
The implementation that walks the commit logs can be configured with a VisitedDeduplicator, which is meant to reduce the work required during the “mark” phase when the commit to be examined has already been processed.
A DefaultVisitedDeduplicator implementation exists, but it likely requires too much memory at runtime, especially when the identify run is configured with multiple GC policies and/or has to walk many commits. Due to these concerns, DefaultVisitedDeduplicator is not available in the Nessie GC tool, and its use is neither supported nor recommended.
Identified live contents repository¶
It is recommended to use an external database for the Nessie GC repository, especially for big Nessie repositories.
Nessie GC runs against small-ish repositories do technically work with an in-memory repository, but, as the term “in-memory” suggests, the identified live-contents-set, its state, duration, etc. cannot be inspected afterwards.
Pluggable code¶
Different parts / functionalities are quite isolated and abstracted to allow proper unit-testability and also allow reuse of similar functionality.
Examples of abstracted/isolated functionality:
- Functionality to recursively walk a base location
- Functionality to delete files
- Nessie GC repository
- Getting all data files for a specific content reference (think: Iceberg table snapshot)
- Commit-log-scanning duplicate work elimination
File references¶
All files (or objects, in the case of an object store like S3) are described using a FileReference, consisting of a base URI plus a URI relative to that base. Noteworthy: the “sweep phase”, which “collects” all live files in a bloom filter and afterwards lists the files in all base URIs, always uses only the relative URI, never the full URI, to check whether a file is orphan or probably live (a bloom filter is a probabilistic data structure).
Since object stores are the primary target, only files, not directories, are supported. Object stores do not know about directories, and Iceberg’s FileIO does not know about directories either. For file systems that do support directories this means that empty directories will not be deleted; prematurely deleting directories could break concurrent operations.
Runtime requirements¶
Nessie GC work is dominated by network and/or disk I/O, less by CPU and heap pressure.
Memory requirements (rough estimates):
- Number of concurrent content scans (“sweep phase”) times the bloom-filter on-heap size (assume this can be a couple of MB, depending on the expected number of files and the allowed false-positive ratio).
- Duplicate-commit-log-walk elimination requires some memory for each distinct cut-off time, times the (possible) number of commits across the matching references.
- Additional memory is required for the currently processed chunks of metadata, for example Iceberg’s table-metadata and manifests, to identify the live data files. (The raw metadata is only read and processed, but not memoized.)
- An in-memory live-contents-repository (not recommended for production workloads) requires memory for all content-references.
CPU & heap pressure testing¶
Special “tests” (this and this) have been used to verify that even a huge number of objects neither makes a tiny Java heap “explode” nor uses excessive CPU resources. These “tests” simulate a Nessie repository with many references, commits, contents and files per content version. Runs of these tests under a profiler proved that the implementation requires little memory and little CPU; runtime is largely dominated by bloom-filter put and maybe-contains operations during the per-content expire runs. Both tests proved the concept.
Deferred deletion¶
The default behavior is to immediately delete orphan files, but it is also possible to record the files to be deleted and delete them later. The nessie-gc.jar tool supports deferred deletion.
Non-Nessie use cases¶
Although all the above is designed for Nessie, it is possible to reuse the core implementation with “plain” Iceberg, effectively a complete replacement of Iceberg’s expire snapshots and delete orphan files, but without Iceberg’s implicit requirement of using Spark. Things needed for this:
- A “pure Iceberg” implementation of org.projectnessie.gc.repository.RepositoryConnector:
  - Return one reference per Iceberg table, derived from the underlying Iceberg catalog.
  - Provide a commit log with one Put operation for each Iceberg snapshot.
  - (The allContents function can return an empty Stream for the “pure Iceberg” use case.)
- Existing functionality, the mark-and-sweep logic and the code in nessie-gc-iceberg and nessie-gc-iceberg-files, can be reused without any changes.
Potential future enhancements¶
Since Nessie GC keeps track of all ever-live content references and all ever-known base content locations, it is possible to identify …
- … the base content locations that are no longer used. In other words: storage of e.g. Iceberg tables that have been deleted and are no longer referenced in any live Nessie commit.
- … the content references (aka Iceberg snapshots) that are no longer used. This information can be used to no longer expose the affected e.g. Iceberg snapshots in any table metadata.
Completely unreferenced contents¶
Files of contents that are not visible from any live Nessie commit can be completely removed. Detecting this situation is not directly supported by the above approach.
The live-contents-set generated by Nessie GC’s identify phase contains all content IDs that are “live”. The Nessie server could help here (this approach is really just a thought): the GC tool would send the “live” content IDs to Nessie, and Nessie would return one content object for each content ID that is not contained in the set of “live” content IDs. Another implementation would then be responsible for inspecting the returned contents and purging the base locations in the data lake where the data files, manifests, etc. were stored.
Such an approach must not purge files for content IDs that have only recently been created.
Potential Iceberg specific enhancements¶
Nessie GC can easily identify the Iceberg snapshots, as each Nessie commit references exactly one Iceberg table snapshot. Nessie (the runtime/server) has no knowledge of whether a particular Iceberg partition-spec, sort-order or schema is used via any live Iceberg snapshot, because partition-specs, sort-orders and schemas are referenced via Iceberg manifests. Although Nessie GC can identify these three kinds of structures during the identification of live contents, the expire phase, which maps content references to files, does not respect Nessie commit order.
So a potential “expire specs/sort-orders/schemas” needs to be fully thought through, keeping in mind:
- Commit-chains are important, because IDs of partition-specs, sort-orders and schemas are assigned sequentially (a 32 bit int).
- The logically same partition-specs, sort-orders and schemas may exist with different IDs on different Nessie references.
- Partition-specs, sort-orders and schemas are only maintained in table-metadata, but referenced from “deeper” structures (manifest list, manifest files, data files).
- Is it really worth having an “expire specs/sort-orders/schemas” operation?