Management Services¶
Nessie can, and needs to, manage several operations within your data lake. Each management service can be scheduled, and Nessie reports the outcome of each scheduled operation. Scheduled operations require that Nessie have access to a Spark cluster, as many of them are distributed compute operations.
Garbage Collection¶
Since Nessie is maintaining many versions of metadata and data-pointers simultaneously, you must rely on Nessie to clean up old data. Nessie calls this garbage collection.
There are at least two steps to a garbage collection action. The first steps only identify what can be collected and are non-destructive; the last step is destructive.
Info
Currently the GC algorithm only works for Iceberg tables and with DynamoDB as the backend store.
Identify Unreferenced Assets¶
This is a Spark job which should be run periodically to identify assets that are no longer referenced. Assets are defined as the set of files, records, entries etc. that make up a table, view or other Nessie object. For example, Iceberg assets are:

- manifest files
- manifest lists
- data files
- metadata files
- the entire table directory on disk (if it is empty)
To be marked as unreferenced, an asset must be either:

1. No longer referenced by any branch or tag. For example, an entire branch was deleted, and a table on that branch is no longer accessible.
2. Created in a commit which has passed the (configurable) commit age and not referenced by any newer commits.
Identifying unreferenced assets is a non-destructive action. The result of the Spark job is a Spark DataFrame of all the unreferenced assets. This DataFrame is stored in an Iceberg table managed by Nessie at a configurable key. The table is referenceable in Nessie and can therefore be examined via Spark or any other Nessie/Iceberg compatible engine.
This action is designed to run concurrently with normal operational workloads and should be run regularly. This table is used as input into the destructive GC operation described below.
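Because the output lands in a regular Nessie/Iceberg table, it can be inspected with Spark before anything is deleted. A minimal sketch, assuming the identify job was configured to write to a hypothetical key gc.unreferenced_assets in a Spark catalog named nessie:

// Inspect the identified assets; the catalog and table names are illustrative and
// should match whatever key the identify job was configured with.
spark.sql("SELECT * FROM nessie.gc.unreferenced_assets").show();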
Configuration and running¶
The action is configured via three inputs: a GcActionsConfig actionsConfig, a GcOptions gcConfig and a TableIdentifier table. The relevant configuration items are:
parameter | default value | description |
---|---|---|
table | null | The Iceberg TableIdentifier to which the unreferenced assets should be written |
GcOptions.getBloomFilterCapacity | 10000000 | Size (number of items) of bloom filter for identification of referenced values |
GcOptions.getMaxAgeMicros | 7 days | age at which a commit starts to expire |
GcOptions.getTimeSlopMicros | 1 day | minimum age a value can be before it will be considered expired |
GcActionsConfig.getDynamoRegion | provider default | AWS Region of the Nessie DynamoDB |
GcActionsConfig.getDynamoEndpoint | provider default | Custom AWS endpoint of the Nessie DynamoDB |
GcActionsConfig.getStoreType | DYNAMO | Backend store type; DYNAMO is currently the only backend which supports GC |
Running the action can be done simply by:
GcActionsConfig actionsConfig = GcActionsConfig.builder().build(); // use all defaults
GcOptions gcOptions = GcOptions.builder().build(); // use all defaults
GcActions actions = new GcActions.Builder(spark)
    .setActionsConfig(actionsConfig)
    .setGcOptions(gcOptions)
    .setTable(TABLE_IDENTIFIER)
    .build(); // build the action against the active Spark session and the target Iceberg table
Dataset<Row> unreferencedAssets = actions.identifyUnreferencedAssets(); // non-destructive identification of unreferenced assets
actions.updateUnreferencedAssetTable(unreferencedAssets); // persist the result into the Iceberg table for use by the delete step
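The defaults listed above can be overridden before building the action. The following is a hedged sketch; it assumes the GcActionsConfig and GcOptions builders expose setters that mirror the getters in the configuration table (for example dynamoRegion or maxAgeMicros), so verify the exact method names against your Nessie version:

import java.util.concurrent.TimeUnit;

// Hypothetical overrides of the defaults from the configuration table above.
GcActionsConfig customActionsConfig = GcActionsConfig.builder()
    .dynamoRegion("us-west-2") // AWS region of the Nessie DynamoDB
    .build();
GcOptions customGcOptions = GcOptions.builder()
    .bloomFilterCapacity(50_000_000) // larger bloom filter for large histories
    .maxAgeMicros(TimeUnit.DAYS.toMicros(14)) // commits start to expire after 14 days
    .build();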
Delete Unreferenced Assets¶
The destructive garbage collection step is also a Spark job and takes as input the table that has been built above. This job is modelled as an Iceberg Action and has a similar API to the other Iceberg Actions. In the future it will be registered with Iceberg’s Action APIs and callable via Iceberg’s custom SQL statements.
This Iceberg Action looks at the table generated by the Identify step and counts the number of times a distinct asset has been seen: effectively it performs a group-by and count on this table. If the count for an asset is over a specified threshold AND it was seen in the last run of the Identify stage, it is collectable. The asset is then deleted permanently. A report table of deleted objects is returned to the user, and either the processed records are removed from the ‘identify’ table or the whole table is purged.
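Conceptually, the collectability check can be pictured as a query over the unreferenced-assets table. The sketch below is purely illustrative and uses hypothetical table and column names; the action performs the equivalent grouping internally:

// Group the identify results by asset and keep only assets flagged at least
// `seenCount` times; table and column names are hypothetical.
Dataset<Row> collectable = spark.sql(
    "SELECT asset, COUNT(*) AS times_seen, MAX(run_id) AS last_seen_run "
        + "FROM nessie.gc.unreferenced_assets "
        + "GROUP BY asset "
        + "HAVING COUNT(*) >= 10");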
Configuration and running¶
The relevant configuration items are:
parameter | default value | description |
---|---|---|
seenCount | 10 | How many times an asset has been seen as unreferenced in order to be considered for deletion |
deleteOnPurge | true | Delete records from the underlying Iceberg table of unreferenced assets |
dropGcTable | true | Drop the underlying Iceberg table, or attempt to clean only the processed rows |
table | null | The Iceberg Table which stores the list of unreferenced assets |
Running the action can be done simply by:
Table table = catalog.loadTable(TABLE_IDENTIFIER);
GcTableCleanAction.GcTableCleanResult result = new GcTableCleanAction(table, spark)
    .dropGcTable(true)
    .deleteCountThreshold(2)
    .deleteOnPurge(true)
    .execute();
The action requires the TABLE_IDENTIFIER which points to the unreferenced assets table, as well as an active Spark session and a Nessie-owned Catalog. The result object above returns the number of files the action tried to delete and the number of deletions that failed.
Note
You can follow along with an interactive demo in a Jupyter Notebook via Binder.
Internal Garbage Collection¶
Currently, garbage collection only applies to the values and assets referenced by a Nessie database; the internal records of the Nessie database itself are not cleaned up, so unreferenced objects in Nessie’s internal store are retained indefinitely. A future release will also clean up internal Nessie records if they are unreferenced.
Time-based AutoTagging¶
Info
This service is currently in progress and is not yet included in a released version of Nessie.
Nessie works against data based on a commit timeline. In many situations, it is useful to capture historical versions of data for analysis or comparison purposes. As such, you can configure Nessie to AutoTag (and auto-delete) using a timestamp based naming scheme. When enabled, Nessie will automatically generate and maintain tags based on time so that users can refer to historical data using timestamps as opposed to commits. This also works hand-in-hand with the Nessie garbage collection process by ensuring that older data is “referenced” and thus available for historical analysis.
Currently, there is one AutoTagging policy. By default, it creates the following tags:
- Hourly tags for the last 25 hours
- Daily tags for the last 8 days
- Weekly tags for the last 6 weeks
- Monthly tags for the last 13 months
- Yearly tags for the last 3 years
Tags are automatically named using a date/ prefix and a zero-extended, underscore-based naming scheme. For example, date/2019_09_07_15_50 would be the tag for September 7, 2019 at 3:50pm.
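The tag names follow standard zero-padded date formatting. A small illustrative sketch of how such a name can be derived (Nessie generates these tags itself; this is not part of its API):

import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Zero-extended, underscore-based tag name with the date/ prefix.
DateTimeFormatter tagFormat = DateTimeFormatter.ofPattern("'date/'yyyy_MM_dd_HH_mm");
ZonedDateTime commitTime = ZonedDateTime.of(2019, 9, 7, 15, 50, 0, 0, ZoneOffset.UTC);
String tagName = tagFormat.format(commitTime); // date/2019_09_07_15_50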
Warning
AutoTags are automatically deleted once the policy rolls over. As such, if retention is desired past roll-over, manual tags should be created.
AutoTagging is currently done based on the UTC roll-over of each item.
Manifest Reorganization¶
Info
This service is currently in progress and is not yet included in a released version of Nessie.
Rewrites the manifests associated with a table so that manifest files are organized around partitions. This extends the ideas in the Iceberg RewriteManifestsAction.
Note
Manifest reorganization will show up as a commit, like any other table operation.
Key configuration parameters:
Name | Default | Meaning |
---|---|---|
effort | medium | How much rewriting is allowed to achieve the goals |
target manifest size | 8 MB | Target size of each rewritten manifest file |
partition priority | medium | How important it is to achieve partition-oriented manifests |
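Until this service ships, a comparable rewrite can be run manually through Iceberg’s existing action. A hedged sketch, assuming an Iceberg 0.11-style Actions entry point (newer Iceberg versions expose the same functionality through SparkActions) and an existing Nessie-backed catalog plus the identifier of the table to reorganize:

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.Actions;

// Rewrite manifests so they are better clustered; only touch manifests smaller
// than the 8 MB target from the table above.
Table table = catalog.loadTable(tableIdentifier);
Actions.forTable(table)
    .rewriteManifests()
    .rewriteIf(manifest -> manifest.length() < 8L * 1024 * 1024)
    .execute();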
Compaction¶
Info
This service is currently in progress and is not yet included in a released version of Nessie.
Because operations against table formats are done at the file level, a table can start to accumulate many small files, which slow down consumption. As such, Nessie can automatically run jobs to compact tables to ensure a consistent level of performance.
Name | Default | Meaning |
---|---|---|
Maximum Small Files | 10.0 | Maximum number of small files as a ratio to large files |
Maximum Delete Files | 10.0 | Maximum number of delete tombstones as a ratio to other files before merging the tombstones into a consolidated file |
Small File Size | 100 MB | Files at or below this size are considered small |
Target Rewrite Size | 256 MB | The target size for splittable units when rewriting data |
Note
Compaction will show up as a commit, like any other table operation.
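Until the managed service is available, an equivalent compaction pass can be triggered manually with Iceberg’s existing rewrite action. A hedged sketch, again assuming an Iceberg 0.11-style Actions entry point and an existing Nessie-backed catalog plus the identifier of the table to compact:

import org.apache.iceberg.Table;
import org.apache.iceberg.actions.Actions;

// Compact small files into splittable units close to the 256 MB target above.
// This produces a commit on the current Nessie branch like any other table operation.
Table table = catalog.loadTable(tableIdentifier);
Actions.forTable(table)
    .rewriteDataFiles()
    .targetSizeInBytes(256L * 1024 * 1024)
    .execute();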