1
0
Fork 0
mirror of https://github.com/NixOS/nix synced 2025-07-02 05:11:47 +02:00

Document store object content addressing & improve JSON format

The JSON format no longer uses the legacy ATerm `r:` prefixing nonsese,
but separate fields.

Progress on #9866

Co-authored-by: Robert Hensing <roberth@users.noreply.github.com>
This commit is contained in:
John Ericson 2024-04-09 17:07:39 -04:00
parent ba2911b03b
commit 1c75af969a
21 changed files with 268 additions and 65 deletions

View file

@ -1,7 +1,9 @@
# Content-Addressing File System Objects
For many operations, Nix needs to calculate [a content addresses](@docroot@/glossary.md#gloss-content-address) of [a file system object][file system object].
Usually this is needed as part of content addressing [store objects], since store objects always have a root file system object.
Usually this is needed as part of
[content addressing store objects](../store-object/content-address.md),
since store objects always have a root file system object.
But some command-line utilities also just work on "raw" file system objects, not part of any store object.
Every content addressing scheme Nix uses ultimately involves feeding data into a [hash function](https://en.wikipedia.org/wiki/Hash_function), and getting back an opaque fixed-size digest which is deemed a content address.
@ -18,6 +20,9 @@ A single file object can just be hashed by its contents.
This is not enough information to encode the fact that the file system object is a file,
but if we *already* know that the FSO is a single non-executable file by other means, it is sufficient.
Because the hashed data is just the raw file, as is, this choice is good for compatibility with other systems.
For example, Unix commands like `sha256sum` or `sha1sum` will produce hashes for single files that match this.
### Nix Archive (NAR) { #serial-nix-archive }
For the other cases of [file system objects][file system object], especially directories with arbitrary descendents, we need a more complex serialisation format.
@ -69,7 +74,7 @@ every non-directory object is owned by a parent directory, and the entry that re
However, if the root object is not a directory, then we have no way of knowing which one of an executable file, non-executable file, or symlink it is supposed to be.
In response to this, we have decided to treat a bare file as non-executable file.
This is similar to do what we do with [flat serialisation](#flat), which also lacks this information.
This is similar to do what we do with [flat serialisation](#serial-flat), which also lacks this information.
To avoid an address collision, attempts to hash a bare executable file or symlink will result in an error (just as would happen for flat serialisation also).
Thus, Git can encode some, but not all of Nix's "File System Objects", and this sort of content-addressing is likewise partial.

View file

@ -0,0 +1,123 @@
# Content-Addressing Store Objects
Just [like][fso-ca] [File System Objects][File System Object],
[Store Objects][Store Object] can also be [content-addressed](@docroot@/glossary.md#gloss-content-addressed),
unless they are [input-addressed](@docroot@/glossary.md#gloss-input-addressed-store-object).
For store objects, the content address we produce will take the form of a [Store Path] rather than regular hash.
In particular, the content-addressing scheme will ensure that the digest of the store path is solely computed from the
- file system object graph (the root one and its children, if it has any)
- references
- [store directory](../store-path.md#store-directory)
- name
of the store object, and not any other information, which would not be an intrinsic property of that store object.
For the full specification of the algorithms involved, see the [specification of store path digests][sp-spec].
[File System Object]: ../file-system-object.md
[Store Object]: ../store-object.md
[Store Path]: ../store-path.md
## Content addressing each part of a store object
### File System Objects
With all currently supported store object content addressing methods, the file system object is always [content-addressed][fso-ca] first, and then that hash is incorporated into content address computation for the store object.
### References
With all currently supported store object content addressing methods,
other objects are referred to by their regular (string-encoded-) [store paths][Store Path].
Self-references however cannot be referred to by their path, because we are in the midst of describing how to compute that path!
> The alternative would require finding as hash function fixed point, i.e. the solution to an equation in the form
> ```
> digest = hash(..... || digest || ....)
> ```
> which is computationally infeasible.
> As far as we know, this is equivalent to finding a hash collision.
Instead we just have a "has self reference" boolean, which will end up affecting the digest.
### Name and Store Directory
These two items affect the digest in a way that is standard for store path digest computations and not specific to content-addressing.
Consult the [specification of store path digests][sp-spec] for further details.
## Content addressing Methods
For historical reasons, we don't support all features in all combinations.
Each currently supported method of content addressing chooses a single method of file system object hashing, and may offer some restrictions on references.
The names and store directories are unrestricted however.
### Flat { #method-flat }
This uses the corresponding [Flat](../file-system-object/content-address.md#serial-flat) method of file system object content addressing.
References are not supported: store objects with flat hashing *and* references can not be created.
### Text { #method-text }
This also uses the corresponding [Flat](../file-system-object/content-address.md#serial-flat) method of file system object content addressing.
References to other store objects are supported, but self references are not.
This is the only store-object content-addressing method that is not named identically with a corresponding file system object method.
It is somewhat obscure, mainly used for "drv files"
(derivations serialized as store objects in their ["ATerm" file format](@docroot@/protocols/derivation-aterm.md)).
Prefer another method if possible.
### Nix Archive { #method-nix-archive }
This uses the corresponding [Nix Archive](../file-system-object/content-address.md#serial-nix-archive) method of file system object content addressing.
References (to other store objects and self references alike) are supported so long as the hash algorithm is SHA-256, but not (neither kind) otherwise.
### Git { #method-git }
> **Warning**
>
> This method is part of the [`git-hashing`][xp-feature-git-hashing] experimental feature.
This uses the corresponding [Git](../file-system-object/content-address.md#serial-git) method of file system object content addressing.
References are not supported.
Only SHA-1 is supported at this time.
If [SHA-256-based Git](https://git-scm.com/docs/hash-function-transition)
becomes more widespread, this restriction will be revisited.
### Reproducibility
The above system is more complex than it needs to be to support all types of file system objects and references, owing to accretion of features over time.
However, there's a lot of value in supporting old expressions and reproducing the same hashes with any version of Nix.
Still, the fundamental property remains that if one knows how a store object is supposed to be hashed
--- all the non-Hash, non-references information above
--- one can recompute a store object's store path just from that metadata and its content proper (its references and file system objects).
Collectively, we can call this information the "content address method".
By storing the "Content address method" extra information as part of store object
--- making it data not metadata
--- we achieve the key property of making content-addressed store objects *trustless*.
What this is means is that they are just plain old data, not containing any "claim" that could be false.
All this information is free to vary, and if any of it varies one gets (ignoring the possibility of hash collisions, as usual) a different store path.
Store paths referring to content-addressed store objects uniquely identify a store object, and given that object, one can recompute the store path.
Any content-addressed store object purporting to be the referee of a store object can be readily verified to see whether it in fact does without any extra information.
No other party claiming a store object corresponds to a store path need be trusted because this verification can be done instead.
Content addressing currently is used when adding data like source code to the store.
Such data are "basal inputs", not produced from any other derivation (to our knowledge).
Content addressing is thus the only way to address them of our two options.
([Input addressing](@docroot@/glossary.md#gloss-input-addressed-store-object), is only valid for store paths produced from derivations.)
Additionally, content addressing is also used for the outputs of certain sorts of derivations.
It is very nice to be able to uniformly content-address all data rather than rely on a mix of content addressing and input addressing.
This however, is in some cases still experimental, so in practice input addressing is still (as of 2022) widely used.
[fso-ca]: ../file-system-object/content-address.md
[sp-spec]: @docroot@/protocols/store-path.md
[xp-feature-git-hashing]: @docroot@/contributing/experimental-features.md#xp-feature-git-hashing