From c868d8e91fee01999eb34c5559250ea059293b33 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Tue, 6 Sep 2022 19:01:34 -0400 Subject: [PATCH 1/4] parse_object(): allow skipping hash check The parse_object() function checks the object hash of any object it parses. This is a nice feature, as it means we may catch bit corruption during normal use, rather than waiting for specific fsck operations. But it also can be slow. It's particularly noticeable for blobs, where except for the hash check, we could return without loading the object contents at all. Now one may wonder what is the point of calling parse_object() on a blob in the first place then, but usually it's not intentional: we were fed an oid from somewhere, don't know the type, and want an object struct. For commits and trees, the parsing is usually helpful; we're about to look at the contents anyway. But this is less true for blobs, where we may be collecting them as part of a reachability traversal, etc, and don't actually care what's in them. And blobs, of course, tend to be larger. We don't want to just throw out the hash-checks for blobs, though. We do depend on them in some circumstances (e.g., rev-list --verify-objects uses parse_object() to check them). It's only the callers that know how they're going to use the result. And so we can help them by providing a special flag to skip the hash check. We could just apply this to blobs, as they're going to be the main source of performance improvement. But if a caller doesn't care about checking the hash, we might as well skip it for other object types, too. Even though we can't avoid reading the object contents, we can still skip the actual hash computation. If this seems like it is making Git a little bit less safe against corruption, it may be. But it's part of a series of tradeoffs we're already making. For instance, "rev-list --objects" does not open the contents of blobs it prints. And when a commit graph is present, we skip opening most commits entirely. The important thing will be to use this flag in cases where it's safe to skip the check. For instance, when serving a pack for a fetch, we know the client will fully index the objects and do a connectivity check itself. There's little to be gained from the server side re-hashing a blob itself. And indeed, most of the time we don't! The revision machinery won't open up a blob reached by traversal, but only one requested directly with a "want" line. So applied properly, this new feature shouldn't make anything less safe in practice. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- object.c | 15 ++++++++++++--- object.h | 6 ++++++ 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/object.c b/object.c index 588b8156f1..8f6de09078 100644 --- a/object.c +++ b/object.c @@ -263,8 +263,11 @@ struct object *parse_object_or_die(const struct object_id *oid, die(_("unable to parse object: %s"), name ? name : oid_to_hex(oid)); } -struct object *parse_object(struct repository *r, const struct object_id *oid) +struct object *parse_object_with_flags(struct repository *r, + const struct object_id *oid, + enum parse_object_flags flags) { + int skip_hash = !!(flags & PARSE_OBJECT_SKIP_HASH_CHECK); unsigned long size; enum object_type type; int eaten; @@ -279,7 +282,7 @@ struct object *parse_object(struct repository *r, const struct object_id *oid) if ((obj && obj->type == OBJ_BLOB && repo_has_object_file(r, oid)) || (!obj && repo_has_object_file(r, oid) && oid_object_info(r, oid, NULL) == OBJ_BLOB)) { - if (stream_object_signature(r, repl) < 0) { + if (!skip_hash && stream_object_signature(r, repl) < 0) { error(_("hash mismatch %s"), oid_to_hex(oid)); return NULL; } @@ -289,7 +292,8 @@ struct object *parse_object(struct repository *r, const struct object_id *oid) buffer = repo_read_object_file(r, oid, &type, &size); if (buffer) { - if (check_object_signature(r, repl, buffer, size, type) < 0) { + if (!skip_hash && + check_object_signature(r, repl, buffer, size, type) < 0) { free(buffer); error(_("hash mismatch %s"), oid_to_hex(repl)); return NULL; @@ -304,6 +308,11 @@ struct object *parse_object(struct repository *r, const struct object_id *oid) return NULL; } +struct object *parse_object(struct repository *r, const struct object_id *oid) +{ + return parse_object_with_flags(r, oid, 0); +} + struct object_list *object_list_insert(struct object *item, struct object_list **list_p) { diff --git a/object.h b/object.h index 9caef89f1f..31ebe11458 100644 --- a/object.h +++ b/object.h @@ -128,7 +128,13 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet); * * Returns NULL if the object is missing or corrupt. */ +enum parse_object_flags { + PARSE_OBJECT_SKIP_HASH_CHECK = 1 << 0, +}; struct object *parse_object(struct repository *r, const struct object_id *oid); +struct object *parse_object_with_flags(struct repository *r, + const struct object_id *oid, + enum parse_object_flags flags); /* * Like parse_object, but will die() instead of returning NULL. If the From 0bc25579511c61ab36d944fc88207b470c0a34d5 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Tue, 6 Sep 2022 19:05:42 -0400 Subject: [PATCH 2/4] upload-pack: skip parse-object re-hashing of "want" objects MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Imagine we have a history with commit C pointing to a large blob B. If a client asks us for C, we can generally serve both objects to them without accessing the uncompressed contents of B. In upload-pack, we figure out which commits we have and what the client has, and feed those tips to pack-objects. In pack-objects, we traverse the commits and trees (or use bitmaps!) to find the set of objects needed, but we never open up B. When we serve it to the client, we can often pass the compressed bytes directly from the on-disk packfile over the wire. But if a client asks us directly for B, perhaps because they are doing an on-demand fetch to fill in the missing blob of a partial clone, we end up much slower. Upload-pack calls parse_object() on the oid we receive, which opens up the object and re-checks its hash (even though if it were a commit, we might skip this parse entirely in favor of the commit graph!). And then we feed the oid directly to pack-objects, which again calls parse_object() and opens the object. And then finally, when we write out the result, we may send bytes straight from disk, but only after having unnecessarily uncompressed and computed the sha1 of the object twice! This patch teaches both code paths to use the new SKIP_HASH_CHECK flag for parse_object(). You can see the speed-up in p5600, which does a blob:none clone followed by a checkout. The savings for git.git are modest: Test HEAD^ HEAD ---------------------------------------------------------------------- 5600.3: checkout of result 2.23(4.19+0.24) 1.72(3.79+0.18) -22.9% But the savings scale with the number of bytes. So on a repository like linux.git with more files, we see more improvement (in both absolute and relative numbers): Test HEAD^ HEAD ---------------------------------------------------------------------------- 5600.3: checkout of result 51.62(77.26+2.76) 34.86(61.41+2.63) -32.5% And here's an even more extreme case. This is the android gradle-plugin repository, whose tip checkout has ~3.7GB of files: Test HEAD^ HEAD -------------------------------------------------------------------------- 5600.3: checkout of result 79.51(90.84+5.55) 40.28(51.88+5.67) -49.3% Keep in mind that these timings are of the whole checkout operation. So they count the client indexing the pack and actually writing out the files. If we want to see just the server's view, we can hack up the GIT_TRACE_PACKET output from those operations and replay it via upload-pack. For the gradle example, that gives me: Benchmark 1: GIT_PROTOCOL=version=2 git.old upload-pack ../gradle-plugin Signed-off-by: Junio C Hamano --- revision.c | 4 +++- t/t1450-fsck.sh | 20 ++++++++++++++++++++ upload-pack.c | 3 ++- 3 files changed, 25 insertions(+), 2 deletions(-) diff --git a/revision.c b/revision.c index ee702e498a..786e090785 100644 --- a/revision.c +++ b/revision.c @@ -384,7 +384,9 @@ static struct object *get_reference(struct rev_info *revs, const char *name, if (commit) object = &commit->object; else - object = parse_object(revs->repo, oid); + object = parse_object_with_flags(revs->repo, oid, + revs->verify_objects ? 0 : + PARSE_OBJECT_SKIP_HASH_CHECK); if (!object) { if (revs->ignore_missing) diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh index 53c2aa10b7..6410eff4e0 100755 --- a/t/t1450-fsck.sh +++ b/t/t1450-fsck.sh @@ -507,6 +507,26 @@ test_expect_success 'rev-list --verify-objects with bad sha1' ' test_i18ngrep -q "error: hash mismatch $(dirname $new)$(test_oid ff_2)" out ' +# An actual bit corruption is more likely than swapped commits, but +# this provides an easy way to have commits which don't match their purported +# hashes, but which aren't so broken we can't read them at all. +test_expect_success 'rev-list --verify-objects notices swapped commits' ' + git init swapped-commits && + ( + cd swapped-commits && + test_commit one && + test_commit two && + one_oid=$(git rev-parse HEAD) && + two_oid=$(git rev-parse HEAD^) && + one=.git/objects/$(test_oid_to_path $one_oid) && + two=.git/objects/$(test_oid_to_path $two_oid) && + mv $one tmp && + mv $two $one && + mv tmp $two && + test_must_fail git rev-list --verify-objects HEAD + ) +' + test_expect_success 'force fsck to ignore double author' ' git cat-file commit HEAD >basis && sed "s/^author .*/&,&/" multiple-authors && diff --git a/upload-pack.c b/upload-pack.c index b217a1f469..4bacdf087c 100644 --- a/upload-pack.c +++ b/upload-pack.c @@ -1420,7 +1420,8 @@ static int parse_want(struct packet_writer *writer, const char *line, if (commit) o = &commit->object; else - o = parse_object(the_repository, &oid); + o = parse_object_with_flags(the_repository, &oid, + PARSE_OBJECT_SKIP_HASH_CHECK); if (!o) { packet_writer_error(writer, From 9a8c3c4a5f94aaed54ae26e4796c08ca9a1cbfbc Mon Sep 17 00:00:00 2001 From: Jeff King Date: Tue, 6 Sep 2022 19:06:25 -0400 Subject: [PATCH 3/4] parse_object(): check commit-graph when skip_hash set If the caller told us that they don't care about us checking the object hash, then we're free to implement any optimizations that get us the parsed value more quickly. An obvious one is to check the commit graph before loading an object from disk. And in fact, both of the callers who pass in this flag are already doing so before they call parse_object()! So we can simplify those callers, as well as any possible future ones, by moving the logic into parse_object(). There are two subtle things to note in the diff, but neither has any impact in practice: - it seems least-surprising here to do the graph lookup on the git-replace'd oid, rather than the original. This is in theory a change of behavior from the earlier code, as neither caller did a replace lookup itself. But in practice it doesn't matter, as we disable the commit graph entirely if there are any replace refs. - the caller in get_reference() passes the skip_hash flag only if revs->verify_objects isn't set, whereas it would look in the commit graph unconditionally. In practice this should not matter as we should disable the commit graph entirely when using verify_objects (and that was done recently in another patch). So this should be a pure cleanup with no behavior change. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- object.c | 6 ++++++ revision.c | 16 +++------------- upload-pack.c | 9 ++------- 3 files changed, 11 insertions(+), 20 deletions(-) diff --git a/object.c b/object.c index 8f6de09078..2e4589bae5 100644 --- a/object.c +++ b/object.c @@ -279,6 +279,12 @@ struct object *parse_object_with_flags(struct repository *r, if (obj && obj->parsed) return obj; + if (skip_hash) { + struct commit *commit = lookup_commit_in_graph(r, repl); + if (commit) + return &commit->object; + } + if ((obj && obj->type == OBJ_BLOB && repo_has_object_file(r, oid)) || (!obj && repo_has_object_file(r, oid) && oid_object_info(r, oid, NULL) == OBJ_BLOB)) { diff --git a/revision.c b/revision.c index 786e090785..a04ebd6139 100644 --- a/revision.c +++ b/revision.c @@ -373,20 +373,10 @@ static struct object *get_reference(struct rev_info *revs, const char *name, unsigned int flags) { struct object *object; - struct commit *commit; - /* - * If the repository has commit graphs, we try to opportunistically - * look up the object ID in those graphs. Like this, we can avoid - * parsing commit data from disk. - */ - commit = lookup_commit_in_graph(revs->repo, oid); - if (commit) - object = &commit->object; - else - object = parse_object_with_flags(revs->repo, oid, - revs->verify_objects ? 0 : - PARSE_OBJECT_SKIP_HASH_CHECK); + object = parse_object_with_flags(revs->repo, oid, + revs->verify_objects ? 0 : + PARSE_OBJECT_SKIP_HASH_CHECK); if (!object) { if (revs->ignore_missing) diff --git a/upload-pack.c b/upload-pack.c index 4bacdf087c..35fe1a3dbb 100644 --- a/upload-pack.c +++ b/upload-pack.c @@ -1409,19 +1409,14 @@ static int parse_want(struct packet_writer *writer, const char *line, const char *arg; if (skip_prefix(line, "want ", &arg)) { struct object_id oid; - struct commit *commit; struct object *o; if (get_oid_hex(arg, &oid)) die("git upload-pack: protocol error, " "expected to get oid, not '%s'", line); - commit = lookup_commit_in_graph(the_repository, &oid); - if (commit) - o = &commit->object; - else - o = parse_object_with_flags(the_repository, &oid, - PARSE_OBJECT_SKIP_HASH_CHECK); + o = parse_object_with_flags(the_repository, &oid, + PARSE_OBJECT_SKIP_HASH_CHECK); if (!o) { packet_writer_error(writer, From 945ed0095749a12a5eea03d3ec9512516d2f2b85 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Wed, 7 Sep 2022 17:02:20 -0400 Subject: [PATCH 4/4] t1060: check partial clone of misnamed blob A recent commit (upload-pack: skip parse-object re-hashing of "want" objects, 2022-09-06) loosened the behavior of upload-pack so that it does not verify the sha1 of objects it receives directly via "want" requests. The existing corruption tests in t1060 aren't affected by this: the corruptions are blobs reachable from commits, and the client requests the commits. The more interesting case here is a partial clone, where the client will directly ask for the corrupted blob when it does an on-demand fetch of the filtered object. And that is not covered at all, so let's add a test. It's important here that we use the "misnamed" corruption and not "bit-error". The latter is sufficiently corrupted that upload-pack cannot even figure out the type of the object, so it bails identically both before and after the recent change. But with "misnamed", with the hash-checks enabled it sees the problem (though the error messages are a bit confusing because of the inability to create a "struct object" to store the flags): error: hash mismatch d95f3ad14dee633a758d2e331151e950dd13e4ed fatal: git upload-pack: not our ref d95f3ad14dee633a758d2e331151e950dd13e4ed fatal: remote error: upload-pack: not our ref d95f3ad14dee633a758d2e331151e950dd13e4ed After the change to skip the hash check, the server side happily sends the bogus object, but the client correctly realizes that it did not get the necessary data: remote: Enumerating objects: 1, done. remote: Counting objects: 100% (1/1), done. remote: Total 1 (delta 0), reused 0 (delta 0), pack-reused 0 Receiving objects: 100% (1/1), 49 bytes | 49.00 KiB/s, done. fatal: bad revision 'd95f3ad14dee633a758d2e331151e950dd13e4ed' error: [...]/misnamed did not send all necessary objects which is exactly what we expect to happen. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- t/t1060-object-corruption.sh | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/t/t1060-object-corruption.sh b/t/t1060-object-corruption.sh index 5b8e47e346..35261afc9d 100755 --- a/t/t1060-object-corruption.sh +++ b/t/t1060-object-corruption.sh @@ -139,4 +139,11 @@ test_expect_success 'internal tree objects are not "missing"' ' ) ' +test_expect_success 'partial clone of corrupted repository' ' + test_config -C misnamed uploadpack.allowFilter true && + git clone --no-local --no-checkout --filter=blob:none \ + misnamed corrupt-partial && \ + test_must_fail git -C corrupt-partial checkout --force +' + test_done