Commit Graph

747 Commits

Author SHA1 Message Date
Junio C Hamano
8d6413a1be Merge branch 'ps/refname-avail-check-optim'
The code paths to check whether a refname X is available (by seeing
if another ref X/Y exists, etc.) have been optimized.
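
The conflict rule behind this check (a refname X is unavailable when a ref X/Y exists, or when X would itself sit underneath an existing ref) can be sketched over an in-memory list. This is only an illustration under simplifying assumptions: `toy_refname_available()` and `is_dir_prefix()` are hypothetical names, not Git functions, and the real code walks ref iterators rather than an array.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if `b` lives "inside" `a`, i.e. `b` starts with `a` followed
 * by a '/'. For example "refs/heads/foo" contains "refs/heads/foo/bar".
 */
static int is_dir_prefix(const char *a, const char *b)
{
    size_t len = strlen(a);

    return strncmp(a, b, len) == 0 && b[len] == '/';
}

/*
 * Hypothetical helper (not Git's API): return 1 if `refname` can be
 * created next to the `nr` names in `existing`, 0 if it conflicts.
 */
int toy_refname_available(const char *refname,
                          const char *const *existing, size_t nr)
{
    for (size_t i = 0; i < nr; i++) {
        if (is_dir_prefix(existing[i], refname) ||
            is_dir_prefix(refname, existing[i]))
            return 0;
    }
    return 1;
}
```

With "refs/heads/foo" already present, "refs/heads/foo/bar" is rejected while "refs/heads/foobar" is accepted, because only a '/' boundary creates a conflict.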

* ps/refname-avail-check-optim:
  refs: reuse iterators when determining refname availability
  refs/iterator: implement seeking for files iterators
  refs/iterator: implement seeking for packed-ref iterators
  refs/iterator: implement seeking for ref-cache iterators
  refs/iterator: implement seeking for reftable iterators
  refs/iterator: implement seeking for merged iterators
  refs/iterator: provide infrastructure to re-seek iterators
  refs/iterator: separate lifecycle from iteration
  refs: stop re-verifying common prefixes for availability
  refs/files: batch refname availability checks for initial transactions
  refs/files: batch refname availability checks for normal transactions
  refs/reftable: batch refname availability checks
  refs: introduce function to batch refname availability checks
  builtin/update-ref: skip ambiguity checks when parsing object IDs
  object-name: allow skipping ambiguity checks in `get_oid()` family
  object-name: introduce `repo_get_oid_with_flags()`
2025-03-29 16:39:07 +09:00
Junio C Hamano
de35b7b3ff Merge branch 'sj/ref-consistency-checks-more'
"git fsck" becomes more careful when checking the refs.

* sj/ref-consistency-checks-more:
  builtin/fsck: add `git refs verify` child process
  packed-backend: check whether the "packed-refs" is sorted
  packed-backend: add "packed-refs" entry consistency check
  packed-backend: check whether the refname contains NUL characters
  packed-backend: add "packed-refs" header consistency check
  packed-backend: check if header starts with "# pack-refs with: "
  packed-backend: check whether the "packed-refs" is regular file
  builtin/refs: get worktrees without reading head information
  t0602: use subshell to ensure working directory unchanged
2025-03-26 16:26:10 +09:00
Patrick Steinhardt
a95da5c8ae refs/iterator: implement seeking for files iterators
Implement seeking for "files" iterators. As we simply use a ref-cache
iterator under the hood, the implementation is straightforward. Note
that we do not implement seeking on reflog iterators, same as with the
"reftable" backend.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-12 11:31:20 -07:00
Patrick Steinhardt
22600c0452 refs/iterator: implement seeking for packed-ref iterators
Implement seeking of `packed-ref` iterators. The implementation is again
straightforward, except that we cannot continue to use the prefix
iterator as we would otherwise not be able to reseek the iterator
anymore in case one first asks for an empty and then for a non-empty
prefix. Instead, we open-code the logic in `advance()`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-12 11:31:20 -07:00
Patrick Steinhardt
84e656919c refs/iterator: implement seeking for ref-cache iterators
Implement seeking of ref-cache iterators. This is done by splitting most
of the logic to seek iterators out of `cache_ref_iterator_begin()` and
putting it into `cache_ref_iterator_seek()` so that we can reuse the
logic.

Note that we cannot use the optimization anymore where we return an
empty ref iterator when there aren't any references, as otherwise it
wouldn't be possible to reseek the iterator to a different prefix that
may exist. This shouldn't be much of a performance concern though as we
now start to bail out early in case `advance()` sees that there are no
more directories to be searched.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-12 11:31:20 -07:00
Patrick Steinhardt
53de20c931 refs/iterator: implement seeking for reftable iterators
Implement seeking of reftable iterators. As the low-level reftable
iterators already support seeking, this change is straightforward. Two
notes though:

  - We do not support seeking on reflog iterators. It is unclear what
    seeking would even look like in this context, as you typically would
    want to seek to a specific entry in the reflog for a specific ref.
    There is currently no use case for this, but if one arises in the
    future, we can still implement seeking at that later point.

  - We start to check whether `reftable_stack_init_ref_iterator()` is
    successful.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-12 11:31:19 -07:00
Patrick Steinhardt
9821d90f13 refs/iterator: implement seeking for merged iterators
Implement seeking on merged iterators. The implementation is rather
straightforward, with the only exception that we must not deallocate
the underlying iterators once they have been exhausted.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-12 11:31:19 -07:00
Patrick Steinhardt
82c39c6055 refs/iterator: provide infrastructure to re-seek iterators
Reftable iterators need to be scrapped after they have either been
exhausted or aren't useful to the caller anymore, and it is explicitly
not possible to reuse them for iterations. But enabling reuse of
iterators may allow us to tune them by reusing internal state of an
iterator. The reftable iterators for example can already be reused
internally, but we're not able to expose this to any users outside of
the reftable backend.

Introduce a new `.seek` function in the ref iterator vtable that allows
callers to seek an iterator multiple times. It is expected to be
functionally the same as calling `refs_ref_iterator_begin()` with a
different (or the same) prefix.
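
As a rough illustration of what a `.seek` callback in an iterator vtable buys the caller, here is a toy iterator over a sorted array of refnames. All names (`toy_iter`, `toy_iter_vtable`, `toy_count_prefix()`) are hypothetical and the structure is far simpler than Git's `ref_iterator`; the point is only that one iterator can be re-seeked to several prefixes without being recreated.

```c
#include <stddef.h>
#include <string.h>

struct toy_iter;

/* Toy vtable; the callback names echo the commit message only. */
struct toy_iter_vtable {
    int (*seek)(struct toy_iter *it, const char *prefix);
    const char *(*advance)(struct toy_iter *it);
};

struct toy_iter {
    const struct toy_iter_vtable *vtable;
    const char *const *names; /* refnames, sorted bytewise */
    size_t nr, pos;
    const char *prefix;
};

/* Position the iterator at the first name that is >= prefix. */
static int array_seek(struct toy_iter *it, const char *prefix)
{
    it->prefix = prefix;
    it->pos = 0;
    while (it->pos < it->nr && strcmp(it->names[it->pos], prefix) < 0)
        it->pos++;
    return 0;
}

/* Yield the next name under the current prefix, or NULL when done. */
static const char *array_advance(struct toy_iter *it)
{
    if (it->pos >= it->nr)
        return NULL;
    if (strncmp(it->names[it->pos], it->prefix, strlen(it->prefix)))
        return NULL; /* left the range covered by the prefix */
    return it->names[it->pos++];
}

static const struct toy_iter_vtable array_vtable = {
    array_seek, array_advance
};

void toy_iter_init(struct toy_iter *it, const char *const *names, size_t nr)
{
    it->vtable = &array_vtable;
    it->names = names;
    it->nr = nr;
    array_seek(it, "");
}

/* Count the names under `prefix` by re-seeking the same iterator. */
size_t toy_count_prefix(struct toy_iter *it, const char *prefix)
{
    size_t n = 0;

    it->vtable->seek(it, prefix);
    while (it->vtable->advance(it))
        n++;
    return n;
}
```

Because the prefix is re-evaluated in `advance()`, seeking first to an empty prefix and then to a non-empty one works on the same object.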

Note that it is not possible to adjust parameters other than the seeked
prefix for now, so exclude patterns, trimmed prefixes and flags will
remain unchanged. We do not have a use case for changing these parameters
right now, but if we ever find one we can adapt accordingly.

Implement the callback for trivial cases. The other iterators will be
implemented in subsequent commits.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-12 11:31:19 -07:00
Patrick Steinhardt
cec2b6f55a refs/iterator: separate lifecycle from iteration
The ref and reflog iterators have their lifecycle attached to iteration:
once the iterator reaches its end, it is automatically released and the
caller doesn't have to care about that anymore. When the iterator should
be released before it has been exhausted, callers must explicitly abort
the iterator via `ref_iterator_abort()`.

This lifecycle is somewhat unusual in the Git codebase and creates two
problems:

  - Callsites need to be very careful about when exactly they call
    `ref_iterator_abort()`, as calling the function is only valid when
    the iterator itself still is. This leads to somewhat awkward calling
    patterns in some situations.

  - It is impossible to reuse iterators and re-seek them to a different
    prefix. This feature isn't supported by any iterator implementation
    except for the reftable iterators anyway, but if it was implemented
    it would allow us to optimize cases where we need to search for
    specific references repeatedly by reusing internal state.

Detangle the lifecycle from iteration so that we don't deallocate the
iterator anymore once it is exhausted. Instead, callers are now expected
to always call a newly introduced `ref_iterator_free()` function that
deallocates the iterator and its internal state.
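
The calling pattern this enables can be sketched with a toy iterator whose exhaustion no longer frees it. The `seq_iter_*` names below are hypothetical and unrelated to Git's actual structures; the sketch only shows that the object stays valid after the last element and that the caller releases it exactly once.

```c
#include <stdlib.h>

struct seq_iter {
    int *values;
    size_t nr, pos;
};

/* Allocate an iterator yielding 0, 1, ..., nr - 1. */
struct seq_iter *seq_iter_new(size_t nr)
{
    struct seq_iter *it = malloc(sizeof(*it));

    if (!it)
        return NULL;
    it->values = malloc((nr ? nr : 1) * sizeof(*it->values));
    if (!it->values) {
        free(it);
        return NULL;
    }
    for (size_t i = 0; i < nr; i++)
        it->values[i] = (int)i;
    it->nr = nr;
    it->pos = 0;
    return it;
}

/* Return 1 and store the next value, or 0 when exhausted. Never frees. */
int seq_iter_advance(struct seq_iter *it, int *out)
{
    if (it->pos >= it->nr)
        return 0;
    *out = it->values[it->pos++];
    return 1;
}

/* Reuse is possible because exhaustion left the iterator intact. */
void seq_iter_rewind(struct seq_iter *it)
{
    it->pos = 0;
}

/* The only place where memory is released; always called by the owner. */
void seq_iter_free(struct seq_iter *it)
{
    if (!it)
        return;
    free(it->values);
    free(it);
}
```

A caller loops on `seq_iter_advance()`, may rewind and loop again, and calls `seq_iter_free()` unconditionally at the end, whether or not the iteration ran to completion.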

Note that the `dir_iterator` is somewhat special because it does not
implement the `ref_iterator` interface, but is only used to implement
other iterators. Consequently, we have to provide `dir_iterator_free()`
instead of `dir_iterator_release()` as the allocated structure itself is
managed by the `dir_iterator` interfaces, as well, and not freed by
`ref_iterator_free()` like in all the other cases.

While at it, drop the return value of `ref_iterator_abort()`, which
wasn't really required by any of the iterator implementations anyway.
Furthermore, stop calling `base_ref_iterator_free()` in any of the
backends, but instead call it in `ref_iterator_free()`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-12 11:31:18 -07:00
Patrick Steinhardt
268ea8515c refs/files: batch refname availability checks for initial transactions
The "files" backend explicitly carves out special logic for its initial
transaction so that it can avoid writing out every single reference as
a loose reference. While the assumption is that there shouldn't be any
preexisting references, we still have to verify that none of the newly
written references will conflict with any other new reference in the
same transaction.

Refactor the initial transaction to use batched refname availability
checks. This does not yet have an effect on performance as we still call
`refs_verify_refname_available()` in a loop. Once the availability
checks have learned to optimize checks for many references in a
subsequent commit, this will improve performance when cloning a
repository with many references or when migrating references from any
format to the "files" format.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-12 11:31:18 -07:00
Patrick Steinhardt
6c90726beb refs/files: batch refname availability checks for normal transactions
Same as the "reftable" backend that we have adapted in the preceding
commit to use batched refname availability checks, we can also do so for
the "files" backend. Things are a bit more intricate here though, as we
call `refs_verify_refname_available()` in a set of different contexts:

  1. `lock_raw_ref()` when it hits either EEXIST or EISDIR when creating
     a new reference, mostly to create a nice, user-readable error
     message. This is nothing we have to care about too much, as we only
     hit this code path at most once when we hit a conflict.

  2. `lock_raw_ref()` when it _could_ create the lockfile to check
     whether it is conflicting with any packed refs. In the general case,
     this code path will be hit once for every (successful) reference
     update.

  3. `lock_ref_oid_basic()`, but it is only executed when copying or
     renaming references or when expiring reflogs. It will thus not be
     called in contexts where we have many references queued up.

  4. `refs_refname_ref_available()`, but again only when copying or
     renaming references. It is thus not interesting due to the same
     reason as the previous case.

  5. `files_transaction_finish_initial()`, which is only executed when
     creating a new repository or migrating references.

So out of these, only (2) and (5) are viable candidates to use the
batched checks.

Adapt `lock_raw_ref()` accordingly by queueing up reference names that
need to be checked for availability and then checking them after we have
processed all updates. This check is done before we (optionally) lock
the `packed-refs` file, which is somewhat flawed because it means that
the `packed-refs` could still change after the availability check and
thus create an undetected conflict. But unconditionally locking the file
would change semantics that users are likely to rely on, so we keep the
current locking sequence intact, even if it's suboptimal.

The refactoring of `files_transaction_finish_initial()` will be done in
the next commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-12 11:31:17 -07:00
Patrick Steinhardt
351f592e1d refs/reftable: batch refname availability checks
Refactor the "reftable" backend to batch the availability check for
refnames. This does not yet have an effect on performance as
`refs_verify_refnames_available()` effectively still performs the
availability check for each refname individually. But this will be
optimized in subsequent commits, where we learn to optimize some parts
of the logic when checking multiple refnames for availability.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-12 11:31:17 -07:00
Junio C Hamano
feffb34257 Merge branch 'ps/path-sans-the-repository'
The path.[ch] API takes an explicit repository parameter passed
throughout the callchain, instead of relying on the_repository
singleton instance.

* ps/path-sans-the-repository:
  path: adjust last remaining users of `the_repository`
  environment: move access to "core.sharedRepository" into repo settings
  environment: move access to "core.hooksPath" into repo settings
  repo-settings: introduce function to clear struct
  path: drop `git_path()` in favor of `repo_git_path()`
  rerere: let `rerere_path()` write paths into a caller-provided buffer
  path: drop `git_common_path()` in favor of `repo_common_path()`
  worktree: return allocated string from `get_worktree_git_dir()`
  path: drop `git_path_buf()` in favor of `repo_git_path_replace()`
  path: drop `git_pathdup()` in favor of `repo_git_path()`
  path: drop unused `strbuf_git_path()` function
  path: refactor `repo_submodule_path()` family of functions
  submodule: refactor `submodule_to_gitdir()` to accept a repo
  path: refactor `repo_worktree_path()` family of functions
  path: refactor `repo_git_path()` family of functions
  path: refactor `repo_common_path()` family of functions
2025-03-05 10:37:43 -08:00
Patrick Steinhardt
028f618658 path: adjust last remaining users of `the_repository`
With the preceding refactorings we now only have a couple of implicit
users of `the_repository` left in the "path" subsystem, all of which
depend on global state via `calc_shared_perm()`. Make the dependency on
`the_repository` explicit by passing the repo as a parameter instead and
adjust callers accordingly.

Note that this change bubbles up into a couple of subsystems that were
previously declared as free from `the_repository`. Instead of marking
all of them as `the_repository`-dependent again, we instead use the
repository that is available in the calling context. There are three
exceptions though with "copy.c", "pack-write.c" and "tempfile.c".
Adjusting these would require us to adapt callsites all over the place,
so this is left for a future iteration.

Mark "path.c" as free from `the_repository`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-28 13:54:11 -08:00
shejialuo
e1c9548eae packed-backend: check whether the "packed-refs" is sorted
When there is a "sorted" trait in the header of the "packed-refs" file,
it means that each entry is sorted increasingly by comparing the
refname. We should add checks to verify whether the "packed-refs" is
sorted in this case.

Update the "packed_fsck_ref_header" to know whether there is a "sorted"
trait in the header. It may seem that we could record all refnames
during the parsing process and then compare later. However, this is not
a good design due to the following reasons:

1. Because we need to store the state across the whole checking
   lifetime, we would consume a lot of memory if there are many entries
   in the "packed-refs" file.
2. We cannot reuse the existing compare function "cmp_packed_ref_records"
   which would cause repetition.

Because "cmp_packed_ref_records" needs an extra parameter "struct
snapshot", extract the common part into a new function
"cmp_packed_refname" so that the comparison can be reused.

Then, create a new function "packed_fsck_ref_sorted" to parse the file
again and use the new fsck message "packedRefUnsorted(ERROR)" to report
to the user if the file is not sorted.
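
A streaming check of this kind only needs to remember the previous refname, which avoids the memory cost mentioned above. The following is a minimal sketch, not the fsck code itself: `toy_packed_refs_sorted()` is a hypothetical name, the parser is deliberately naive (it takes everything after the first space as the refname), and treating equal neighbours as sorted is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Bytewise comparison of two refnames given as (pointer, length). */
static int cmp_names(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t min = alen < blen ? alen : blen;
    int c = memcmp(a, b, min);

    if (c)
        return c;
    return (alen > blen) - (alen < blen);
}

/*
 * Hypothetical sketch: return 1 if the "<oid> <refname>" lines in `buf`
 * are ordered by refname, 0 otherwise. Header lines ('#') and peeled
 * lines ('^') are skipped.
 */
int toy_packed_refs_sorted(const char *buf, size_t len)
{
    const char *p = buf, *end = buf + len;
    const char *prev = NULL;
    size_t prev_len = 0;

    while (p < end) {
        const char *eol = memchr(p, '\n', (size_t)(end - p));
        const char *line_end = eol ? eol : end;

        if (*p != '#' && *p != '^') {
            const char *sp = memchr(p, ' ', (size_t)(line_end - p));

            if (sp) {
                const char *name = sp + 1;
                size_t name_len = (size_t)(line_end - name);

                if (prev && cmp_names(prev, prev_len, name, name_len) > 0)
                    return 0;
                prev = name;
                prev_len = name_len;
            }
        }
        p = eol ? eol + 1 : end;
    }
    return 1;
}
```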

Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-27 14:03:09 -08:00
shejialuo
e6ba4c07b8 packed-backend: add "packed-refs" entry consistency check
"packed-backend.c::next_record" will parse the ref entry to check the
consistency. This function has already checked the following things:

1. Parse the main line of the ref entry and check whether the oid is
   valid. Then check whether the next character is whitespace. Then
   check the refname.
2. If the next line starts with '^', it would continue to parse the
   peeled oid and check whether the last character is '\n'.

As we decide to implement the ref consistency check for "packed-refs",
let's port these two checks and update the test to exercise the code.

Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-27 14:03:08 -08:00
shejialuo
5637d55420 packed-backend: check whether the refname contains NUL characters
"packed-backend.c::next_record" will use "check_refname_format" to check
the consistency of the refname. If it is not OK, the program will die.
However, as reported in [1], we cannot catch some corruption. As we
already have this code path, we must be missing something.

We use the following code to get the refname:

    strbuf_add(&iter->refname_buf, p, eol - p);
    iter->base.refname = iter->refname_buf.buf;

In the above code, `p` is the start pointer of the refname and `eol` is
the next newline pointer. We calculate the length of the refname by
subtracting the two pointers. Then we add the memory range between `p`
and `eol` to get the refname.

However, if there are some NUL characters in the memory range between `p`
and `eol`, we will see the refname as a valid ref name as long as the
memory range between `p` and the first NUL character is valid.

In order to catch the above corruption, create a new function
"refname_contains_nul" that searches for the first NUL character. If it
is not at the end of the string, there must be some NUL characters in
the refname.

Use this function in the "next_record" function to make the program die
if "refname_contains_nul" returns true.
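
The idea can be sketched on the raw `p`/`eol` range from the snippet above. This is a hypothetical stand-alone variant (`toy_refname_contains_nul()` is not the function added by the commit, which operates on the strbuf): any NUL byte found before `eol` means the C string seen by later checks is shorter than the data on disk.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch: return 1 if the refname bytes in [p, eol)
 * contain an embedded NUL character, 0 otherwise.
 */
int toy_refname_contains_nul(const char *p, const char *eol)
{
    return memchr(p, '\0', (size_t)(eol - p)) != NULL;
}
```

For the bytes `refs/heads/ma\0in`, a check based on `strlen()` would only see `refs/heads/ma`, which looks like a valid refname; scanning the full range catches the embedded NUL.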

[1] https://lore.kernel.org/git/6cfee0e4-3285-4f18-91ff-d097da9de737@rd10.de/

Reported-by: R. Diez <rdiez-temp3@rd10.de>
Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-27 14:03:08 -08:00
shejialuo
c92e7e156e packed-backend: add "packed-refs" header consistency check
In "packed-backend.c::create_snapshot", if there is a header (the line
which starts with '#'), we will check whether the line starts with "#
pack-refs with: ". However, we need to consider other situations and
discuss whether we need to add checks.

1. If the header does not exist, we should not report an error to the
   user. This is because older Git versions never wrote a header in
   the "packed-refs" file. Also, we do allow "packed-refs" without a
   header at runtime.
2. If the header content does not start with "# pack-refs with: ", we
   should report an error just like what "create_snapshot" does. So,
   create a new fsck message "badPackedRefHeader(ERROR)" for this.
3. If the header content is not the same as the constant string
   "PACKED_REFS_HEADER", this is expected because we make it extensible
   intentionally and runtime "create_snapshot" won't complain about
   unknown traits. In order to align with the runtime behavior, there
   is no need to report.
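
The three cases can be summarized in a small classifier. This is a sketch under stated assumptions: `toy_check_header()`, the enum, and `TOY_HEADER_PREFIX` are hypothetical names, and only the first line of the file is inspected.

```c
#include <string.h>

#define TOY_HEADER_PREFIX "# pack-refs with: "

enum toy_header_status {
    TOY_HEADER_ABSENT, /* case 1: no header, nothing to report */
    TOY_HEADER_OK,     /* case 3: known prefix, traits are not judged */
    TOY_HEADER_BAD     /* case 2: header line with the wrong prefix */
};

/* Classify the first line of a "packed-refs" file. */
enum toy_header_status toy_check_header(const char *first_line)
{
    if (first_line[0] != '#')
        return TOY_HEADER_ABSENT;
    if (strncmp(first_line, TOY_HEADER_PREFIX, strlen(TOY_HEADER_PREFIX)))
        return TOY_HEADER_BAD;
    return TOY_HEADER_OK;
}
```

Only `TOY_HEADER_BAD` would map to an fsck error; a header carrying unknown traits still counts as fine, matching the runtime behavior described above.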

As we have analyzed, we only need to check case 2 above. In
order to do this, use "open_nofollow" function to get the file
descriptor and then read the "packed-refs" file via "strbuf_read". Like
what "create_snapshot" and other functions do, we could split the line
by finding the next newline in the buffer. When we cannot find a
newline, we could report an error.

So, create a function "packed_fsck_ref_next_line" to find the next
newline and if there is no such newline, use
"packedRefEntryNotTerminated(ERROR)" to report an error to the user.

Then, parse the first line to apply the checks. Update the test to
exercise the code.

Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-27 14:03:08 -08:00
shejialuo
515579756c packed-backend: check if header starts with "# pack-refs with: "
We always write a space after "# pack-refs with:" but we don't align
with this rule in the "create_snapshot" method where we would check
whether the header starts with "# pack-refs with:". It might seem that we
should undoubtedly tighten this rule. However, we don't have any
technical documentation about this and there is a possibility that we
would break the compatibility for other third-party libraries.

By investigating influential third-party libraries, we could conclude
how these libraries handle the header of "packed-refs" file:

1. libgit2 is fine and always writes the space. It also expects the
   whitespace to exist.
2. JGit does not expect the header to have a trailing space, but expects
   the "peeled" capability to have a leading space, which is mostly
   equivalent because that capability is typically the first one we
   write. It always writes the space.
3. gitoxide expects the space to exist and writes it.
4. go-git doesn't create the header by default.

As many third-party libraries expect a single space after "# pack-refs
with:", if we forget to write the space after the colon,
"create_snapshot" won't catch this. And we would break other
re-implementations. So, we'd better tighten the rule by checking whether
the header starts with "# pack-refs with: ".

Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-27 14:03:07 -08:00
shejialuo
cfea2f2da8 packed-backend: check whether the "packed-refs" is regular file
Although "git-fsck(1)" and "packed-backend.c" will check some
consistency and correctness of "packed-refs" file, they never check the
filetype of the "packed-refs". Let's verify that the "packed-refs" has
the expected filetype, confirming it is created by "git pack-refs"
command.

We could use "open_nofollow" wrapper to open the raw "packed-refs" file.
If the returned "fd" value is less than 0, we could check whether the
"errno" is "ELOOP" to report an error to the user. And then we use
"fstat" to check whether the "packed-refs" file is a regular file.

Reuse "FSCK_MSG_BAD_REF_FILETYPE" fsck message id to report the error to
the user if "packed-refs" is not a regular file.
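
The open-then-fstat sequence can be sketched with plain POSIX calls. This is an illustration only: it uses `open()` with `O_NOFOLLOW` directly instead of Git's "open_nofollow" wrapper, and `toy_check_filetype()` and its enum are hypothetical names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

enum toy_filetype {
    TOY_REGULAR,
    TOY_SYMLINK,     /* open() refused to follow a symbolic link */
    TOY_NOT_REGULAR, /* directory, FIFO, device, ... */
    TOY_OPEN_ERROR   /* any other failure, e.g. missing file */
};

/* Open `path` without following symlinks and inspect its file type. */
enum toy_filetype toy_check_filetype(const char *path)
{
    struct stat st;
    enum toy_filetype result;
    int fd = open(path, O_RDONLY | O_NOFOLLOW);

    if (fd < 0)
        return errno == ELOOP ? TOY_SYMLINK : TOY_OPEN_ERROR;
    if (fstat(fd, &st) < 0)
        result = TOY_OPEN_ERROR;
    else
        result = S_ISREG(st.st_mode) ? TOY_REGULAR : TOY_NOT_REGULAR;
    close(fd);
    return result;
}
```

Calling `fstat()` on the already opened descriptor, rather than `stat()` on the path, makes sure the type that is checked belongs to the file that was actually opened.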

Mentored-by: Patrick Steinhardt <ps@pks.im>
Mentored-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-27 14:03:07 -08:00
Junio C Hamano
82522a9e2c Merge branch 'kn/reflog-migration-fix-followup'
Code clean-up.

* kn/reflog-migration-fix-followup:
  reftable: prevent 'update_index' changes after adding records
  refs: use 'uint64_t' for 'ref_update.index'
  refs: mark `ref_transaction_update_reflog()` as static
2025-02-14 17:53:48 -08:00
Junio C Hamano
1f124f3024 Merge branch 'kn/reflog-migration-fix-fix'
Fix bugs in an earlier attempt to fix "git refs migration".

* kn/reflog-migration-fix-fix:
  refs/reftable: fix uninitialized memory access of `max_index`
  reftable: write correct max_update_index to header
2025-02-03 10:23:35 -08:00
Junio C Hamano
d205f06ae0 Merge branch 'kn/reflog-symref-fix'
reflog entries for symbolic ref updates were broken, which has been
corrected.

* kn/reflog-symref-fix:
  refs: fix creation of reflog entries for symrefs
2025-01-29 14:05:10 -08:00
Karthik Nayak
f11f0a5a2d refs/reftable: fix uninitialized memory access of `max_index`
When migrating reflogs between reference backends, maintaining the
original order of the reflog entries is crucial. To achieve this, an
`index` field is stored within the `ref_update` struct that encodes the
relative order of reflog entries. This field is used by the reftable
backend as update index for the respective reflog entries to maintain
that ordering.

These update indices must be respected when writing table headers, which
encode the minimum and maximum update index of contained records in the
header and footer. This logic was added in commit bc67b4ab5f (reftable:
write correct max_update_index to header, 2025-01-15), which started to
use `reftable_writer_set_limits()` to propagate the minimum and maximum
update index of all records contained in a ref transaction.

However, we only set the maximum update index for the first transaction
argument, even though there can be multiple such arguments. This is the
case when we write to multiple stacks in a single transaction, e.g. when
updating references in two different worktrees at once. Consequently,
the update index for all but the first argument remains uninitialized,
which may cause undefined behaviour.

Fix this by moving the assignment of the maximum update index in
`reftable_be_transaction_finish()` inside the loop, which ensures that
all elements of the array are correctly initialized.

Furthermore, initialize the `max_index` field to 0 when queueing a new
transaction argument. This is not strictly necessary, as all elements of
`write_transaction_table_arg.max_index` are now assigned correctly.
However, this initialization is added for consistency and to safeguard
against potential future changes that might inadvertently introduce
uninitialized memory access.

Reported-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-27 08:21:41 -08:00
Karthik Nayak
3519492430 refs: fix creation of reflog entries for symrefs
The commit 297c09eabb (refs: allow multiple reflog entries for the
same refname, 2024-12-16) added logic to exit early in
`lock_ref_for_update()` after obtaining the required lock. This was
added as a performance optimization on a false assumption that no
further processing was required for reflog-only updates.

However the assumption was wrong.  For a symref's reflog entry, the
update needs to be populated with the old_oid value, but the early
exit skipped this necessary step.

This caused a bug in Git 2.48 in the files backend where target
references of symrefs being updated would create a corrupted reflog
entry for the symref since the old_oid is not populated.

Everything the early exit skipped in the code path is necessary for
both regular and symbolic refs, so eliminate the mistaken
optimization, and also add a test to ensure that such an issue
doesn't arise in the future.

Reported-by: Nika Layzell <nika@thelayzells.com>
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-23 09:56:22 -08:00
Karthik Nayak
017bd89239 reftable: prevent 'update_index' changes after adding records
The function `reftable_writer_set_limits()` allows updating the
'min_update_index' and 'max_update_index' of a reftable writer. These
values are written to both the writer's header and footer.

Since the header is written during the first block write, any subsequent
changes to the update index would create a mismatch between the header
and footer values. The footer would contain the newer values while the
header retained the original ones.

To protect against this bug, prevent callers from updating these values
after any record is written. To do this, modify the function to return
an error whenever the limits are modified after a record has been added.
Detect added records within `reftable_writer_set_limits()` by checking
the `last_key` and `next` variables. The former is updated after each
record is added, but is reset at certain points. The latter is set after
writing the first block.
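
The guard can be sketched with a toy writer. The struct below is an assumption for illustration: it tracks "has anything been written" with two simple fields rather than the `last_key` and `next` members of the real writer, and `toy_writer_*` and `TOY_API_ERROR` are hypothetical names.

```c
#include <stdint.h>

#define TOY_API_ERROR (-1)

struct toy_writer {
    uint64_t min_update_index;
    uint64_t max_update_index;
    uint64_t records;    /* records added so far */
    int header_written;  /* set once the first block has been flushed */
};

/* Refuse to change the limits once they may already be in the header. */
int toy_writer_set_limits(struct toy_writer *w, uint64_t min, uint64_t max)
{
    if (w->records || w->header_written)
        return TOY_API_ERROR;
    w->min_update_index = min;
    w->max_update_index = max;
    return 0;
}

/* Stand-in for adding a ref or log record to the writer. */
void toy_writer_add_record(struct toy_writer *w)
{
    w->records++;
}
```

Since the function now returns a status, every caller has to check it instead of assuming that the limits were applied.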

Modify all callers of the function to anticipate a return value and
handle it accordingly. Add a unit test to also ensure the function
returns the error as expected.

Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-22 09:51:36 -08:00
Karthik Nayak
e7c1b9f123 refs: use 'uint64_t' for 'ref_update.index'
The 'ref_update.index' variable is used to store an index for a given
reference update. This index is used to order the updates in a
predetermined order, while the default ordering is alphabetical as per
the refname.

For large repositories with millions of references, it should be safer
to use 'uint64_t'. Let's do that. This is also applied to all other
code sections where we store 'index' and pass it around.
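
One practical consequence of a 64-bit index is that comparators must not subtract two indices and truncate the result to `int`. The sketch below assumes, purely for illustration, that updates are ordered by refname first with the index as a tie-breaker; `toy_update` and `toy_sort_updates()` are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct toy_update {
    const char *refname;
    uint64_t index;
};

static int cmp_updates(const void *a, const void *b)
{
    const struct toy_update *x = a, *y = b;
    int c = strcmp(x->refname, y->refname);

    if (c)
        return c;
    /* Compare explicitly: `x->index - y->index` cast to int could wrap. */
    return (x->index > y->index) - (x->index < y->index);
}

/* Sort updates by refname, then by index (assumed ordering). */
void toy_sort_updates(struct toy_update *updates, size_t nr)
{
    qsort(updates, nr, sizeof(*updates), cmp_updates);
}
```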

Reported-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-22 09:51:36 -08:00
Junio C Hamano
0f3d8e2e46 Merge branch 'kn/reflog-migration-fix' into kn/reflog-migration-fix-followup
* kn/reflog-migration-fix:
  reftable: write correct max_update_index to header
2025-01-17 15:42:58 -08:00
Karthik Nayak
bc67b4ab5f reftable: write correct max_update_index to header
In 297c09eabb (refs: allow multiple reflog entries for the same refname,
2024-12-16), the reftable backend learned to handle multiple reflog
entries within the same transaction. This was done by modifying the
`update_index` for reflogs with multiple indices. During writing the
logs, the `max_update_index` of the writer was modified to ensure the
limits were raised to the modified `update_index`s.

However, since ref entries are written before the modification to the
`max_update_index`, if there are multiple blocks to be written, the
reftable backend writes the header with the old `max_update_index`. When
all logs are finally written, the footer will be written with the new
`max_update_index`. This causes a mismatch between the header and the
footer and causes the reftable file to be corrupted. The existing tests
only spawn a single block and since headers are lazily written with the
first block, the tests didn't capture this bug.

To fix the issue, the appropriate `max_update_index` limit must be set
even before the first block is written. Add a `max_index` field to the
transaction which holds the `max_index` within all its updates, then
propagate this value to the reftable backend, where it is used to
set the `max_update_index` correctly.

Add a test which creates a few thousand reference updates with multiple
reflog entries, which should trigger the bug.

Reported-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-15 09:12:09 -08:00
Junio C Hamano
6f8ae955bd Merge branch 'kn/reflog-migration'
"git refs migrate" learned to also migrate the reflog data across
backends.

* kn/reflog-migration:
  refs: mark invalid refname message for translation
  refs: add support for migrating reflogs
  refs: allow multiple reflog entries for the same refname
  refs: introduce the `ref_transaction_update_reflog` function
  refs: add `committer_info` to `ref_transaction_add_update()`
  refs: extract out refname verification in transactions
  refs/files: add count field to ref_lock
  refs: add `index` field to `struct ref_update`
  refs: include committer info in `ref_update` struct
2024-12-23 09:32:29 -08:00
Junio C Hamano
4156b6a741 Merge branch 'ps/build-sign-compare'
Start working to make the codebase buildable with -Wsign-compare.

* ps/build-sign-compare:
  t/helper: don't depend on implicit wraparound
  scalar: address -Wsign-compare warnings
  builtin/patch-id: fix type of `get_one_patchid()`
  builtin/blame: fix type of `length` variable when emitting object ID
  gpg-interface: address -Wsign-comparison warnings
  daemon: fix type of `max_connections`
  daemon: fix loops that have mismatching integer types
  global: trivial conversions to fix `-Wsign-compare` warnings
  pkt-line: fix -Wsign-compare warning on 32 bit platform
  csum-file: fix -Wsign-compare warning on 32-bit platform
  diff.h: fix index used to loop through unsigned integer
  config.mak.dev: drop `-Wno-sign-compare`
  global: mark code units that generate warnings with `-Wsign-compare`
  compat/win32: fix -Wsign-compare warning in "wWinMain()"
  compat/regex: explicitly ignore "-Wsign-compare" warnings
  git-compat-util: introduce macros to disable "-Wsign-compare" warnings
2024-12-23 09:32:11 -08:00
Junio C Hamano
5f212684ab Merge branch 'bf/set-head-symref'
When "git fetch $remote" notices that refs/remotes/$remote/HEAD is
missing and discovers what branch the other side points at with its
HEAD, refs/remotes/$remote/HEAD is updated to point to it.

* bf/set-head-symref:
  fetch set_head: handle mirrored bare repositories
  fetch: set remote/HEAD if it does not exist
  refs: add create_only option to refs_update_symref_extended
  refs: add TRANSACTION_CREATE_EXISTS error
  remote set-head: better output for --auto
  remote set-head: refactor for readability
  refs: atomically record overwritten ref in update_symref
  refs: standardize output of refs_read_symbolic_ref
  t/t5505-remote: test failure of set-head
  t/t5505-remote: set default branch to main
2024-12-19 10:58:27 -08:00
Karthik Nayak
297c09eabb refs: allow multiple reflog entries for the same refname
The reference transaction only allows a single update for a given
reference to avoid conflicts. This, however, isn't an issue for reflogs.
There are no conflicts to be resolved in reflogs and when migrating
reflogs between backends we'd have multiple reflog entries for the same
refname.

So allow multiple reflog updates within a single transaction. Also the
reflog creation logic isn't exposed to the end user. While this might
change in the future, currently, this reduces the scope of issues to
think about.

In the reftable backend, the writer sorts all updates based on the
update_index before writing to the block. When there are multiple
reflogs for a given refname, it is essential that the order of the
reflogs is maintained. So add the `index` value to the `update_index`.
The `index` field is only set when multiple reflog entries for a given
refname are added and as such in most scenarios the old behavior
remains.
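
As a rough sketch of the idea, assuming a simplified model that is not
Git's actual code: each queued reflog entry gets its own `update_index`
by offsetting a shared base with the entry's `index`. The struct and
function names below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one queued reflog update; not Git's struct. */
struct toy_reflog_update {
	const char *refname;
	uint64_t index;        /* position among entries for the same refname */
	uint64_t update_index; /* filled in by toy_assign_update_indices() */
};

/*
 * Give every entry a distinct update_index by offsetting a shared base
 * with the entry's own index. Returns the largest value handed out.
 */
uint64_t toy_assign_update_indices(struct toy_reflog_update *updates,
				   size_t nr, uint64_t base)
{
	uint64_t max = base;

	for (size_t i = 0; i < nr; i++) {
		updates[i].update_index = base + updates[i].index;
		if (updates[i].update_index > max)
			max = updates[i].update_index;
	}
	return max;
}
```

Entries whose `index` is zero keep the base value, which mirrors the
statement above that the old behavior remains in most scenarios.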

This is required to add reflog migration support to `git refs migrate`.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-16 09:45:34 -08:00
Karthik Nayak
84675fa271 refs: introduce the ref_transaction_update_reflog function
Introduce a new function `ref_transaction_update_reflog`, for clients to
add a reflog update to a transaction. While the existing function
`ref_transaction_update` also allows clients to add a reflog entry, this
function does a few more things. It:
  - Enforces that only a reflog entry is added and does not update the
  ref itself.
  - Allows the users to also provide the committer information. This
  means clients can add reflog entries with custom committer
  information.

The `transaction_refname_valid()` function also modifies the error
message selectively based on the type of the update. This change also
affects reflog updates which go through `ref_transaction_update()`.

A follow up commit will utilize this function to add reflog support to
`git refs migrate`.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-16 09:45:34 -08:00
Karthik Nayak
4483be36f4 refs: add committer_info to ref_transaction_add_update()
The `ref_transaction_add_update()` function creates the `ref_update`
struct. To facilitate the addition of reflogs in the next commit, the
function needs to accommodate setting the `committer_info` field in the
struct. So modify the function to also take `committer_info` as an
argument and set it accordingly.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-16 09:45:33 -08:00
Karthik Nayak
611986f300 refs/files: add count field to ref_lock
When refs are updated in the files-backend, a lock is obtained for the
corresponding file path. This is the case even for reflogs, i.e. a lock
is obtained on the reference path instead of the reflog path. This
works, since generally, reflogs are updated alongside the ref.

The upcoming patches will add support for reflog updates in ref
transaction. This means, in a particular transaction we want to have ref
updates and reflog updates. For a given ref in a given transaction there
can be at most one update. But we can theoretically have multiple reflog
updates for a given ref in a given transaction. A great example of this
would be when migrating reflogs from one backend to another. There we
would batch all the reflog updates for a given reference in a single
transaction.

The current flow does not support this, because currently refs & reflogs
are treated as a single entity and capture the lock together. To
separate this, add a count field to ref_lock. With this, multiple
updates can hold onto a single ref_lock and the lock will only be
released when all of them release the lock.

This patch only adds the `count` field to `ref_lock` and adds the logic
to increment and decrement the lock. In a follow up commit, we'll
separate the reflog update logic from ref updates and utilize this
functionality.
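
A minimal sketch of such a counted lock, assuming a toy model in which
the lock is just a heap object: `struct toy_lock` and its helpers are
hypothetical and do not reflect the real `ref_lock` code, which also
manages a lockfile.

```c
#include <stdlib.h>

/* Hypothetical counted lock; the real ref_lock also wraps a lockfile. */
struct toy_lock {
	unsigned int count; /* number of updates currently holding the lock */
};

/*
 * Take the lock. If another update for the same ref already holds it,
 * share that lock and bump the count instead of locking a second time.
 */
struct toy_lock *toy_lock_acquire(struct toy_lock *existing)
{
	struct toy_lock *lock;

	if (existing) {
		existing->count++;
		return existing;
	}
	lock = calloc(1, sizeof(*lock));
	if (!lock)
		return NULL;
	lock->count = 1;
	return lock;
}

/* Drop one holder. Returns 1 once the last holder is gone and the lock is freed. */
int toy_lock_release(struct toy_lock *lock)
{
	if (--lock->count)
		return 0;
	free(lock);
	return 1;
}
```

With this shape, a ref update and several reflog updates for the same
ref can each call the release function, and only the last call actually
gives up the lock.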

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-16 09:45:33 -08:00
Karthik Nayak
a3582e2eac refs: add index field to struct ref_udpate
The reftable backend sorts its updates by refname before applying them;
this ensures that the references are stored sorted. When migrating
reflogs from one backend to another, the order of the reflogs must be
maintained. Add a new `index` field to the `ref_update` struct to
facilitate this.

This field is used in the reftable backend's sort comparison function
`transaction_update_cmp`, to ensure that indexed fields maintain their
order.
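
A comparison function of this kind might look roughly like the sketch
below. It is an illustration only: `struct toy_update` and
`toy_update_cmp()` are hypothetical names, and the real
`transaction_update_cmp` operates on Git's own structures. Since
`qsort()` gives no ordering guarantee for equal elements, the `index`
serves as the tie-breaker for entries sharing a refname.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a queued update; not Git's struct ref_update. */
struct toy_update {
	const char *refname;
	uint64_t index; /* insertion order among updates for one refname */
};

/* Order by refname first, then by index so same-named entries keep their order. */
int toy_update_cmp(const void *a, const void *b)
{
	const struct toy_update *ua = a, *ub = b;
	int cmp = strcmp(ua->refname, ub->refname);

	if (cmp)
		return cmp;
	return (ua->index > ub->index) - (ua->index < ub->index);
}
```

Passing `toy_update_cmp` to `qsort()` groups the updates by refname
while keeping entries for one refname in ascending `index` order.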

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-16 09:45:32 -08:00
Karthik Nayak
1a83e26d72 refs: include committer info in ref_update struct
The reference backends obtain the committer information from
`git_committer_info(0)` when adding a reflog. The upcoming patches
introduce support for migrating reflogs between the reference backends.
This requires an interface for creating reflogs, including custom
committer information.

Add a new field `committer_info` to the `ref_update` struct, which is
then used by the reference backends. If there is no `committer_info`
provided, the reference backends default to using
`git_committer_info(0)`. The field itself cannot be set to
`git_committer_info(0)` since the values are dynamic and must be
obtained right when the reflog is being committed.
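
The fallback can be sketched as follows, under the assumption of a toy
model: `toy_default_committer()` is a hypothetical stand-in for
`git_committer_info(0)`, and `struct toy_update` is not Git's
`ref_update`. The default is looked up only when the reflog entry is
written, not when the update is queued.

```c
#include <stddef.h>

/* Hypothetical stand-in for git_committer_info(0). */
const char *toy_default_committer(void)
{
	return "Default Committer <default@example.com> 0 +0000";
}

/* Hypothetical update; a NULL committer_info means "use the default". */
struct toy_update {
	const char *committer_info;
};

/* Resolve the identity at write time, preferring the custom value if set. */
const char *toy_resolve_committer(const struct toy_update *update)
{
	return update->committer_info ? update->committer_info
				      : toy_default_committer();
}
```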

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-16 09:45:32 -08:00
Junio C Hamano
7041902dfa Merge branch 'ps/reftable-iterator-reuse'
Optimize reading random references out of the reftable backend by
allowing reuse of iterator objects.

* ps/reftable-iterator-reuse:
  refs/reftable: reuse iterators when reading refs
  reftable/merged: drain priority queue on reseek
  reftable/stack: add mechanism to notify callers on reload
  refs/reftable: refactor reflog expiry to use reftable backend
  refs/reftable: refactor reading symbolic refs to use reftable backend
  refs/reftable: read references via `struct reftable_backend`
  refs/reftable: figure out hash via `reftable_stack`
  reftable/stack: add accessor for the hash ID
  refs/reftable: handle reloading stacks in the reftable backend
  refs/reftable: encapsulate reftable stack
2024-12-10 10:04:58 +09:00
Junio C Hamano
de9278127e Merge branch 'ps/reftable-detach'
Isolates the reftable subsystem from the rest of Git's codebase by
using fewer pieces of Git's infrastructure.

* ps/reftable-detach:
  reftable/system: provide thin wrapper for lockfile subsystem
  reftable/stack: drop only use of `get_locked_file_path()`
  reftable/system: provide thin wrapper for tempfile subsystem
  reftable/stack: stop using `fsync_component()` directly
  reftable/system: stop depending on "hash.h"
  reftable: explicitly handle hash format IDs
  reftable/system: move "dir.h" to its only user
2024-12-10 10:04:56 +09:00
Patrick Steinhardt
80c9e70ebe global: trivial conversions to fix -Wsign-compare warnings
We have a bunch of loops which iterate up to an unsigned boundary using
a signed index, which generates warnings because we compare a signed and
unsigned value in the loop condition. Address these sites for trivial
cases and enable `-Wsign-compare` warnings for these code units.

This patch only adapts those code units where we can drop the
`DISABLE_SIGN_COMPARE_WARNINGS` macro in the same step.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-06 20:20:04 +09:00
Patrick Steinhardt
41f43b8243 global: mark code units that generate warnings with -Wsign-compare
Mark code units that generate warnings with `-Wsign-compare`. This
allows for a structured approach to get rid of all such warnings over
time in a way that can be easily measured.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-06 20:20:02 +09:00
Junio C Hamano
b4269ebf35 Merge branch 'sj/refs-symref-referent-fix'
A double-free that may not trigger in practice by luck has been
corrected in the reference resolution code.

* sj/refs-symref-referent-fix:
  ref-cache: fix invalid free operation in `free_ref_entry`
2024-12-06 13:23:16 +09:00
Junio C Hamano
57e81b59f3 Merge branch 'sj/ref-contents-check'
"git fsck" learned to issue warnings on "curiously formatted" ref
contents that have always been taken valid but something Git
wouldn't have written itself (e.g., missing terminating end-of-line
after the full object name).

* sj/ref-contents-check:
  ref: add symlink ref content check for files backend
  ref: check whether the target of the symref is a ref
  ref: add basic symref content check for files backend
  ref: add more strict checks for regular refs
  ref: port git-fsck(1) regular refs check for files backend
  ref: support multiple worktrees check for refs
  ref: initialize ref name outside of check functions
  ref: check the full refname instead of basename
  ref: initialize "fsck_ref_report" with zero
2024-12-04 10:14:42 +09:00
shejialuo
b6318cf23a ref-cache: fix invalid free operation in free_ref_entry
In cfd971520e (refs: keep track of unresolved reference value in
iterators, 2024-08-09), we added a new field "referent" into the "struct
ref" structure. In order to free the "referent", we unconditionally
freed the "referent" by simply adding a "free" statement.

However, this is incorrect usage, because whether the ref entry is a
directory or a loose ref, we will always execute the following statement:

  free(entry->u.value.referent);

This does not make sense. We should never access the "entry->u.value"
field when "entry" is a directory. However, the change obviously doesn't
break the tests. Let's analyze why.

The anonymous union in the "ref_entry" has two members: one is "struct
ref_value", another is "struct ref_dir". On a 64-bit machine, the size
of "struct ref_dir" is 32 bytes, which is smaller than the 48-byte size
of "struct ref_value". And the offset of "referent" field in "struct
ref_value" is 40 bytes. So, whenever we create a new "ref_entry" for a
directory, we will leave the offset from 40 bytes to 48 bytes untouched,
which means the value for this memory is zero (NULL). It's OK to free a
NULL pointer, but this is merely a coincidence of memory layout.

To fix this issue, we now ensure that "free(entry->u.value.referent)" is
only called when "entry->flag" indicates that it represents a loose
reference and not a directory to avoid the invalid memory operation.
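
The shape of the fix can be illustrated with a simplified tagged union.
This is a sketch under assumptions: the `toy_*` types, their field
sizes, and the `TOY_REF_DIR` flag value are hypothetical and do not
match the real `ref_entry` layout. The point is that the flag decides
which union member may be touched.

```c
#include <stdlib.h>

#define TOY_REF_DIR 0x01 /* hypothetical flag value */

/* Simplified stand-ins for the two union members. */
struct toy_ref_value {
	char oid[40];
	char *referent;
};

struct toy_ref_dir {
	int nr;
	void **entries;
};

struct toy_ref_entry {
	unsigned char flag;
	union {
		struct toy_ref_value value; /* valid unless TOY_REF_DIR is set */
		struct toy_ref_dir subdir;  /* valid only if TOY_REF_DIR is set */
	} u;
};

int toy_entry_is_dir(const struct toy_ref_entry *entry)
{
	return (entry->flag & TOY_REF_DIR) != 0;
}

/* Only access the union member that the flag says is active. */
void toy_free_ref_entry(struct toy_ref_entry *entry)
{
	if (toy_entry_is_dir(entry))
		free(entry->u.subdir.entries);
	else
		free(entry->u.value.referent);
	free(entry);
}
```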

Signed-off-by: shejialuo <shejialuo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-27 04:34:37 +09:00
Patrick Steinhardt
7cf65e2660 refs/reftable: reuse iterators when reading refs
When reading references the reftable backend has to:

  1. Create a new ref iterator.

  2. Seek the iterator to the record we're searching for.

  3. Read the record.

We cannot really avoid the last two steps, but re-creating the iterator
every single time we want to read a reference is kind of expensive and a
waste of resources. We couldn't help it in the past though because it
was not possible to reuse iterators. But starting with 5bf96e0c39
(reftable/generic: move seeking of records into the iterator,
2024-05-13) we have split up the iterator lifecycle such that creating
the iterator and seeking are two different concerns.

Refactor the code such that we cache iterators in the reftable backend.
This cache is invalidated whenever the respective stack is reloaded such
that we know to recreate the iterator in that case. This leads to a
sizeable speedup when creating many refs, which requires a lot of random
reference reads:

    Benchmark 1: update-ref: create many refs (refcount = 100000, revision = master)
      Time (mean ± σ):      1.793 s ±  0.010 s    [User: 0.954 s, System: 0.835 s]
      Range (min … max):    1.781 s …  1.811 s    10 runs

    Benchmark 2: update-ref: create many refs (refcount = 100000, revision = HEAD)
      Time (mean ± σ):      1.680 s ±  0.013 s    [User: 0.846 s, System: 0.831 s]
      Range (min … max):    1.664 s …  1.702 s    10 runs

    Summary
      update-ref: create many refs (refcount = 100000, revision = HEAD) ran
        1.07 ± 0.01 times faster than update-ref: create many refs (refcount = 100000, revision = master)

While 7% is not a huge win, you have to consider that the benchmark is
_writing_ data, so _reading_ references is only one part of what we do.
Flame graphs show that we spend around 40% of our time reading refs, so
the speedup when reading refs is roughly 2.5x that. I could not
find better benchmarks where we perform a lot of random ref reads.

You can also see a sizeable impact on memory usage when creating 100k
references. Before this change:

    HEAP SUMMARY:
        in use at exit: 19,112,538 bytes in 200,170 blocks
      total heap usage: 8,400,426 allocs, 8,200,256 frees, 454,367,048 bytes allocated

After this change:

    HEAP SUMMARY:
        in use at exit: 674,416 bytes in 169 blocks
      total heap usage: 7,929,872 allocs, 7,929,703 frees, 281,509,985 bytes allocated

As an additional factor, this refactoring opens up the possibility for
more performance optimizations in how we re-seek iterators. Any change
that allows us to optimize re-seeking by e.g. reusing data structures
would thus also directly speed up random reads.
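
The caching scheme described above can be sketched roughly as follows.
This is a toy model under stated assumptions: a sorted array of names
stands in for the reftable stack, and `struct toy_backend`,
`struct toy_iter` and the `toy_*` functions are hypothetical rather than
the backend's real interface. The iterator is created once, re-seeked
for every read, and dropped when the underlying data is reloaded.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical iterator over a sorted array of refnames. */
struct toy_iter {
	size_t pos;
};

/* Hypothetical backend; `names` must be sorted and stands in for the stack. */
struct toy_backend {
	const char **names;
	size_t nr;
	struct toy_iter *cached_it; /* reused across reads */
	unsigned iters_created;     /* instrumentation for this sketch only */
};

/* Called whenever the data is reloaded: the cached iterator is now stale. */
void toy_backend_reload(struct toy_backend *be, const char **names, size_t nr)
{
	free(be->cached_it);
	be->cached_it = NULL;
	be->names = names;
	be->nr = nr;
}

/* Position the iterator at the first name that is >= refname. */
void toy_iter_seek(const struct toy_backend *be, struct toy_iter *it,
		   const char *refname)
{
	size_t lo = 0, hi = be->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (strcmp(be->names[mid], refname) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	it->pos = lo;
}

/* Returns 0 if refname exists, -1 if it does not or allocation fails. */
int toy_backend_read_ref(struct toy_backend *be, const char *refname)
{
	if (!be->cached_it) {
		be->cached_it = calloc(1, sizeof(*be->cached_it));
		if (!be->cached_it)
			return -1;
		be->iters_created++;
	}
	toy_iter_seek(be, be->cached_it, refname);
	if (be->cached_it->pos < be->nr &&
	    !strcmp(be->names[be->cached_it->pos], refname))
		return 0;
	return -1;
}
```

In this sketch, any number of reads between two reloads allocates a
single iterator, which is the effect the heap summaries above point to.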

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-26 17:18:38 +09:00
Patrick Steinhardt
96e7cb83b6 refs/reftable: refactor reflog expiry to use reftable backend
Refactor the callback function that expires reflog entries in the
reftable backend to use `reftable_backend_read_ref()` instead of
accessing the reftable stack directly. This ensures that the function
will benefit from the new caching layer that we're about to introduce.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-26 17:18:37 +09:00
Patrick Steinhardt
ad6c41f4b7 refs/reftable: refactor reading symbolic refs to use reftable backend
Refactor the callback function that reads symbolic references in the
reftable backend to use `reftable_backend_read_ref()` instead of
accessing the reftable stack directly. This ensures that the function
will benefit from the new caching layer that we're about to introduce.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-26 17:18:37 +09:00
Patrick Steinhardt
27fdf8f4ed refs/reftable: read references via struct reftable_backend
Refactor `read_ref_without_reload()` to accept `struct reftable_backend`
as parameter instead of `struct reftable_stack`. Rename the function to
`reftable_backend_read_ref()` to clarify its scope and move it close to
other functions operating on `struct reftable_backend`.

This change allows us to implement an additional caching layer when
reading refs where we can reuse reftable iterators.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-26 17:18:37 +09:00
Patrick Steinhardt
3ec8022bb0 refs/reftable: figure out hash via reftable_stack
The function `read_ref_without_reload()` accepts a ref store as input
only so that we can figure out the hash function used by it. This is
duplicate information though because the reftable stack knows about its
hash function, too.

Drop the superfluous parameter to simplify the calling convention a bit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-11-26 17:18:37 +09:00