Commit Graph

32 Commits

Author SHA1 Message Date
Junio C Hamano
67ce50ba26 Merge branch 'ps/reftable-reusable-iterator'
Code clean-up to make the reftable iterator closer to be reusable.

* ps/reftable-reusable-iterator:
  reftable/merged: adapt interface to allow reuse of iterators
  reftable/stack: provide convenience functions to create iterators
  reftable/reader: adapt interface to allow reuse of iterators
  reftable/generic: adapt interface to allow reuse of iterators
  reftable/generic: move seeking of records into the iterator
  reftable/merged: simplify indices for subiterators
  reftable/merged: split up initialization and seeking of records
  reftable/reader: set up the reader when initializing table iterator
  reftable/reader: inline `reader_seek_internal()`
  reftable/reader: separate concerns of table iter and reftable reader
  reftable/reader: unify indexed and linear seeking
  reftable/reader: avoid copying index iterator
  reftable/block: use `size_t` to track restart point index
2024-05-30 14:15:12 -07:00
Patrick Steinhardt
0e7be2b3ea reftable/reader: adapt interface to allow reuse of iterators
Refactor the interfaces exposed by `struct reftable_reader` and `struct
table_iterator` such that they support iterator reuse. This is done by
separating initialization of the iterator and seeking on it.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:18 -07:00
Patrick Steinhardt
5bf96e0c39 reftable/generic: move seeking of records into the iterator
Reftable iterators are created by seeking on the parent structure of a
corresponding record. For example, to create an iterator for the merged
table you would call `reftable_merged_table_seek_ref()`. Most notably,
it is not posible to create an iterator and then seek it afterwards.

While this may be a bit easier to reason about, it comes with two
significant downsides. The first downside is that the logic to find
records is split up between the parent data structure and the iterator
itself. Conceptually, it is more straight forward if all that logic was
contained in a single place, which should be the iterator.

The second and more significant downside is that it is impossible to
reuse iterators for multiple seeks. Whenever you want to look up a
record, you need to re-create the whole infrastructure again, which is
quite a waste of time. Furthermore, it is impossible to optimize seeks,
such as when seeking the same record multiple times.

To address this, we essentially split up the concerns properly such that
the parent data structure is responsible for setting up the iterator via
a new `init_iter()` callback, whereas the iterator handles seeks via a
new `seek()` callback. This will eventually allow us to call `seek()` on
the iterator multiple times, where every iterator can potentially
optimize for certain cases.

Note that at this point in time we are not yet ready to reuse the
iterators. This will be left for a future patch series.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:18 -07:00
Patrick Steinhardt
c82692f755 reftable/reader: set up the reader when initializing table iterator
All the seeking functions accept a `struct reftable_reader` as input
such that they can use the reader to look up the respective blocks.
Refactor the code to instead set up the reader as a member of `struct
table_iter` during initialization such that we don't have to pass the
reader on every single call.

This step is required to move seeking of records into the generic
`struct reftable_iterator` infrastructure.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:17 -07:00
Patrick Steinhardt
f1e3c12196 reftable/reader: inline reader_seek_internal()
We have both `reader_seek()` and `reader_seek_internal()`, where the
former function only exists so that we can exit early in case the given
table has no records of the sought-after type.

Merge these two functions into one.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:17 -07:00
Patrick Steinhardt
81a03a3236 reftable/reader: separate concerns of table iter and reftable reader
In "reftable/reader.c" we implement two different interfaces:

  - The reftable reader contains the logic to read reftables.

  - The table iterator is used to iterate through a single reftable read
    by the reader.

The way those two types are used in the code is somewhat confusing
though because seeking inside a table is implemented as if it was part
of the reftable reader, even though it is ultimately more of a detail
implemented by the table iterator.

Make the boundary between those two types clearer by renaming functions
that seek records in a table such that they clearly belong to the table
iterator's logic.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:17 -07:00
Patrick Steinhardt
dfdd1455bb reftable/reader: unify indexed and linear seeking
In `reader_seek_internal()` we either end up doing an indexed seek when
there is one or a linear seek otherwise. These two code paths are
disjunct without a good reason, where the indexed seek will cause us to
exit early.

Refactor the two code paths such that it becomes possible to share a bit
more code between them.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:16 -07:00
Patrick Steinhardt
9a59b65dba reftable/reader: avoid copying index iterator
When doing an indexed seek we need to walk down the multi-level index
until we finally hit a record of the desired indexed type. This loop
performs a copy of the index iterator on every iteration, which is both
hard to understand and completely unnecessary.

Refactor the code so that we use a single iterator to walk down the
indices, only.

Note that while this should improve performance, the improvement is
negligible in all but the most unreasonable repositories. This is
because the effect is only really noticeable when we have to walk down
many levels of indices, which is not something that a repository would
typically have. So the motivation for this change is really only about
readability.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:04:16 -07:00
Patrick Steinhardt
fcf341890e reftable/dump: support dumping a table's block structure
We're about to introduce new configs that will allow users to have more
control over how exactly reftables are written. To verify that these
configs are effective we will need to take a peak into the actual blocks
written by the reftable backend.

Introduce a new mode to the dumping logic that prints out the block
structure. This logic can be invoked via `test-tool dump-reftables -b`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-13 17:02:38 -07:00
Patrick Steinhardt
ce1f213cc9 reftable/block: reuse zstream state on inflation
When calling `inflateInit()` and `inflate()`, the zlib library will
allocate several data structures for the underlying `zstream` to keep
track of various information. Thus, when inflating repeatedly, it is
possible to optimize memory allocation patterns by reusing the `zstream`
and then calling `inflateReset()` on it to prepare it for the next chunk
of data to inflate.

This is exactly what the reftable code is doing: when iterating through
reflogs we need to potentially inflate many log blocks, but we discard
the `zstream` every single time. Instead, as we reuse the `block_reader`
for each of the blocks anyway, we can initialize the `zstream` once and
then reuse it for subsequent inflations.

Refactor the code to do so, which leads to a significant reduction in
the number of allocations. The following measurements were done when
iterating through 1 million reflog entries. Before:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 23,028 allocs, 22,906 frees, 162,813,552 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 302 allocs, 180 frees, 88,352 bytes allocated

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
Patrick Steinhardt
dd347bbce6 reftable/block: reuse uncompressed blocks
The reftable backend stores reflog entries in a compressed format and
thus needs to uncompress blocks before one can read records from it.
For each reflog block we thus have to allocate an array that we can
decompress the block contents into. This block is being discarded
whenever the table iterator moves to the next block. Consequently, we
reallocate a new array on every block, which is quite wasteful.

Refactor the code to reuse the uncompressed block data when moving the
block reader to a new block. This significantly reduces the number of
allocations when iterating through many compressed blocks. The following
measurements are done with `git reflog list` when listing 100k reflogs.
Before:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 45,755 allocs, 45,633 frees, 254,779,456 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,473 bytes in 122 blocks
    total heap usage: 23,028 allocs, 22,906 frees, 162,813,547 bytes allocated

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
Patrick Steinhardt
b00bcb7c49 reftable/reader: iterate to next block in place
The table iterator has to iterate towards the next block once it has
yielded all records of the current block. This is done by creating a new
table iterator, initializing it to the next block, releasing the old
iterator and then copying over the data.

Refactor the code to instead advance the table iterator in place. This
is simpler and unlocks some optimizations in subsequent patches. Also,
it allows us to avoid some allocations.

The following measurements show a single matching ref out of 1 million
refs. Before this change:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 7,235 allocs, 7,110 frees, 301,481 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 315 allocs, 190 frees, 107,027 bytes allocated

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
Patrick Steinhardt
bcdc586db0 reftable/block: move ownership of block reader into struct table_iter
The table iterator allows the caller to iterate through all records in a
reftable table. To do so it iterates through all blocks of the desired
type one by one, where for each block it creates a new block iterator
and yields all its entries.

One of the things that is somewhat confusing in this context is who owns
the block reader that is being used to read the blocks and pass them to
the block iterator. Intuitively, as the table iterator is responsible
for iterating through the blocks, one would assume that this iterator is
also responsible for managing the lifecycle of the reader. And while it
somewhat is, the block reader is ultimately stored inside of the block
iterator.

Refactor the code such that the block reader is instead fully managed by
the table iterator. Instead of passing the reader to the block iterator,
we now only end up passing the block data to it. Despite clearing up the
lifecycle of the reader, it will also allow for better reuse of the
reader in subsequent patches.

The following benchmark prints a single matching ref out of 1 million
refs. Before:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 6,607 allocs, 6,482 frees, 509,635 bytes allocated

After:

  HEAP SUMMARY:
      in use at exit: 13,603 bytes in 125 blocks
    total heap usage: 7,235 allocs, 7,110 frees, 301,481 bytes allocated

Note that while there are more allocation and free calls now, the
overall number of bytes allocated is significantly lower. The number of
allocations will be reduced significantly by the next patch though.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
Patrick Steinhardt
b371221a60 reftable/block: introduce block_reader_release()
Introduce a new function `block_reader_release()` that releases
resources acquired by the block reader. This function will be extended
in a subsequent commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
Patrick Steinhardt
42c7bdc36d reftable/block: merge block_iter_seek() and block_reader_seek()
The function `block_iter_seek()` is merely a simple wrapper around
`block_reader_seek()`. Merge those two functions into a new function
`block_iter_seek_key()` that more clearly says what it is actually
doing.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
Patrick Steinhardt
3122d44025 reftable/block: rename block_reader_start()
The function `block_reader_start()` does not really apply to the block
reader, but to the block iterator. It's name is thus somewhat confusing.
Rename it to `block_iter_seek_start()` to clarify.

We will rename `block_reader_seek()` in similar spirit in the next
commit.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-15 10:36:09 -07:00
Junio C Hamano
9f67cbd0a7 Merge branch 'ps/reftable-iteration-perf'
The code to iterate over refs with the reftable backend has seen
some optimization.

* ps/reftable-iteration-perf:
  reftable/reader: add comments to `table_iter_next()`
  reftable/record: don't try to reallocate ref record name
  reftable/block: swap buffers instead of copying
  reftable/pq: allocation-less comparison of entry keys
  reftable/merged: skip comparison for records of the same subiter
  reftable/merged: allocation-less dropping of shadowed records
  reftable/record: introduce function to compare records by key
2024-02-26 18:10:24 -08:00
Junio C Hamano
f424d7c33d Merge branch 'ps/reftable-styles'
Code clean-up in various reftable code paths.

* ps/reftable-styles:
  reftable/record: improve semantics when initializing records
  reftable/merged: refactor initialization of iterators
  reftable/merged: refactor seeking of records
  reftable/stack: use `size_t` to track stack length
  reftable/stack: use `size_t` to track stack slices during compaction
  reftable/stack: index segments with `size_t`
  reftable/stack: fix parameter validation when compacting range
  reftable: introduce macros to allocate arrays
  reftable: introduce macros to grow arrays
2024-02-12 13:16:10 -08:00
Patrick Steinhardt
c68ca7abd3 reftable/reader: add comments to table_iter_next()
While working on the optimizations in the preceding patches I stumbled
upon `table_iter_next()` multiple times. It is quite easy to miss the
fact that we don't call `table_iter_next_in_block()` twice, but that the
second call is in fact `table_iter_next_block()`.

Add comments to explain what exactly is going on here to make things
more obvious. While at it, touch up the code to conform to our code
style better.

Note that one of the refactorings merges two conditional blocks into
one. Before, we had the following code:

```
err = table_iter_next_block(&next, ti);
if (err != 0) {
	ti->is_finished = 1;
}
table_iter_block_done(ti);
if (err != 0) {
	return err;
}
```

As `table_iter_block_done()` does not care about `is_finished`, the
conditional blocks can be merged into one block:

```
err = table_iter_next_block(&next, ti);
table_iter_block_done(ti);
if (err != 0) {
	ti->is_finished = 1;
	return err;
}
```

This is both easier to reason about and more performant because we have
one branch less.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-12 09:19:27 -08:00
Patrick Steinhardt
3ddef475d0 reftable/record: improve semantics when initializing records
According to our usual coding style, the `reftable_new_record()`
function would indicate that it is allocating a new record. This is not
the case though as the function merely initializes records without
allocating any memory.

Replace `reftable_new_record()` with a new `reftable_record_init()`
function that takes a record pointer as input and initializes it
accordingly.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-06 12:10:09 -08:00
Patrick Steinhardt
b4ff12c8ee reftable: introduce macros to allocate arrays
Similar to the preceding commit, let's carry over macros to allocate
arrays with `REFTABLE_ALLOC_ARRAY()` and `REFTABLE_CALLOC_ARRAY()`. This
requires us to change the signature of `reftable_calloc()`, which only
takes a single argument right now and thus puts the burden on the caller
to calculate the final array's size. This is a net improvement though as
it means that we can now provide proper overflow checks when multiplying
the array size with the member size.

Convert callsites of `reftable_calloc()` to the new signature and start
using the new macros where possible.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-06 12:10:08 -08:00
Patrick Steinhardt
4950acae7d reftable: document reading and writing indices
The way the index gets written and read is not trivial at all and
requires the reader to piece together a bunch of parts to figure out how
it works. Add some documentation to hopefully make this easier to
understand for the next reader.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-01 11:11:33 -08:00
Patrick Steinhardt
d55fc5128b reftable/reader: be more careful about errors in indexed seeks
When doing an indexed seek we first need to do a linear seek in order to
find the index block for our wanted key. We do not check the returned
error of the linear seek though. This is likely not an issue because the
next call to `table_iter_next()` would return error, too. But it very
much is a code smell when an error variable is being assigned to without
actually checking it.

Safeguard the code by checking for errors.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-01 11:11:32 -08:00
Junio C Hamano
492ee03f60 Merge branch 'en/header-cleanup'
Remove unused header "#include".

* en/header-cleanup:
  treewide: remove unnecessary includes in source files
  treewide: add direct includes currently only pulled in transitively
  trace2/tr2_tls.h: remove unnecessary include
  submodule-config.h: remove unnecessary include
  pkt-line.h: remove unnecessary include
  line-log.h: remove unnecessary include
  http.h: remove unnecessary include
  fsmonitor--daemon.h: remove unnecessary includes
  blame.h: remove unnecessary includes
  archive.h: remove unnecessary include
  treewide: remove unnecessary includes in source files
  treewide: remove unnecessary includes from header files
2024-01-08 14:05:15 -08:00
Elijah Newren
eea0e59ffb treewide: remove unnecessary includes in source files
Each of these were checked with
   gcc -E -I. ${SOURCE_FILE} | grep ${HEADER_FILE}
to ensure that removing the direct inclusion of the header actually
resulted in that header no longer being included at all (i.e. that
no other header pulled it in transitively).

...except for a few cases where we verified that although the header
was brought in transitively, nothing from it was directly used in
that source file.  These cases were:
  * builtin/credential-cache.c
  * builtin/pull.c
  * builtin/send-pack.c

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-26 12:04:31 -08:00
Patrick Steinhardt
a8305bc6d8 reftable/block: introduce macro to initialize struct block_iter
There are a bunch of locations where we initialize members of `struct
block_iter`, which makes it harder than necessary to expand this struct
to have additional members. Unify the logic via a new `BLOCK_ITER_INIT`
macro that initializes all members.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-11 07:23:17 -08:00
Jeff King
21a40847ed reftable: drop unused parameter from reader_seek_linear()
The reader code passes around a "struct reftable_reader" context
variable. But the seek function doesn't need it; the table iterator we
already get is sufficient.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-08-20 14:14:55 -07:00
Han-Wen Nienhuys
eff5832ba1 reftable: reject 0 object_id_len
The spec says 2 <= object_id_len <= 31. We are lenient and allow 1,
but we forbid 0, so we can be sure that we never read a 0-length key.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-23 13:36:26 -08:00
Han-Wen Nienhuys
66c0dabab5 reftable: make reftable_record a tagged union
This reduces the amount of glue code, because we don't need a void
pointer or vtable within the structure.

The only snag is that reftable_index_record contain a strbuf, so it
cannot be zero-initialized. To address this, use reftable_new_record()
to return fresh instance, given a record type. Since
reftable_new_record() doesn't cause heap allocation anymore, it should
be balanced with reftable_record_release() rather than
reftable_record_destroy().

Thanks to Peff for the suggestion.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-20 11:31:53 -08:00
Han-Wen Nienhuys
33e9224320 reftable: all xxx_free() functions accept NULL arguments
This fixes NULL derefs in error paths. Spotted by Coverity.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-20 11:31:52 -08:00
Han-Wen Nienhuys
24d4d38c0b reftable: fix resource leak in block.c error path
Add test coverage for corrupt zlib data. Fix memory leaks demonstrated by
unittest.

This problem was discovered by a Coverity scan.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-20 11:31:52 -08:00
Han-Wen Nienhuys
46bc0e731a reftable: read reftable files
This supports reading a single reftable file.

The commit introduces an abstract iterator type, which captures the usecases
both of reading individual refs, and iterating over a segment of the ref
namespace.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-08 10:45:48 -07:00