State Tree Pruning | Ethereum Foundation Blog

One of the key issues raised during the release of Olympic Stress Net is the large amount of data that clients need to store. After just over three months of operation, and especially over the last month, the amount of data in each Ethereum client’s blockchain folder has grown to between 10 and 40 gigabytes, depending on the client in use and whether compression is enabled. Ta. . It is important to note that this is indeed a stress testing scenario, but users are incentivized to only pay free test ether as transaction fees and dump their transactions onto the blockchain, thus The throughput level is several times that of Bitcoin. Nevertheless, it’s a legitimate concern for users who often don’t have hundreds of gigabytes to spare to store other people’s transaction history.

Let’s start by exploring why the current Ethereum client database is so large. Unlike Bitcoin, Ethereum has the property that every block contains what is called a “state root”, or state root hash. special type of merkle tree It stores the entire system state. All account balances, contract storage, contract codes, and account nonces are inside.

The purpose is simple. This allows a node, given only the last block, to very quickly update the blockchain by simply downloading it, without having to process any historical transactions, with some guarantee that that last block is actually the most recent block. You will be able to “synchronize” with. The rest of the tree (proposed) from the nodes in the network hash lookup wire protocol message (to make this easier), verify that the tree is correct by making sure all the hashes match, and proceed from there. In a fully decentralized context, this would likely be done through an advanced version of Bitcoin’s header-first validation strategy, which would roughly look like this:

  1. Download as many block headers as are available to the client.
  2. Determine the header at the end of the longest chain. Starting from that header, go back 100 blocks to be safe and call the block at that position P.100(H) (“Head of the family’s 100th generation of grandparents”)
  3. Download the state tree from the state root of P.100(H), using hash lookup opcode (note that after the first round or two, this can be parallelized across as many peers as needed). Make sure all parts of the tree match.
  4. From there, proceed as usual.

For light clients, state roots are even more advantageous. Instantly determine the exact balance and status of your account by simply making a request to the network. specific branch There is no need to follow Bitcoin’s multi-step 1-of-N “request all transaction outputs, then request all transactions that consume those outputs, and get the rest” light client model.

However, this state tree mechanism has significant drawbacks when implemented simply. Intermediate nodes in the tree significantly increase the amount of disk space required to store all your data. To see why, consider the following diagram.

Changes to the tree during an individual block are fairly small, and the appeal of the tree as a data structure is that most data can simply be referenced twice without being copied. However, you still need to save logarithmically many nodes (i.e. up to 5 for 1000 nodes, up to 10 for 1000000 nodes, and up to 15 for 1000000000 nodes) twice in one version every time a change is made to the state. there is. One version for the old tree and one for the new try. Eventually, a node will process all the blocks, so in computer science terms, you can expect the total disk space utilization to be approximately: O(n*log(n))where n transaction load. In reality, the Ethereum blockchain is only 1.3 gigabytes, but the database size with all these additional nodes is between 10 and 40 gigabytes.

So what can you do? One backwards fix is ​​to simply implement header-first synchronization. This basically resets the hard disk consumption to zero for new users and allows the user to keep the hard disk consumption low by resynchronizing her every 1-2 months. A rather ugly solution. An alternative approach is to implement it like this: Pruning the state tree: Basically, use reference count A node in a tree (I use the computer science term “node” here to mean “a piece of data somewhere in a graph or tree structure” rather than “a computer on a network”) is a tree. to track when it drops out of the node, and at that point the node is placed on “death row”. However, unless that node is somehow used again within the next period, X Blocks (e.g. X = 5000), after that number of blocks, the node must be permanently removed from the database. Basically, it stores tree nodes that are part of the current state, and also stores recent history, but it doesn’t store history older than 5000 blocks.

X It should be set as low as possible to save space, but X If it is too low, robustness will be compromised. Once this technique is implemented, nodes cannot be undone beyond the following ranges: X It basically blocks the sync without fully restarting it. Now let’s see how to fully implement this approach, taking into account all special cases.

  1. When processing numbered blocks N, Keep track of all nodes (in the state,tree, and receive tree) whose reference count becomes zero. We put the hashes of these nodes into a “death row” database of some kind of data structure, so that we can later call up the list by block number (specifically block number). N+X), marks the node database entry itself as worthy of deletion in the block. N+X.
  2. If a node on death row comes back to life (a practical example of this is when account A gets a certain balance/nonce/code/storage combination) fthen switch to another value gthen account B gets the state f On the other hand, the node f is on death row), set the reference count back to 1.If that node is removed again in a future block M (and M > N), put it back on death row in a future block and delete it in the block M+X.
  3. Once the processing block is reached N+Xrecall the list of hashes logged back during the block. N. Check the nodes associated with each hash.If the node is still marked for deletion during that particular block (i.e. not restored and, importantly, marked for deletion again without being restored) later), erase. It also deletes the list of hashes in the death row database.
  4. In some cases, the new head of the chain is not on top of the previous one and the block needs to be moved back. In such cases, a journal of all changes to the reference count must be kept in the database (this is a “journal” in the following cases): journaling file system; essentially an ordered list of changes made). When reverting a block, remove the death row list generated when that block was created and undo changes made according to the journal (delete the journal when done).
  5. When processing a block, delete the block’s journal N-X; more cannot be undone X Since it’s a block anyway, you don’t need a journal (keeping a journal would actually defeat the whole purpose of pruning).

Once this is complete, the database will only contain state nodes associated with the last node. X Because they are blocks, you can get all the information you need from those blocks, but no more. In addition to this, further optimizations are performed.especially after X Blocks, transactions, and receipt trees should be completely deleted, and perhaps even blocks could be deleted as well, but there should be an “archive” that stores absolutely everything so that the rest of the network can retrieve it. There is also an important argument to keep a subset of the data you need.

So how much will this save you? As it turns out, it’s quite a lot! Especially if you choose the ultimate daredevil route. X = 0 (i.e. the ability to handle even single block forks is completely lost and no history is stored at all), in which case the database size is essentially the size of the state.So it’s still this value (this data was captured in block 670000) is about 40 megabytes, most of which is Account like this If your storage slots are full to intentionally spam your X = 100000Most of the increase will occur in the last 100,000 blocks, and the additional space required to store the journal and death row list will make up for the remainder of the delta, resulting in essentially a current size of 10 to 40 gigabytes . For all values ​​in between, you can expect disk capacity growth to be linear (i.e. X = 10000 That way you’ll be close to zero about 90% of the way.)

Note that you may want to pursue a hybrid strategy. block but not all state tree node; In this case, you will need to add approximately 1.4 GB to store block data. It is important to note that the size of a blockchain is not caused by the speed of block times. Block headers from the past three months currently account for about 300 megabytes, with the rest being transactions from the past month, so we expect transactions to continue to dominate when usage is high. That said, for light clients to survive low memory situations, they also need to prune block headers.

The strategy described above was implemented in a very early alpha version. Pais; This storage bloat is a medium-term issue, not a short-term scalability issue, and will be properly implemented for all clients at an appropriate time after Frontier is released.

Related Article


Leave a Comment