
Breaking 10 Million TPS: The Hardware-Software Co-Design Behind Line-Speed SigVerify

Written by Eclipse Labs
Published on June 20, 2025

Our signature-verification engine shatters expectations, hitting nearly 4 million TPS on an off-the-shelf desktop and exceeding 9 million TPS on a commodity server, all while comfortably saturating common network links (10 Gbps—and even 65 Gbps—of inbound traffic). By pushing verification throughput beyond line-speed, we effectively eliminate signature checking as a bottleneck for high-performance blockchains like Eclipse, Solana, and beyond.

Introduction

Signature verification is an essential part of every public blockchain. Blockchains such as Eclipse, Solana, and Ethereum rely on public-key cryptography to authenticate transactions. Every time you or anyone else interacts with the blockchain, the first question is whether the transaction's signature is valid.

Every transaction, no matter how small, needs to be signed by the user, and every L1 and L2 (and L3) blockchain needs to verify the signature on its end. Often, if the transactions are simple (e.g. transfers), the amount of time spent verifying your signature is greater than the amount of time it takes to process the transaction itself.

We need a metric for how fast a blockchain such as Eclipse is at verifying signatures. We define one by introducing the notion of line-speed processing.

Line-Speed Processing

We start with the question: how fast is fast enough? Saying we want to process one million TPS is all well and good, until you realize that it might not even be possible to ingest that many transactions from the network. The network speed is therefore where we need to start.

Figure 1: the rough outline of line rate standards and the years of their introduction. Note that the existence of a standard does not imply the existence of an implementation.

The more useful question is how many transactions (and therefore signatures) can theoretically arrive. A Solana transaction is typically between 176 and 1,232 bytes. We also know how many bytes can arrive at maximum load: as many as the network hardware can support, which is usually one of the line rate standards listed in Figure 1. So it's either 1 Gbps or 10 Gbps or, in an extreme case, 1.6 Tbps. These, of course, are the fastest networks typically used in a data center context. From these network speeds we can work out an upper limit on the number of signatures per second.

Each Solana transaction, at a minimum, comes with one signature, touches one account, encodes a recent block hash, and contains one instruction. This adds up to approximately 176 bytes, found by summing the minimal lengths of all objects that must be present in even a minimal (i.e. no-op) transaction. At one million TPS this is slightly more than 1 Gbps of traffic; it represents, for example, a wave of arbitrage bots trying to sell a rapidly depreciating token.

For the upper bound, suppose for simplicity (and because this number is close to the upper limit of 1,232 bytes) that each message is around 1,024 bytes. At one million TPS this gives approximately one gigabyte per second of traffic, or around 8 Gbps. This is a number to keep in mind when evaluating resistance to DoS (denial of service) attacks.

💡 The message headers (48 bytes for UDP over IPv6) plus the data must be no more than the IPv6 minimum MTU, which is 1,280 bytes.

Consequently, a constant stream of transactions at 1 Gbps works out to between 101,461 and 710,228 TPS; 10 Gbps gives us roughly 1 to 7 million TPS, 100 Gbps gives us 10 to 70 million, and so on.
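
For concreteness, here is a small Rust sketch of that arithmetic (the function and constants are ours, using the minimal and maximal Solana transaction sizes discussed above); it reproduces the bounds quoted above up to rounding:

```rust
/// Rough bounds on the transactions per second a given line rate can deliver,
/// using the minimal (176-byte) and maximal (1,232-byte) Solana transaction sizes.
fn tps_bounds(line_rate_gbps: f64) -> (f64, f64) {
    const MIN_TX_BYTES: f64 = 176.0;
    const MAX_TX_BYTES: f64 = 1_232.0;
    let bytes_per_second = line_rate_gbps * 1e9 / 8.0;
    // Large transactions give the lower bound on TPS, small ones the upper bound.
    (bytes_per_second / MAX_TX_BYTES, bytes_per_second / MIN_TX_BYTES)
}

fn main() {
    for gbps in [1.0, 10.0, 100.0] {
        let (lo, hi) = tps_bounds(gbps);
        println!("{gbps:>5} Gbps: {lo:>12.0} .. {hi:>12.0} TPS");
    }
}
```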

Hardware-Software Co-Design

A wide variety of recent work showcases how to combine clever algorithmic ideas from cryptography with more esoteric hardware choices. Specifically, we have seen projects that use SIMD processing, GPUs, FPGAs, and ASICs for ed25519 signing and verification [Zhang 2024, De Smet 2024, Banerjee 2024, Gao 2024]. We describe some of the most up-to-date references below, under Related work.

In terms of raw performance, an ASIC would have the best throughput and energy efficiency. Banerjee et al. [Banerjee 2024] showcase an ASIC implementation that is optimized for single-signature verification latency but also demonstrates good throughput. The main drawbacks of ASICs are cost and the difficulty of fixing security vulnerabilities once the hardware is made.

Related work suggests that GPUs and FPGAs are not clearly superior to one another; both are excellent options with a proven track record in blockchains, high-frequency trading, scientific computing, and AI.

SIMD was designed to process large amounts of data for scientific computing, increasing throughput through data-parallel processing. It is also widely available, being standard on consumer- and server-grade hardware.

In contrast, multi-threading provides at best linear scaling with the number of cores, often scales significantly worse than that, and requires a great deal of care to execute correctly. Note that SIMD and general-purpose CPU implementations are orthogonal to multi-threading and can be combined for this specific problem. However, given the choice between wider SIMD and more threads, we would prefer a machine with wider registers over one with more physical cores, in part because SIMD avoids the overhead of context switching and OS scheduling.

Later sections will discuss which other types of hardware we could use and how, but for the moment, let us focus specifically on vector extensions and how to best utilize them.

SIMD: Architecture

Our signature verification library is modeled after the standards established by curve25519-dalek, the elliptic curve library currently used by Solana and the Agave validator. Our implementation must pass all of their tests before it goes into production.

However, we have used the from-scratch nature of our implementation to relax some of the constraints. Side-channel protection matters much more for signing than it does for verification. Optimizing for throughput may lead to worse latencies for small batches, which we can afford to tolerate. Furthermore, we have completely relaxed the constraints on memory usage, which let us introduce caching in a few key areas and resulted in significant speedups.

Data layout

Our implementation uses a completely different data structure to store a batch of signatures. Instead of storing the full signature in one vector (AVX-512) register and the public key in another, we store the data transposed: a part of the public key, a part of the message, and a part of the signature each occupy different lanes.

Figure 2: A schematic representation of the layout of data in SIMD.

This lets us describe operations chunk by chunk within one vector register, and gives us speed comparable to curve25519-dalek for Edwards point operations.
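
As a rough illustration of the transposed layout (the batch size of eight and the types below are illustrative, not our production data structures), each 8-byte chunk of eight signatures is gathered into one row so that a row maps directly onto a single 512-bit register:

```rust
/// Illustrative sketch of the transposed ("structure of arrays") layout.
struct SignatureBatch8 {
    // rows[i][lane] holds the i-th 8-byte chunk of signature `lane`
    // (an ed25519 signature is 64 bytes, i.e. eight such chunks).
    rows: [[u64; 8]; 8],
}

fn transpose(signatures: &[[u8; 64]; 8]) -> SignatureBatch8 {
    let mut rows = [[0u64; 8]; 8];
    for (lane, sig) in signatures.iter().enumerate() {
        for (i, chunk) in sig.chunks_exact(8).enumerate() {
            rows[i][lane] = u64::from_le_bytes(chunk.try_into().unwrap());
        }
    }
    SignatureBatch8 { rows }
}
```

With this layout, one vector instruction applied to a row advances the same step of eight independent verifications at once.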

Furthermore, we implemented a SHA-512 hash function leveraging SIMD. This allows us to verify signatures for larger messages with minimal extra cost. More on that in part 2.

Another optimization, essentially standard across performance-tuned implementations, is the use of 51-bit limbs to represent the segments of field elements. The main advantage is that we can use the fused multiply-add hardware designed for 64-bit (double-precision) floating-point numbers. Additionally, carry propagation becomes simpler: we know exactly which operations cannot produce a carry.
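
A minimal sketch of the 51-bit-limb representation and its carry propagation, assuming the usual radix-$2^{51}$ encoding of a field element modulo $2^{255} - 19$ (the names are illustrative, not our library's API):

```rust
/// Illustrative only: a field element modulo 2^255 - 19 stored as five 51-bit
/// limbs, so that value = limbs[0] + limbs[1]*2^51 + ... + limbs[4]*2^204.
#[derive(Clone, Copy)]
struct Fe51([u64; 5]);

const LOW_51_BITS: u64 = (1 << 51) - 1;

impl Fe51 {
    /// Propagate carries so that every limb fits back into 51 bits.
    /// Because each limb leaves 13 bits of headroom in a 64-bit word, we know
    /// exactly which operations can or cannot produce a carry.
    fn carry_propagate(mut self) -> Self {
        let mut carry = 0u64;
        for limb in self.0.iter_mut() {
            let v = *limb + carry;
            *limb = v & LOW_51_BITS;
            carry = v >> 51;
        }
        // The carry out of the top limb wraps around multiplied by 19,
        // since 2^255 is congruent to 19 modulo 2^255 - 19.
        self.0[0] += carry * 19;
        self
    }
}
```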

Another optimization is the pervasive use of caching. One reason this is not done in e.g. curve25519-dalek is side-channel resistance of the verification process: caching this information in memory would largely defeat the purpose of the clever engineering that went into dalek. In our case, however, we cannot leak anything, because the information is already available over the RPC; everything you could learn by reading the caches of intermediate computations, you could obtain simply by running the same computation on the same inputs. For us, this caching nearly doubles the throughput of certain operations.
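
Conceptually, the cache is just a lookup table keyed by public data. A hedged sketch of the idea (the Precomputed type and function names are placeholders, not our actual implementation):

```rust
use std::collections::HashMap;

/// Placeholder for whatever per-key data is expensive to recompute,
/// e.g. a decompressed point or a table of small multiples.
struct Precomputed {/* ... */}

fn precompute(pubkey: &[u8; 32]) -> Precomputed {
    // Stand-in for the real decompression / table-building work.
    let _ = pubkey;
    Precomputed {}
}

/// Verification-side inputs are public anyway, so keeping derived data in a
/// table leaks nothing that an observer could not recompute from the RPC.
struct PubkeyCache {
    table: HashMap<[u8; 32], Precomputed>,
}

impl PubkeyCache {
    fn get_or_insert(&mut self, pubkey: [u8; 32]) -> &Precomputed {
        self.table.entry(pubkey).or_insert_with(|| precompute(&pubkey))
    }
}
```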

Implementation details

The library itself is independent of any other Rust crate and can be compiled in no_std environments. It uses no procedural macros and exposes a C-ABI interface. These were deliberate design decisions to avoid the dependency-hell issues plaguing Rust crypto; this bottom-up, minimal-dependency approach is a key style choice.

It is usually not a good idea to rewrite cryptographic libraries. However, a careful examination of the external factors that led to specific design decisions, an intentional rewriting of many of the internals, and judicious sanity-checking via tests and formal verification can often lead to a better outcome than an off-the-shelf component.

Case in point: we could have used the standard unroll crate, but chose not to for several reasons. We did not want to introduce a dependency on syn and quote, which, by the nature of procedural macros, slow down Rust compilation even though they are not themselves hard to compile. Another reason was that unroll did something subtly different from what we wanted, resulting in less efficient disassembly due to bounds checks that we could otherwise eliminate.
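
The effect we were after can be illustrated with plain Rust: when the array lengths and the loop bound are compile-time constants, the compiler both unrolls the loop and proves every index in bounds, so no runtime checks remain. A small sketch (illustrative, not our actual code):

```rust
/// Operating on fixed-size arrays lets the compiler prove that every index is
/// in bounds; rustc/LLVM fully unrolls this loop into straight-line code.
/// An equivalent loop over `&[u64]` slices would not get the same guarantee.
#[inline(always)]
fn add_limbs(a: &[u64; 5], b: &[u64; 5]) -> [u64; 5] {
    let mut out = [0u64; 5];
    for i in 0..5 {
        out[i] = a[i] + b[i];
    }
    out
}
```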

Finally, we make heavy use of Rust's unstable portable SIMD feature. We have identified several issues with its implementation and have worked with the Rust and LLVM teams to resolve them on ARM, particularly around instructions that accelerate the SHA family of hash computations.
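
For readers unfamiliar with portable SIMD, here is a minimal sketch of the style of code involved (this needs the nightly-only `portable_simd` feature; the function is illustrative rather than taken from our library):

```rust
#![feature(portable_simd)]
use std::simd::Simd;

/// One 512-bit vector holds the same limb of eight different field elements,
/// and a single lane-wise operation advances all eight verifications at once.
fn add_lanes(a: [u64; 8], b: [u64; 8]) -> [u64; 8] {
    let va = Simd::<u64, 8>::from_array(a);
    let vb = Simd::<u64, 8>::from_array(b);
    // On x86-64 with AVX-512 this compiles to a single vpaddq; on narrower
    // targets the compiler splits it into several smaller vector adds.
    (va + vb).to_array()
}
```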

Experimental results

Test hardware

We present two main results on the AMD Zen 5 architecture. Our first machine is a RYZEN 9 9950x consumer-grade workstation. It is a reasonably priced machine with enough processing power to handle a load equivalent to a 1 Gbps line saturated with transactions (1,000,000 TPS). It is also a machine that could be added to a working server as a co-processor to handle the extra load. This specific hardware configuration is common by Solana L1 validator standards.

Our benchmarks use a mix of public keys that mirrors real-world traffic on Eclipse and Solana. Drawing on our performance thesis, we can apply simple statistical techniques to identify “hot” keys in advance and pre-load them into cache, reducing contention at runtime.
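
A hedged sketch of what such a pre-loading pass might look like (the window, the top-N cut-off, and the types are illustrative, not the production pipeline):

```rust
use std::collections::HashMap;

/// Count how often each public key appears in a window of recent traffic and
/// return the keys worth pre-loading into the verification cache.
fn hot_keys(recent_pubkeys: &[[u8; 32]], top_n: usize) -> Vec<[u8; 32]> {
    let mut counts: HashMap<[u8; 32], u64> = HashMap::new();
    for key in recent_pubkeys {
        *counts.entry(*key).or_insert(0) += 1;
    }
    let mut by_freq: Vec<_> = counts.into_iter().collect();
    by_freq.sort_unstable_by_key(|&(_, n)| std::cmp::Reverse(n));
    by_freq.into_iter().take(top_n).map(|(k, _)| k).collect()
}
```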

As shown in Figures 3a and 3b, we achieve a throughput close to 1M TPS on the RYZEN desktop machine. Below we dissect the performance as we grow the number of lanes and threads.

Machine | CPU | Cores / Threads | RAM | Networking speed | Storage | OS | Kernel
Desktop | RYZEN 9 9950x | 16 / 32 | 192 GiB | 10 Gbps | PCI-E Gen 5 | Arch Linux | 6.15.3
Server | EPYC 9575F | 64 / 128 | 512 GiB | 40 Gbps | PCI-E Gen 4 | Ubuntu | 6.11.1

Scaling with the number of SIMD lanes

We begin by considering the efficiency of the SIMD implementation. Specifically, we observe in Figure 3 that, regardless of the number of threads allocated, throughput scales linearly with the number of SIMD lanes.

Each lane on each CPU core contributes a throughput of 40 MB/s, or around 40,000 TPS; this is what we call perfect scaling, shown in Figure 3 as a solid black line. Figure 3a shows this scaling for a single software thread, while Figure 3b shows the same scaling with all 16 cores active, demonstrating a roughly 16-fold increase in throughput.

Figure 3a: Scaling of TPS with the number of lanes on one RYZEN 9 9950x thread

Figure 3b: Scaling of TPS with the number of lanes on all RYZEN 9 9950x threads

Scaling with Multiple Threads

How well our library scales with multiple threads is less straightforward than the SIMD lane scaling. There isn't a one-to-one mapping between physical CPU cores and software threads, and the operating system (Linux) is likely to assign some of the physical cores to kernel-mode work during part of the execution. We therefore expect throughput to scale linearly with the number of threads up to the number of hardware threads, $2 \times 16$ because of SMT. Beyond that point, depending on the scheduler, we expect either a plateau or even a degradation in throughput as the number of software threads grows.
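
For intuition, here is a minimal std-only sketch of spreading a batch across software threads (verify_chunk is a placeholder for the real SIMD batch verifier; the actual scheduling is more involved):

```rust
use std::thread;

/// Placeholder for one batch-verification call on a contiguous chunk of signatures.
fn verify_chunk(chunk: &[[u8; 64]]) -> usize {
    chunk.len() // stand-in: pretend every signature in the chunk verified
}

/// Split the work across roughly one thread per hardware thread
/// (2 x 16 on the 9950x thanks to SMT).
fn verify_parallel(signatures: &[[u8; 64]]) -> usize {
    let n_threads = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let chunk_len = signatures.len().div_ceil(n_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = signatures
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || verify_chunk(chunk)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}
```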

This is exactly the behavior demonstrated in Figure 4.

Assuming perfect scaling, we expect a maximum throughput of $8 \times 16 \times 40,000 = 5,120,000$ TPS. The actual maximum throughput obtained is approximately 25% lower, because the operating system is doing a certain amount of work, and because all CPU cores working simultaneously produce more heat than a small subset of them would. This heat affects the whole CPU, reducing the maximum clock speed and therefore per-core performance.

Figure 4: RYZEN 9950x. Scaling of TPS with the number of threads for a given number of lanes on RYZEN9 9950x. After saturating the physical cores, the throughput stabilizes. The operating system’s interference is minimal.

Figure 5: The scaling of RYZEN 9 9950x as compared to the scaling of EPYC 9575F in terms of TPS.

Figure 6: The scaling of RYZEN 9 9950x as compared to the scaling of EPYC 9575F in terms of data processing rate.

Beyond the 5M TPS Wall

The theoretical maximum throughput assuming perfect scaling is 5 million TPS. To scale beyond that, one needs different hardware. A machine with more cores and a similar architecture (specifically, the same level of AVX-512 support) should be able to continue to scale. For example, the AMD EPYC 9575F has four times the CPU cores of the RYZEN chip and is otherwise nearly identical in architecture, so we expect roughly four times the throughput.

💡 That wasn’t always the case. Early Zen desktop chips were monolithic, while the server CPUs adopted what is now the standard chiplet architecture, in which a CPU is composed of several smaller dies connected by a fast interconnect known as AMD Infinity Fabric.

Figure 5 shows the EPYC server's throughput relative to the RYZEN desktop. It easily surpasses the 5 million TPS barrier, the theoretical maximum transaction processing rate for the desktop. A positive surprise is that the 64-core EPYC machine scales beyond the expected quadrupling of the 16-core RYZEN's throughput, meaning that a single EPYC core can do slightly more work than a single RYZEN core.

Figure 6 demonstrates a similar picture in terms of data throughput. The EPYC server can handle peak data rates of 20 Gbps and 30 Gbps without much issue and without allocating many threads. With the optimal thread allocation it can handle 50 Gbps traffic reliably and demonstrates peak throughput of 65 Gbps.

Line-speed processing

The final question we return to is whether we have achieved line-speed processing. Figure 5 shows that, for messages of 1,024 bytes, the server-grade machine reaches almost 70 Gbps of processing, at 69.3 Gbps. We also achieve 20 Gbps line-rate processing by allocating 128 of the available 1,024 processing lanes, which is roughly 8 of the 64 physical cores (counting both SMT threads per core), each operating with the recommended 8-lane SIMD. The rest of the cores remain free for transaction execution and other activities.

The desktop chip also demonstrates impressive processing capability: it handles 20 Gbps at 256 lanes, which corresponds to 32 threads. We see some scaling beyond that point, likely because memory access limits the cache lookups, so additional software threads lead to better memory-controller utilization.
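
Using the rough figure of 40,000 TPS per lane measured earlier, one can estimate how many lanes (and physical cores at 8 lanes each) a given line rate requires. A back-of-the-envelope sketch (the constants and names are ours):

```rust
/// Rough estimate of the SIMD lanes (and physical cores at 8 lanes each)
/// needed to keep up with a given line rate, assuming ~40,000 TPS per lane.
fn lanes_for_line_rate(gbps: f64, avg_tx_bytes: f64) -> (u64, u64) {
    const TPS_PER_LANE: f64 = 40_000.0;
    const LANES_PER_CORE: u64 = 8;
    let tps = gbps * 1e9 / 8.0 / avg_tx_bytes;
    let lanes = (tps / TPS_PER_LANE).ceil() as u64;
    (lanes, lanes.div_ceil(LANES_PER_CORE))
}
// For example, 20 Gbps of 1,024-byte transactions is roughly 2.44M TPS,
// i.e. about 62 lanes or 8 physical cores, consistent with the allocation
// described above.
```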

Summary

We summarize the high-level performance characteristics in the table below.

Machine | Desktop | Server
Peak TPS | 3.7 M | 9.0 M
Peak data processing rate | 30.2 Gbps | 69.3 Gbps
75th percentile latency | 0.25 ms | 0.19 ms
Median latency | 0.11 ms | 0.11 ms

Discussion

Signature schemes before ed25519

While Bitcoin is considered the first major blockchain, digital signatures existed long before. In fact, digital signature schemes are as plentiful and diverse as they are old.

The RSA algorithm has a long history and widespread use. It is based on the assumption that prime factorization is a hard problem: computing the private key from a public key is not impossible, merely impractical. Most RSA keys today are between 1,024 and 4,096 bits long (although you really should not use fewer than 2,048), which is far longer than the largest RSA number known to have been factored (829 bits).

Still, quantum computers can in principle efficiently crack RSA. Outside such exotic hardware, there are more practical ways of defeating it: faulty key-generation libraries, as was discovered in October 2017; timing attacks; and pseudo-random number generators that are tricky to implement correctly. While many systems still rely on RSA, the general recommendation is to avoid it. It was excellent when it came out, but we have had since 1977 to find flaws and ways to break it, so the fact that we have found so many should not come as a surprise. Though this is not an indictment (a long-enough RSA key still offers decent security), there was a desire for better encryption.

The big leap forward was the introduction of elliptic curves, which offered stronger protection and were generally faster. What do we mean by that? The level of security depends on a number of factors, chief among them the key length. Most conventional methods of cracking RSA can be defeated by using a longer key, but a longer key means more data to store and a slower verification process. Elliptic curves offer the same level of resistance to cracking with significantly shorter keys.

Side channels

Compared to ed25519, secp256k1 has an abundance of exploitable side channels. You could well argue that we don't care about side channels when we verify (and we make that argument later), since the data is publicly available. We do, however, care about side-channel protection on the signing side.

For instance, the signing routines on the device holding the private key are exposed to side-channel attacks: your “ultra secure” air-gapped PC that signs Bitcoin transactions for you can leak your private key through a side channel. Hardware wallets and HSM devices that do not allow you to change or upgrade their firmware, such as the YubiKey, are the most vulnerable.

Consequently, ed25519 is considered safe, while secp256k1 is not [SafeCurves 2017]; and the slowness of ed25519 verification is an engineering problem that we can solve.

How fast is ed25519?

With that in mind, here's a comparison of relative performance in OpenSSL 3.5.0, measured on a 2021 MacBook Pro.

Algorithm | Sign/s | Verify/s
RSA 3072 | 667.2 | 36,046.1
secp256k1 | 64,135.3 | 21,111.7
ed25519 | 36,743.4 | 13,117.7

And here are the same measurements on a RYZEN 9 9950x.

Algorithm | Sign/s | Verify/s
RSA 3072 | 2,165.8 | 49,479.2
secp256k1 | 99,210.9 | 32,735.7
ed25519 | 60,950.5 | 20,966.4

RSA is by far the fastest at signature verification, while the Bitcoin choice of secp256k1 is the fastest at signing and the faster of the two elliptic curves at verification. ed25519 comes in dead last for verification (at least with OpenSSL's code), and while it signs significantly faster than RSA, it is still slower than secp256k1.

So why do people say that ed25519 is fast? Simply put, because almost all libraries implementing ed25519 make both the sign and verify routines constant-time; in other words, ed25519 implementations are side-channel resistant by default. They are also easier to work with, preventing common usage errors.

Both our work here and related work below are examples of trying to accelerate ed25519 further.

Scaling beyond SIMD

Depending on the CPU architecture, our method scales well with the number of CPU cores. Ideally, every physical CPU core has its own vector processing unit at its disposal. In some cases, however, the hardware is shared [Mysticial 2025], as is the case with Zen 4.

Ideally, the throughput of signature verification scales linearly with the total number of 64-bit lanes that fit into the vector registers (that is, 8 lanes for 512-bit registers, and so on). This scaling can only be negatively affected by a drop in the overall clock speed.

With multi-threading, especially on heterogeneous architectures such as Zen 5, the situation is a great deal more complex. Not all core complexes have the same access to the same memory locations. Some Zen 5 CPUs share a single memory controller located on a dedicated I/O die, which communicates with the core complex dies (CCDs) over Infinity Fabric. Or, to put it simply, memory is far away and highly contended. The operating system also introduces uncertainty by assigning threads to kernel work.

Sub-linear scaling will be representative of most real-world deployments. While we could attempt to control for these factors, most real deployments will be subject to just as many confounding factors, and often more. Most bare-metal hosting providers will not let you boot a unikernel operating system.

Related work

Recent research explores various approaches to accelerate elliptic curve cryptography (ECC) operations. For instance, [Zhang 2024] considers the use of SIMD instructions to speed up the key exchange process in the context of TLS 1.3. The paper revisits various optimization strategies for ECC and presents a more performant X25519/Ed25519 implementation using AVX-512IFMA instructions. These optimizations, covering all levels of ECC arithmetic (finite field, point arithmetic, and scalar multiplication), result in speedups of up to 35% for TLS handshake and up to 24% for DNS-over-TLS (DoT) queries, with peak server throughput for DoT queries improving by 24% to 41%.

💡 AVX-512 IFMA is a CPU extension that provides "Integer Fused Multiply-Add" for 52-bit numbers. On Zen 5 CPUs it can output eight 104-bit results per cycle. It does this by reusing the mantissa datapath of the standard double-precision (64-bit) floating-point hardware.

Similarly, [De Smet 2024] significantly accelerates elliptic curve operations on ARM NEON processors. Leveraging a novel representation, they implemented an extended twisted Edwards Curve25519 back-end for the popular Rust library curve25519-dalek, achieving speedups close to 20%.

[Banerjee 2024] largely concentrates on designing hardware optimally suited for the underlying problem. They present a Curve25519 and Curve448 ASIC that implements a wide variety of algorithmic optimizations, including Karatsuba multiplication and restructured arithmetic on the Montgomery ladder.

In another machine-specific approach, [Owens 2024] presents an optimized assembly implementation of EdDSA operations (key generation, signing, verification) with the Ed25519 parameters on the ARM Cortex-M4. Their work discusses optimizing field and group arithmetic on this platform to produce high-throughput cryptographic primitives.

Finally, [Gao 2024], based on the RISC-V 64-bit instruction set, proposes several methods to improve the performance of the Curve25519 public key cryptography algorithm, dubbed V-Curve25519. V-Curve25519 optimizes the implementation from large integer representation, finite field, point arithmetic, and scalar multiplication, demonstrating speedups of about 35% compared to common implementations.

Conclusions

This post shares an early performance result that aligns with the vision outlined in the Eclipse Performance Thesis. Our implementation of sigverify reaches throughput close to 4M TPS on a commonplace desktop machine and over 9M TPS on a server-grade machine. On the server-grade machine this means we can verify signatures far faster than the line speed of common 10 Gbps links; our implementation handles in excess of 65 Gbps of inbound transaction traffic. We achieved this by utilizing SIMD instructions more efficiently and optimizing for throughput. Our implementation demonstrates favorable, multiplicative scaling with:

  1. The width of SIMD registers;
  2. The number of cores processing signatures;
  3. The number of threads (including SMT threads) processing signatures.

To summarize, we have designed a library that is highly specialized for use in the context of an Eclipse blockchain. Our choices with respect to the coordinate representations, the usage of 51-bit limbs, the transposed multi-lane utilization of vector registers, as well as pervasive use of caching enabled us to create a library that satisfies our need to process transactions at line speed.

Our experiments also pave the way for greater horizontal scalability: we can obviously run multiple machines that process signatures, yielding linear speedups. The same principle applies both to our SIMD-based implementation and, more broadly, to techniques that rely on GPUs. However, our experiments amply demonstrate that the current approach is already more than enough to ensure signature verification is not a bottleneck.

Part 2

This post turned out to be much longer than we initially anticipated. As such, not all data that we gathered is included here. We plan to release a second part with many more details about the specifics of our implementation in due time.

In that second part we shall consider the effects of CPU architectures, how this approach can be made more efficient, and what types of hardware would be most suitable to solving this problem.

References

[Mysticial 2025]: Zen5's AVX512 Teardown + More

[pcpartpicker 2025]: PC Part Picker, RYZEN 9 9950x recommended configuration

[Vantage 2025]: f2.12xlarge pricing and specs - Vantage

[Shilov 2023]: Ampere Unveils 192-Core CPU, Controversial Benchmarks

[SafeCurves 2017]: SafeCurves: choosing safe curves for elliptic-curve cryptography

[Soatok 2022]: Guidance for Choosing an Elliptic Curve Signature Algorithm in 2022

[Zhang 2024] ENG25519: Faster TLS 1.3 handshake using optimized X25519 and Ed25519 by Zhang et al., USENIX Security Symposium, 2024.

[De Smet 2024] Armed with Faster Crypto: Optimizing Elliptic Curve Cryptography for ARM Processors by De Smet et al., Sensors, 2024.

[Banerjee 2024] A High-Performance Curve25519 and Curve448 Unified Elliptic Curve Cryptography Accelerator by Banerjee et al., IEEE High Performance Extreme Computing Conference (HPEC), 2024.

[Owens 2024] Efficient and Side-Channel Resistant Ed25519 on ARM Cortex-M4 by Owens et al., IEEE Transactions on Circuits and Systems, 2024

[Gao 2024] V-Curve25519: Efficient Implementation of Curve25519 on RISC-V Architecture, by Gao et al., Information Security and Cryptology. Inscrypt 2023.

[keyhunter 2025] Vulnerable Components of the Bitcoin Ecosystem: The Problem of Incorrect Calculation of the Order of the Elliptic Curve secp256k1
