# 12.3 million concurrent WebSockets

One thing about WebSockets is that you need a lot of resources on the client side to generate a load high enough to make the server actually eat up all of its CPU resources.

There are several challenges you have to overcome, because the WebSockets protocol is more CPU-demanding on the client side than on the server side. At the same time, you need a lot of RAM to store information about open connections when you have millions of them.
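To get a feel for the scale (a back-of-the-envelope illustration with an assumed figure, not a measurement): if each open connection costs, say, 16 KB of combined kernel socket buffers and application state, then 12.3 million connections already consume about 12,300,000 × 16 KB ≈ 200 GB of RAM.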

I've been lucky enough to get a couple of new servers at my disposal for a limited period of time for hardware "burn-in" tests. So I decided to use my Lua Application Server, LAppS, to do both jobs: test the hardware and run the LAppS high-load tests.

Let's see the details.

# Challenges

First of all, WebSockets on the client side require a pretty fast RNG for traffic masking, unless you want to fake it. I wanted the tests to be realistic, so I discarded the idea of faking the RNG by replacing it with a constant. That leaves only one option: a lot of CPU power. As I said, I had two servers: one with dual Intel Xeon Gold 6148 CPUs (40 cores in total) and 256GB of RAM (DDR4-2666), and one with quad Intel Xeon Platinum 8180 CPUs (112 cores in total) and 384GB of RAM (DDR4-2666). I'll be referring to them as Server A and Server B hereafter.
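To make the first challenge concrete, here is a minimal sketch of the masking that RFC 6455 requires for every client-to-server frame, which is where the RNG pressure comes from. This is my own illustration, not the actual cws code; the function name maskFrame is invented:

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Per RFC 6455, every client-to-server frame carries a fresh 32-bit masking
// key, and the payload is XOR-ed with it byte by byte. At millions of small
// frames per second, the PRNG draw below becomes a measurable per-frame cost
// on the client side.
static std::mt19937 rng{std::random_device{}()};

void maskFrame(std::vector<uint8_t>& payload, uint8_t mask_out[4])
{
  const uint32_t key = rng();            // one PRNG draw per frame
  for (size_t i = 0; i < 4; ++i)
    mask_out[i] = (key >> (8 * i)) & 0xFF;
  for (size_t i = 0; i < payload.size(); ++i)
    payload[i] ^= mask_out[i % 4];       // cyclic XOR with the 4-byte key
}
```

One fresh 32-bit key per frame means millions of PRNG draws per second at this scale, which is exactly the CPU cost that cannot be faked away without making the traffic unrealistic.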

The second challenge: there are no WebSockets libraries or test suites that are fast enough, which is why I was forced to develop a WebSockets client module for LAppS (cws).

# The test setup

Both servers have dual-port 10GbE cards. Port 1 of each card is aggregated into a bonding interface, and these ports are connected to each other directly in round-robin (RR) mode. Port 2 of each card is aggregated into a bonding interface in active-backup mode, connected over an Ethernet switch. Each server was running RHEL 7.6. I had to build gcc-8.3 from source to have a modern compiler instead of the default gcc-4.8.5. Not the best setup for performance tests, but beggars can't be choosers.

The CPUs on Server A (2x Gold 6148) have a higher frequency than those on Server B (4x Platinum 8180), while Server B has more cores, so I decided to run the echo server on Server A and the echo clients on Server B.

I wanted to see how LAppS behaves under high load with millions of connections, and what maximum number of echo requests per second it can serve. These are actually two different sets of tests, because as the number of connections grows, the server and the clients have to do an increasingly complex job.

Before this, I ran all the tests on the home PC I use for development; their results will be used for comparison. My home PC has an Intel Core i7-7700 CPU at 3.6GHz (4.0GHz Turbo) with 4 cores and 32GB of DDR4-2400 RAM, and runs Gentoo with a 4.14.72 kernel.

Spectre and Meltdown patch levels:

• Home PC:

```
/sys/devices/system/cpu/vulnerabilities/l1tf:Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
/sys/devices/system/cpu/vulnerabilities/meltdown:Mitigation: PTI
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass:Mitigation: Speculative Store Bypass disabled via prctl and seccomp
/sys/devices/system/cpu/vulnerabilities/spectre_v1:Mitigation: __user pointer sanitization
/sys/devices/system/cpu/vulnerabilities/spectre_v2:Vulnerable: Minimal generic ASM retpoline, IBPB, IBRS_FW
```

• Servers A and B:

```
/sys/kernel/debug/x86/ibpb_enabled : 1
/sys/kernel/debug/x86/pti_enabled : 1
/sys/kernel/debug/x86/ssbd_enabled : 1
/sys/kernel/debug/x86/ibrs_enabled : 1
/sys/kernel/debug/x86/retp_enabled : 3
```

I'll make an extra note in the results of the Server A and B tests about whether these patches were enabled or not.

On my home PC, the patch level never changes.

## Installing LAppS 0.8.1

Installing the prerequisites and LAppS itself.

LAppS depends on luajit-2.0.5 or higher, libcrypto++ 8.2 and wolfSSL-3.15.7; these have to be installed from source on RHEL 7.6, and likely on any other Linux distribution.

The installation prefix is /usr/local. Here is the pretty much self-explanatory Dockerfile part for the wolfSSL installation:

```
ADD https://github.com/wolfSSL/wolfssl/archive/v3.15.7-stable.tar.gz ${WORKSPACE}

RUN tar xzvf v3.15.7-stable.tar.gz

WORKDIR ${WORKSPACE}/wolfssl-3.15.7-stable

RUN ./autogen.sh

RUN ./configure CFLAGS="-pipe -O2 -march=native -mtune=native -fomit-frame-pointer -fstack-check -fstack-protector-strong -mfpmath=sse -msse2avx -mavx2 -ftree-vectorize -funroll-loops -DTFM_TIMING_RESISTANT -DECC_TIMING_RESISTANT -DWC_RSA_BLINDING" --prefix=/usr/local --enable-tls13 --enable-openssh --enable-aesni --enable-intelasm --enable-keygen --enable-certgen --enable-certreq --enable-curve25519 --enable-ed25519 --enable-intelasm --enable-harden

RUN make -j40 all

RUN make install
```


And here is the part for the libcrypto++ installation:

```
ADD https://github.com/weidai11/cryptopp/archive/CRYPTOPP_8_2_0.tar.gz ${WORKSPACE}

RUN rm -rf ${WORKSPACE}/cryptopp-CRYPTOPP_8_2_0

RUN tar xzvf ${WORKSPACE}/CRYPTOPP_8_2_0.tar.gz

WORKDIR ${WORKSPACE}/cryptopp-CRYPTOPP_8_2_0

RUN make CFLAGS="-pipe -O2 -march=native -mtune=native -fPIC -fomit-frame-pointer -fstack-check -fstack-protector-strong -mfpmath=sse -msse2avx -mavx2 -ftree-vectorize -funroll-loops" CXXFLAGS="-pipe -O2 -march=native -mtune=native -fPIC -fomit-frame-pointer -fstack-check -fstack-protector-strong -mfpmath=sse -msse2avx -mavx2 -ftree-vectorize -funroll-loops" -j40 libcryptopp.a libcryptopp.so

RUN make install
```


And the luajit:

```
ADD http://luajit.org/download/LuaJIT-2.0.5.tar.gz ${WORKSPACE}

WORKDIR ${WORKSPACE}

RUN tar xzvf LuaJIT-2.0.5.tar.gz

WORKDIR ${WORKSPACE}/LuaJIT-2.0.5

RUN env CFLAGS="-pipe -Wall -pthread -O2 -fPIC -march=native -mtune=native -mfpmath=sse -msse2avx -mavx2 -ftree-vectorize -funroll-loops -fstack-check -fstack-protector-strong -fno-omit-frame-pointer" make -j40 all

RUN make install
```

One optional dependency of LAppS, which may be ignored, is mimalloc, the new allocator library from Microsoft. It brings a measurable performance improvement (about 1%) but requires cmake-3.4 or higher. Given the limited amount of time for the tests, I decided to sacrifice that improvement on the servers. On my home PC, I'll not disable mimalloc during the tests.

Let's check out LAppS from the repository:

```
WORKDIR ${WORKSPACE}

RUN rm -rf ITCLib ITCFramework lar utils LAppS

RUN git clone https://github.com/ITpC/ITCLib.git

RUN git clone https://github.com/ITpC/utils.git

RUN git clone https://github.com/ITpC/ITCFramework.git

RUN git clone https://github.com/ITpC/LAppS.git

RUN git clone https://github.com/ITpC/lar.git

WORKDIR ${WORKSPACE}/LAppS
```

Now we need to remove "-lmimalloc" from all the Makefiles in the nbproject subdirectory, so we can build LAppS (assuming our current directory is ${WORKSPACE}/LAppS):

# find ./nbproject -type f -name "*.mk" -exec sed -i.bak -e 's/-lmimalloc//g' {} \;
# find ./nbproject -type f -name "*.bak" -exec rm {} \;

And now we can build LAppS. LAppS provides several build configurations, which include or exclude some features on the server side:

• with SSL support and with statistics gathering
• with SSL and without statistics gathering (though minimal statistics gathering still happens, as it is used for dynamic LAppS tuning at runtime)
• without SSL and without statistics gathering

Before the next step, make sure you are the owner of the /opt/lapps directory (or run make install with sudo).

Within a loop over the two benchmark services (i = 1, 2), substitute the target IP address into a copy of the benchmark app:

```
sed -e "s/IPADDR/$(cat IP1)/g" /opt/lapps/apps/benchmark/benchmark.lua > /opt/lapps/apps/benchmark${i}/benchmark${i}.lua
done
```

Now we modify /opt/lapps/etc/conf/lapps.json on Server B to use these two benchmark services:

Server B lapps.json:

```json
{
  "directories": {
    "app_conf_dir": "etc",
    "applications": "apps",
    "deploy": "deploy",
    "tmp": "tmp",
    "workdir": "workdir"
  },
  "services": {
    "benchmark1": {
      "auto_start": true,
      "instances": 112,
      "internal": true,
      "preload": [ "nap", "cws", "time" ]
    },
    "benchmark2": {
      "auto_start": true,
      "instances": 112,
      "internal": true,
      "preload": [ "nap", "cws", "time" ]
    }
  }
}
```

Are we ready? No, we are not, because we intend to generate 2,240,000 outgoing sockets towards only two addresses, and we need more listening ports on the server side: it is impossible to create more than 64k connections to the same ip:port pair (actually a little less than 64k).

On Server A, in the file LAppS/include/wsServer.h, there is a function void startListeners(). In this function, replace line 126,

```
LAppSConfig::getInstance()->getWSConfig()["port"],
```

with this:

```
static_cast<int>(LAppSConfig::getInstance()->getWSConfig()["port"])+i,
```

so that every listener binds its own port (a standalone sketch of this multi-port idea follows below, after the monitoring one-liner). Rebuild LAppS:

```
make CONF=Release.AVX2 install
```

# Running the tests for 2 240 000 client WebSockets

Start LAppS on Server A, then start LAppS on Server B, redirecting the output to a file like this:

```
/opt/lapps/bin/lapps.avx2 > log
```

NB: if you want to run several LAppS instances, create a separate directory for each instance (run1, run2, etc.) and start each instance from within its own directory. This is required so that the lapps.log files, and of course the resulting standard output, of concurrent LAppS instances do not overlap or overwrite each other.

Establishing all the connections may take a while, so let's monitor the progress. Do not use netstat to watch the established connections: it becomes pointless, running practically forever, once you pass about 150k connections (which are established within seconds). Instead, look at lapps.log on Server A, in the directory you were in when you started LAppS. You can use the following one-liner to see how the connections are being established and how they are distributed among the IOWorkers:

```
date;awk '/ will be added to connection pool of worker/{a[$22]++}END{for(i in a){ print i, a[i]; sum+=a[i]} print sum}' lapps.log | sort -n;date
```
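As promised, here is a standalone sketch of the multi-port listening idea behind that one-line patch. This is hypothetical illustration code, not the actual startListeners() from wsServer.h: each listener i binds base_port + i, and since a single client address can hold at most about 64k simultaneous connections to one ip:port pair, listening on N ports raises the per-client ceiling roughly N-fold:

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical sketch (not LAppS code): create one listening socket per
// listener index on base_port + i. A client host is limited to roughly 64k
// ephemeral ports per destination ip:port, so N listening ports multiply the
// per-client connection ceiling by N.
std::vector<int> startListeners(uint16_t base_port, int count)
{
  std::vector<int> fds;
  for (int i = 0; i < count; ++i) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); continue; }
    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(static_cast<uint16_t>(base_port + i)); // port offset per listener
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0 &&
        listen(fd, SOMAXCONN) == 0) {
      fds.push_back(fd);
    } else {
      perror("bind/listen");
      close(fd);
    }
  }
  return fds;
}
```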

Here is an idea of how rapidly these connections were established:

```
375874    Sun Jul 21 20:33:07 +06 2019
650874    Sun Jul 21 20:34:42 +06 2019    2894 connections per second
1001874   Sun Jul 21 20:36:45 +06 2019    2974 connections per second
1182874   Sun Jul 21 20:37:50 +06 2019    2784 connections per second
1843874   Sun Jul 21 20:41:44 +06 2019    2824 connections per second
2207874   Sun Jul 21 20:45:43 +06 2019    3058 connections per second
```


On Server B we can check how many benchmark instances have finished establishing their connections (224 = 2 services x 112 instances each):

```
# grep Sockets log | wc -l
224
```

After you see that this number is 224, let the servers work for a while and monitor the CPU and memory usage with top. You might see something like this (Server B on the left and Server A on the right):

[screenshot: top on Server B and Server A]

Or like this (Server B on the left and Server A on the right):

[screenshot: top on Server B and Server A, at a different moment]

It is clear that Server B, where the benchmark client services are running, is under heavy load. Server A is under heavy load too, but not all the time: sometimes it chills while Server B struggles with the amount of work it has. Traffic encryption with TLS is very CPU intensive.

Let's stop the servers (press Ctrl-C several times) and process the log as before, but accounting for the changed number of benchmark services (224):

```
echo -e ':%s/ms/^V^M/g\n:%g/^$/d\nGdd1G224/Sockets\n224\ndggZZ' | vi $i
awk -v avg=0 '{rps=($1/$5);avg+=rps;}END{print ((avg*1000)/NR)*224}' log
```

(here ^V^M denotes the literal Ctrl-V Ctrl-M keystrokes typed in vi, and $i stands for the log file being edited)
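For readers who prefer to see the aggregation spelled out, here is a hedged C++ rendering of that awk one-liner. It assumes, as the awk does, that in each cleaned log line field 1 is the number of echoed messages and field 5 is the elapsed time in milliseconds; the x1000 converts milliseconds to seconds, and the x224 scales the per-line average to all 224 benchmark services:

```cpp
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main(int argc, char** argv)
{
  std::ifstream log(argc > 1 ? argv[1] : "log");
  std::string line, f1, f2, f3, f4, f5;
  double avg = 0;  // running sum of per-line messages-per-millisecond
  long nr = 0;
  while (std::getline(log, line)) {
    std::istringstream fields(line);
    if (!(fields >> f1 >> f2 >> f3 >> f4 >> f5)) continue;  // need 5 fields
    const double msgs = std::atof(f1.c_str());  // field 1: echo count
    const double ms   = std::atof(f5.c_str());  // field 5: elapsed ms
    if (ms > 0) { avg += msgs / ms; ++nr; }
  }
  if (nr > 0)  // x1000: ms -> s; x224: scale the mean to all 224 services
    std::cout << (avg * 1000.0 / nr) * 224 << "\n";
  return 0;
}
```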

You might also want to delete the last several lines, to account for the printouts made while the benchmark services were stopping. You already have an idea of how to do this. So I'll just post the results for the different test scenarios and proceed to the problems I've faced.

# Results for all the other tests (4.9 million and beyond)

[results table: test #5, balanced for equal CPU sharing]

[results table: test #6]

[results table: test #8]

[results table: test #12]

[results table: test #13]

# Problems

The problematic runs are marked in red in the result tables above.

Running more than 224 benchmark instances on Server B proved to be a bad idea, as the client-side server was incapable of distributing CPU time evenly among the processes/threads. The first 224 benchmark instances, which had already established all their connections, took most of the CPU resources, and the rest of the benchmark instances lagged behind. Even renice -20 does not help much (448 benchmark instances, 2 LAppS instances):

[screenshot: top with 448 benchmark instances]

Server B (on the left) is under very heavy load while Server A still has free CPU resources.

So I doubled benchmark.max_connections instead of starting more separate LAppS instances.

Still, to get to 12.3 million running WebSockets, I started a fifth LAppS instance (tests 5 and 6) without stopping the four already running ones, and played the role of the CFQ scheduler myself, suspending and resuming the prioritized processes with kill -STOP/-CONT and/or renicing them. Unlike renice, SIGSTOP takes a process off the run queue entirely, so the lagging instances get the CPU to themselves for the duration of the pause. You can use the following script as a template:

```
while true
do
  kill -STOP <4 fast-processes pids>
  sleep 10
  kill -CONT <4 fast-processes pids>
  sleep 5
done
```

Welcome to RHEL 7.6 in 2019! Honestly, this was the first time I had used the renice command since 2009. Worse, this time it was almost useless.

I also had a problem with the scatter-gather engine of the NICs, so I disabled it for some of the tests, without actually marking this in the results table.

I also experienced partial link disruptions under heavy load and hit a NIC driver bug, so I had to discard the related test results.

# End of story

Honestly, the tests went much more smoothly than I had anticipated.

I'm convinced that I did not manage to load LAppS on Server A to its full potential (without SSL), because I did not have enough CPU resources on the client side. With TLS 1.3 enabled, though, LAppS on Server A was utilizing almost all the available CPU resources.

I'm still convinced that LAppS is the most scalable and the fastest open-source WebSockets server out there, and that the cws WebSockets client module is the only one of its kind, providing a framework for high-load testing.

Please verify the results on your own hardware.

A note of advice: never use nginx or apache as a load balancer or reverse proxy for WebSockets, or you'll end up cutting the performance by an order of magnitude. They are not built for WebSockets.