braindebug Nov 14 2019 at 17:45

Down the Rabbit Hole: A Story of One varnishreload Error — part 1

8 min

1.1K

After hitting the keyboard buttons for the past 20 minutes, as if he was typing for his life, ghostinushanka turns to me with a half-mad look in his eyes and a sly smile, “Dude, I think I got it.

Look at this” — as he points to one of the characters on screen — “I bet my red hat that if we add what I’ve just sent you here” — as he points to another place in the code — “there will be no error anymore.”
Slightly puzzled and tired I modify the sed expression we’ve been figuring out for some time now, save the file and run systemctl varnish reload. Error message gone…

“Those emails I’ve exchanged with the candidate,” my colleague continues, as his smile changes to a wide and genuine grin, “It suddenly struck me that this is the very same exact problem!”

How it all began

This article assumes some familiarity with bash, awk and systemd. Some knowledge of Varnish is beneficial, but not required.
Timestamps in example snippets have been redacted.
Co-authored with ghostinushanka.

Sun shines through the wall-sized windows on yet another warm autumn morning, a cup of freshly brewed caffeinated liquid sits to the side of the keyboard, headphones vocalize the beloved symphony of sounds covering the rustle of mechanical keyboards around and the first entry in backlog on kanban board playfully displays the fateful ticket’s title “Investigate varnishreload sh: echo: I/O error in staging”. Whenever Varnish is concerned, there is no room for error(s), even though this particular one didn’t seem to be causing any actual problems.

For those of you unacquainted with varnishreload, it is simply a shell script used to reload the configuration — also called the VCL — of the Varnish caching server.

As the ticket's title hints, the error has been encountered on one of the staging machines and I was pretty sure the Varnish routing does work in the staging environment, so my assumption was that this has to be some minor issue. Just a user-friendly output message written to a closed stream. I grab the ticket, firmly believing I'll be able to mark it resolved in under 30 minutes, pat myself on the back for clearing yet another mundane task and get back to more important things.

Hitting the wall at 200kph

Opening the varnishreload file on one of the affected servers running on Debian Stretch, I find a shell script less than 200 lines long. Briefly reading through it, I see nothing dangerous that would prevent me from running the script from terminal over and over again. After all, this is staging, even if it breaks, no one is going to complain, well… not too much, that is. I run the script and observe, only to find out that there are no errors to be seen. A couple more repeated runs to make reasonably sure that I cannot reproduce the error without any extra effort and I start devising plans to tweak and bend the script's environment. Does closing STDOUT for the script altogether (with > &-) help anything? Or STDERR? Neither did.

Obviously systemd mangles the environment in some way, but how, and… why? I fire up vim and edit the system’s varnishreload, adding set -x right under the shebang, hoping that the detailed script run output will shed some light.

File is patched, so I reload varnish, only to see that the change had completely broken the script… Output is a complete mess displaying tons of C-style code and the default scrollback buffer is not enough to find where does it come from. I feel confused. Could setting debug option for the shell script break the program it calls? No, can’t be. A bug in the shell? Multiple possible scenarios running wildly in different directions in my mind. A cup of caffeinated beverage is instantly finished, quick trip to the kitchen for a refill and here we go again. I open the file and look closely at the shebang: #!/bin/sh.

But /bin/sh is surely just a symlink to bash, so that the script is interpreted in POSIX-compliant mode, right? Wrong! The default non-interactive shell on Debian is dash, and that's exactly what /bin/sh points at.

# ls -l /bin/sh
lrwxrwxrwx 1 root root 4 Jan 24  2017 /bin/sh -> dash

If only for debugging, I changed the shebang to #!/bin/bash, removed the set -x and tried again. Finally, a reasonable error output from the next varnish reload:

Jan 01 12:00:00 hostname varnishreload[32604]: /usr/sbin/varnishreload: line 124: echo: write error: Broken pipe
Jan 01 12:00:00 hostname varnishreload[32604]: VCL 'reload_20190101_120000_32604' compiled

Line 124, now we're talking!

114 find_vcl_file() {
115         VCL_SHOW=$(varnishadm vcl.show -v "$VCL_NAME" 2>&1) || :
116         VCL_FILE=$(
117                 echo "$VCL_SHOW" |
118                 awk '$1 == "//" && $2 == "VCL.SHOW" {print; exit}' | {
119                         # all this ceremony to handle blanks in FILE
120                         read -r DELIM VCL_SHOW INDEX SIZE FILE
121                         echo "$FILE"
122                 }
123         ) || :
124
125         if [ -z "$VCL_FILE" ]
126         then
127                 echo "$VCL_SHOW" >&2
128                 fail "failed to get the VCL file name"
129         fi
130
131         echo "$VCL_FILE"
132 }

But as it turns out, line 124 is pretty uneventful. I could only conjecture that the error was produced as part of the multiline command executing at line 116.

So what does the above subshell even produce to store in the VCL_FILE variable? In the first part it sends the contents of the VCL_SHOW variable created on the line 115 into the pipe. What happens there, then?

First, it uses varnishadm, which is a standard part of a Varnish installation used to configure Varnish without having to restart it. The subcommand vcl.show -v is used to print the entire VCL configuration specified by ${VCL_NAME} to STDOUT.

To display the current active VCL config as well as several previous versions of the varnish routing that are still in memory, you can use another command varnishadm vcl.list, whose output would be similar to the below:

discarded   cold/busy          1 reload_20190101_120000_11903
discarded   cold/busy          2 reload_20190101_120000_12068
discarded   cold/busy         16 reload_20190101_120000_12259
discarded   cold/busy         16 reload_20190101_120000_12299
discarded   cold/busy         28 reload_20190101_120000_12357
active      auto/warm         32 reload_20190101_120000_12397
available   auto/warm          0 reload_20190101_120000_12587

The variable ${VCL_NAME} is set elsewhere in the varnishreload script to the name of the currently active VCL, if any. In this case, that would be "reload_20190101_120000_12397".

Great, so ${VCL_SHOW} now contains a full configuration for Varnish, easy enough so far. Now I finally understood why the dash output with set -x appeared to be so broken — it included the contents of the resulting varnish configuration.

The important thing here is that the full VCL config may often be spliced together from multiple files. C-style comments are used to delineate where config files were included into other config files, which is exactly what the next line of the code snippet is all about.

The syntax of the file-denoting comments has the following format

// VCL.SHOW <NUM> <NUM> <FILENAME>

The numbers are not important here, what we’re interested in is the filename.

So what in the world is happening in the slew of commands beginning on line 116?
Let's pick it apart.
There are four parts to the command:

A simple echo that prints out the value of ${VCL_SHOW}
```
echo "$VCL_SHOW"
```
awk that looks for a line (record) where the first field is '//' and the second is "VCL.SHOW".
Awk is instructed to print the first line matching these patterns and then immediately stop processing.
```
awk '$1 == "//" && $2 == "VCL.SHOW" {print; exit}'
```
A code block that reads in the whitespace-delimited fields into five variables. The fifth variable FILE gets the remainder of the line. Finally, one last echo prints the contents of the ${FILE} variable.
```
{ read -r DELIM VCL_SHOW INDEX SIZE FILE; echo "$FILE" }.
```
As steps 1 through 3 are all encased in a subshell, the output of $FILE will end up in the variable VCL_FILE.

As the comment on line 119 suggests, this way of doing things serves a single purpose: to reliably handle the case where VCL would be referencing filenames with spaces.

I commented out the original processing logic for the ${VCL_FILE} and tried to tweak the chain of commands but to no reasonable end. Everything worked in my shell but never when run as a service.

It seems the error is not at all replicable when run by me — meanwhile the estimated 30 minutes passed six times and a new high-priority task put everything aside. The rest of the week was quite full with different tasks, the two exceptions being an internal talk our team had about using sed and an interview with a promising candidate. The issue with making varnishreload error disappear was completely lost to the sands of time.

Your so-called sed-fu… is really… quite pathetic

One of the days of the week that followed was pretty free, so I picked the task up again. I had hoped that maybe some background process in my brain was still chipping away at the problem and I'll finally be able to crack it.

Since bending the code last time didn't help, I just opted for a rewrite of line 116. The existing code was insane, anyway. There is absolutely no need to use read here.

Looking at the error again:
sh: echo: broken pipe — echo is in two places in that command, but I suspect the very first one to be a more likelier culprit (or an accomplice). Awk doesn't inspire confidence either. Well, in case it really is the awk | {read; echo} construct causing all this trouble, why not use something else? Awk's not really being used to its full capabilities on that one-liner and then there is this surplus read.

Seeing as we had an internal talk about sed the other week, I wanted to try my newly-acquired skills and optimize the echo | awk | { read; echo } into a simpler echo | sed. Although that’s definitely not the proper way to approach debugging, I thought I’d at least try out my sed-fu and maybe learn something new about the problem in the process. In the process, I’ve asked my colleague — the author of the sed talk — to help me come up with a more efficient sed command.

I’ve dumped the varnishadm vcl.show -v "$VCL_NAME" into a file, so I could focus on writing sed without all the hassle around service reloads.

A short primer on how exactly sed processes input can be found in its GNU manual. In sed sources character \n is explicitly specified as the line separator.

After several iterations and input from my colleague, we’ve crafted a sed expression that did produce exactly the same result as the original line 116.

Let’s create a sample input file here,

> cat vcl-example.vcl
Text
// VCL.SHOW 0 1578 file with 3 spaces.vcl
More text
// VCL.SHOW 0 1578 file.vcl
Even more text
// VCL.SHOW 0 1578 file with TWOspaces.vcl
Final text

It might not be apparent from the above description, but we’re only interested in the first // VCL.SHOW comment, and there may be several on input. That’s exactly why awk quits after the first match.

# step 1, capture just the comment lines
# using sed capability to specify delimiter character with ‘\#’ instead of the commonly used ‘/’ so there is no need to escape slashes themselves
# and the “address” capability defined as regex “// VCL.SHOW” to search for lines with specific pattern
# -n flag makes sure that the sed does not print all as it does by default (see above link)
# -E switches to the extended regex
> cat vcl-processor-1.sed
\#// VCL.SHOW#p
> sed -En -f vcl-processor-1.sed vcl-example.vcl
// VCL.SHOW 0 1578 file with 3 spaces.vcl
// VCL.SHOW 0 1578 file.vcl
// VCL.SHOW 0 1578 file with TWOspaces.vcl

# step 2, only print out the file name
# using the “substitute” command with regex capture groups to print just that group
# and this is done only for the matches of the previous search
> cat vcl-processor-2.sed
\#// VCL.SHOW# {
    s#.* [0-9]+ [0-9]+ (.*)$#\1#
    p
}
> sed -En -f vcl-processor-2.sed vcl-example.vcl
file with 3 spaces.vcl
file.vcl
file with TWOspaces.vcl

# step 3, make sure to only get the first result
# same as with the awk before, add an immediate exit after the first processed match is printed
> cat vcl-processor-3.sed
\#// VCL.SHOW# {
    s#.* [0-9]+ [0-9]+ (.*)$#\1#
    p
    q
}
> sed -En -f vcl-processor-3.sed vcl-example.vcl
file with 3 spaces.vcl

# step 4, wrap it up into a one-liner using the colon to separate commands
> sed -En -e '\#// VCL.SHOW#{s#.* [0-9]+ [0-9]+ (.*)$#\1#p;q;}' vcl-example.vcl
file with 3 spaces.vcl

So, the contents of the varnishreload script would look something like this:

VCL_FILE="$(echo "$VCL_SHOW" | sed -En '\#// VCL.SHOW#{s#.*[0-9]+ [0-9]+ (.*)$#\1#p;q;};')"

The above logic may succinctly be expressed by:
if a line matches the regex // VCL.SHOW, then greedily match the text including the two numbers on that line and capture whatever comes after. Emit the capture and quit.

Simple, isn't it?

We were happy with the sed script and the fact what original code it replaces, all test runs I’ve done produced desired results, so I have modified the varnishreload on the server and fired the systemctl reload varnish once again. The dreaded echo: write error: Broken pipe was smiling in our faces. The blinking cursor awaited a new command entry in the dark void of the terminal…

Hubs: