Back in 2016, United States-based composer and performer Sergio Elisondo released a one-man-band music album, A Winner Is You (know your meme), with multi-instrumental cover versions of tunes from numerous memorable classic NES games. A special feature of this release was a version in the NES cartridge format that would run on a classic unmodified console and play digitized audio of the full album, instead of the typical chiptune sound you would expect to come from this humble console. I was involved with the software development part of this project.
This year Sergio returns with a brand new music release. This time it is an all-original album, You Are Error, heavily influenced by video game music aesthetics. It also comes with a special extra. This time we have raised the stakes: the new NES cartridge release includes not only the digitized audio, but also full motion videos for each song, done in a silhouette cutout style similar to the famous Bad Apple video. Yet again, the project is crowdfunded via Kickstarter. It reached its funding goal in a mere 7 hours, but there is still a little time to jump on the bandwagon and get yourself a copy. In the meantime I would like to share some insight into the technical side of both projects.
Lots of ROM
So, how would one take an old console that was released back in 1983, with hardware that likely originates from even earlier times, with only a handful of kilobytes of memory and mere megahertz of 8-bit processing power, and make it play digitized audio, let alone full motion video?
Sure, it is possible to put a hardware MP3 decoding chip into the cartridge, or even better, a Raspberry Pi-like single board PC full of 32-bit power, and make it run the classic Doom. This is pretty cool, however, it raises a question of why the NES is even needed then. At least from a tech purist standpoint it would be difficult to say that the NES is capable of doing these miracles on its own. It sure looks pretty impressive regardless, especially to those who aren’t picky about technical details.
These considerations were among the reasons why we picked a somewhat more conventional approach: just a huge PRG ROM. For this purpose, a new custom cartridge board with 64MB (megabytes) of ROM on board has been developed by RetroUSB. He also created a new custom mapper (the memory paging control device), very UNROM-like, just with 4096 16KB banks to pick from. This isn’t exactly authentic to the 80s era either; however, this way the actual audio and video playback is (barely) handled by the console all on its own.
Development and debug
Once we had decided exactly how we were going to do this, the problem of setting up a development toolchain arose. The new custom mapper was of course not supported by any emulator in existence, so to have something to begin with I had to modify the popular FCEUX emulator first. It has many nice features that are useful for debugging; however, its emulation accuracy is not especially high. It was just enough for the first project, but for the second one I had to modify a couple of more accurate emulators as well. None of these have been released to the public, as there were no ROM releases, and the mapper does not even have an iNES mapper number assigned to it.
Testing on the real hardware was complicated by a quirky burning process: for some reason it takes nearly 4 hours to reflash the board. Not having the board or an original console in my hands didn’t help, either. So Sergio flashed my test builds occasionally, and then I made some wild guesses based on the results and fixed things using an emulator.
I was using my regular NES development toolchain: the CA65 cross assembler from the CC65 package, which allows the binary output to be configured for the needs of an arbitrary 6502-powered system; Notepad++ as the code editor; Graphics Gale and GIMP to process the graphics; Wavosaur and Audacity to process the audio; VirtualDub to process the video source; my own NES Screen Tool to prepare NES graphics; and some other tools. I also wrote custom converters for the audio and video data using Visual C++ Express. The amount of data was pretty large and the algorithms were brute force, so compiled C code was needed to do the processing faster, although for something smaller and simpler I would use Python, as it works pretty well for things like this. The build process was automated with regular Windows batch files and some Python scripts (my tastes are very singular).
Playing the audio
The first project, A Winner Is You, featured digitized audio only, without doing much else at the same time. That is, no animation was displayed on the screen while the audio was playing. This is a relatively simple task code-wise: we’re just fetching bytes from the ROM, switching the ROM banks as needed, and outputting the bytes to the APU DAC at a steady rate. This can be accomplished by nearly any home micro or game console of the past, given that it has a DAC and access to a large enough ROM.
The main difficulty of playing digital audio on a machine as simple as the NES is the lack of any means to precisely synchronize code with real time, such as high resolution timers and interrupts. The only time source is the CPU clock frequency and the number of clocks each operation takes to execute. This means that to output data to the DAC at a specific steady rate, the code needs to be carefully planned and timed very precisely. This, however, is a rather standard trick for such old systems, perfected over the years, so it wasn’t a big deal. The NES CPU is fast enough to provide a theoretical maximum sampling rate of ~74 kHz on NTSC and ~69 kHz on PAL. Considering the amount of available ROM and the amount of audio data to be stored, the more traditional sample rate of 44100 Hz was picked. The NES APU DAC has 7-bit resolution, so the resulting sound quality is along the lines of the original 8-bit Sound Blaster, which featured a slightly better 8-bit DAC but ran at half the sample rate.
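The relationship between the CPU clock and the sample rate can be sketched with a few lines of Python. This is only an illustration of the arithmetic: the NTSC clock value is the one quoted in the loop comment further down, while the PAL clock value and the function names are assumptions added here.

```python
# Cycle budget arithmetic for a cycle-counted playback loop (illustrative only).
NTSC_CPU_HZ = 1789773  # value used in the article's own loop comment
PAL_CPU_HZ = 1662607   # assumption: commonly quoted PAL NES CPU clock

def clocks_per_sample(cpu_hz, sample_rate_hz):
    """Whole number of CPU clocks available between two DAC writes."""
    return cpu_hz // sample_rate_hz

def effective_rate(cpu_hz, clocks):
    """Sample rate actually produced by a loop that takes `clocks` CPU clocks."""
    return cpu_hz / clocks

if __name__ == "__main__":
    for name, hz in (("NTSC", NTSC_CPU_HZ), ("PAL", PAL_CPU_HZ)):
        budget = clocks_per_sample(hz, 44100)
        print(name, budget, "clocks per sample,",
              round(effective_rate(hz, budget)), "Hz effective")
```

Because the loop length must be a whole number of clocks, the effective rate lands close to the target rather than exactly on it, which is presumably why separate NTSC and PAL timings are needed to keep pitch and tempo correct.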
The player shell allows the user to seek through a track, fast forwarding and rewinding it with the sound speeding up and reversing, just like compact cassettes did in the past, as well as to pause and slow down. The player also supports both NTSC and PAL modes, providing correct pitch and tempo. This has been implemented via a handful of versions of the sound loop code with different timings. When the user presses a button, an icon changes on the screen. To update the graphics, the audio stops for a moment; however, as the resulting action always creates some change in the sound, this gap is not noticeable by ear.
To illustrate all of the above here is a glimpse of the regular speed sound loop. It plays a number of 256-sample packets and features the NTSC timings. The number of CPU clocks is specified for each operation in the comments.
;1789773/44100=40t (fCpu/fSampleRate)

playLoopNTSC:
    lda (BANK_OFFSET),y   ;5+   fetching a byte from a ROM bank
    sta APU_DMC_RAW       ;4    sending it to the DAC
    sta LAST_DMC_RAW      ;4    also storing for later use in the player
    nop                   ;2    the common delay in both code paths
    nop                   ;2
    nop                   ;2
    nop                   ;2
    nop                   ;2
    nop                   ;2
    iny                   ;2    increase pointer LSB
    beq :+                ;2/3+ go the MSB increment path on overflow
    nop                   ;2    extra delay for the shorter code path
    nop                   ;2
    nop                   ;2
    nop                   ;2
    nop                   ;2
    jmp playLoopNTSC      ;3=40t take the loop
:
    inc <BANK_OFFSET+1    ;5    increase pointer MSB
    dex                   ;2    count 256-byte blocks
    bne playLoopNTSC      ;2/3+ =40t take the loop
This code expects that a 256-byte data block won’t ever cross the ROM bank boundary. The banks get switched in the outer loop. Thanks to the mapper design, it is done in a very efficient way:
    ldy #0                ;zero offset
    sta (BANK_NUMBER),y   ;switch the bank

The word variable BANK_NUMBER stores the number of the currently used ROM bank. The bank numbers are stored in the range $8000..$8FFF, that is, 0..4095 with the most significant bit set. The mapper watches for any writes to ROM addresses and translates them into bank selection by latching the lowest 12 bits of the address into the bank selection register. The actual value written does not matter, just the address used.
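To make the address-as-bank-number idea concrete, here is a minimal Python model of the latch described above. It is a sketch written from the description in this article, not the mapper's actual logic; the class and function names are hypothetical.

```python
# Toy model of the bank latch: a write to any ROM address selects a bank,
# taken from the low 12 bits of that address. The written value is ignored.
ROM_BASE = 0x8000        # start of the cartridge ROM area in CPU address space
BANK_COUNT = 4096        # 4096 banks of 16 KB each
BANK_SIZE = 16 * 1024

def bank_select_address(bank):
    """Address to write to in order to select `bank` (what BANK_NUMBER holds)."""
    if not 0 <= bank < BANK_COUNT:
        raise ValueError("bank out of range")
    return ROM_BASE | bank

class MapperModel:
    """Hypothetical stand-in for the mapper's bank selection register."""

    def __init__(self):
        self.bank = 0

    def cpu_write(self, address, value):
        # Only writes aimed at ROM addresses are treated as bank selects.
        if address >= ROM_BASE:
            self.bank = address & 0x0FFF

if __name__ == "__main__":
    mapper = MapperModel()
    mapper.cpu_write(bank_select_address(1234), 0x00)
    print(hex(bank_select_address(1234)), mapper.bank)
    print(BANK_COUNT * BANK_SIZE // (1024 * 1024), "MB total")
```

Keeping the bank number pre-combined with $8000 in a zero page word is what lets a single indirect store do the switch, with no separate register address or value to load.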
Showing the video
Performing any actions alongside digitized audio playback, especially full motion animation, is a much more ambitious task for the humble old NES. Not only do the timings of the code have to be set very precisely to provide a steady sample rate, but access to the video memory is only possible at certain times, namely during the vertical blanking period. The video memory can also only be accessed indirectly, through a single-byte communication port with serial access.
An extra challenge is provided by the video system design of the NES. Basically it is a text mode with some hardware sprites on top. The background layer can only display 256 unique characters (patterns) from a character set that is stored in ROM, or gets loaded into character RAM (if it is installed on the cartridge board). Whenever the graphics of any particular character are changed, they change everywhere on the screen at once. Thus simply setting or clearing a single pixel at an arbitrary location on the screen, and keeping this change for all subsequent frames, which is common on raster buffer based systems, is a challenge on the NES. Also, to create animation on the NES it is often necessary to update a few sets of data at once: the pattern data, the nametable (character map), and the color attributes area.
All of this turns even a simple blit of uncompressed video frames into the video memory into a tricky challenge. This is the reason that my video stream format lacks any interframe compression, which has otherwise been very common for video data since the dawn of time.
Nevertheless, my video stream format features a kind of intraframe compression, a lossy one even. Its purpose is not to just reduce the amount of the data, but to reduce the number of unique characters in a video frame. It is important, as only so many characters can be uploaded into the video memory during the access time in a single TV frame.
This “compression”, or rather character set optimization, has been implemented in my tools for a while, and it often comes in very handy for my NES projects. The premise is very basic: it looks for the two most visually similar characters in a set (the main difficulty is picking a criterion of similarity), removes one of them by replacing it with the other, and repeats the process until the desired number of characters in the set is reached. This approach produces major visual artifacts that get more prominent as the “compression” ratio increases, and it does not do a good job on images with fine details, quickly turning them into noise.
The necessity of employing such a compression technique led to the stylistic choice for the actual video content: we picked a silhouette style along the lines of the famous Bad Apple video. In fact, that video has become a kind of benchmark for video playback projects on retro computers, and I was using it during development as a test for my video encoder as well.
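The greedy merge described above can be sketched roughly as follows. This is not the actual converter: the similarity criterion used here (the count of differing pixels) is an assumption picked for simplicity, tiles are plain tuples of pixel values, and the function names are made up for the example.

```python
from itertools import combinations

def tile_distance(a, b):
    """Assumed similarity criterion: number of pixels that differ."""
    return sum(1 for pa, pb in zip(a, b) if pa != pb)

def reduce_charset(tiles, max_unique):
    """Merge the most similar tiles until at most `max_unique` remain.

    `tiles` lists the tiles of one picture in screen order.
    Returns (charset, nametable), where nametable indexes into charset.
    """
    if max_unique < 1:
        raise ValueError("max_unique must be at least 1")
    # Exact duplicates cost nothing to merge, so collapse them first.
    charset, index_of, nametable = [], {}, []
    for tile in tiles:
        if tile not in index_of:
            index_of[tile] = len(charset)
            charset.append(tile)
        nametable.append(index_of[tile])
    alive = list(range(len(charset)))   # tiles still in the set
    remap = list(range(len(charset)))   # original tile -> surviving tile
    while len(alive) > max_unique:
        # Brute force search for the closest pair, then drop one of the two.
        keep, drop = min(
            combinations(alive, 2),
            key=lambda pair: tile_distance(charset[pair[0]], charset[pair[1]]),
        )
        alive.remove(drop)
        remap = [keep if target == drop else target for target in remap]
    new_index = {old: new for new, old in enumerate(alive)}
    return ([charset[i] for i in alive],
            [new_index[remap[i]] for i in nametable])

if __name__ == "__main__":
    # Toy 2x2 "tiles" instead of real 8x8 ones, to keep the demo readable.
    a, b, c = (0, 0, 0, 0), (0, 0, 0, 1), (3, 3, 3, 3)
    print(reduce_charset([a, b, c, c, a], 2))
```

The pair search is quadratic in the number of tiles and is repeated for every removed tile, which fits the remark about brute force algorithms needing compiled code for full-length videos.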
The pictures below demonstrate the optimization process in action — the source picture is made of 960 unique characters, then it gets reduced to 256, 128 and 64 unique characters.
The regular vertical blanking period, when the video memory can be accessed, lasts for about 2300 CPU clocks per TV frame; that’s 22 scan lines, 113.6 clocks each. In order to use this very limited time most effectively, an unrolled loop has to be used. This may be done in a few ways. The following code, which uses the absolute addressing mode for the source data, can transfer about 300 bytes in the given time:

    lda SRC               ;4
    sta PPU_DATA          ;4 - 8 clocks per byte
Faster copying is possible, but it comes with limitations and difficulties. Here are a couple of mid-optimal solutions that buffer the data in the zero page or the stack page of the RAM:

    lda <SRC              ;3
    sta PPU_DATA          ;4 - 7 clocks per byte

    pla                   ;3
    sta PPU_DATA          ;4 - 7 clocks per byte as well

The fastest way of copying data to the VRAM involves storing the data directly inside immediate load opcodes, via the use of self-modifying code:

    lda #NN               ;2
    sta PPU_DATA          ;4 - 6 clocks per byte
This trick allows a transfer of nearly 400 bytes per standard vertical blanking period of a TV frame. However, it takes 5 bytes of code to transfer one byte of data, and 400*5 = 2000 bytes of code is nearly the whole amount of RAM available on the NES. That’s why the further calculations assume the least optimal approach, with 8 clocks per byte.
In order to make a full update of the screen, which includes the whole character set and nametable, 5 kilobytes of data have to be transferred to the video memory. Considering the numbers above, it would take 5120/300 = 17 TV frames, resulting in a video frame rate of 60/17 = 3.5 frames per second.
To increase the throughput, the blanking time needs to be extended somehow. This can be done by forcing blanking in some scanlines of the visible raster. Those will then be displayed as a solid background color. Using the 8-clock copying, about 14 bytes can be transferred in each extra blanking scanline. However, even if the whole screen is blanked, this only allows less than 4KB to be transferred per TV frame.
With so many factors to consider, a balance has to be found between the amount of data to be transferred, the number of scanlines to be displayed or blanked, and the number of TV frames spent on a full frame update, in order to provide a smooth frame rate in the video animation.
The commonly accepted minimum frame rate needed to maintain the illusion of motion is considered to be 12-18 frames per second. It should also be considered that, besides the video memory updates, the player code has to constantly keep fetching audio data and outputting it through the DAC at evenly spaced intervals a few dozen CPU clocks long. This means that not all of the blanking period is actually available for accessing the video memory; part of it is spent playing the audio, too.
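The throughput figures in this section are easy to reproduce. The sketch below only restates the arithmetic using the approximate numbers from the text; the 262 scanlines per NTSC TV frame and all the function names are additions made for the example.

```python
# Back-of-the-envelope VRAM throughput, using the approximate figures above.
VBLANK_CLOCKS = 2300          # regular vertical blanking, in CPU clocks
CLOCKS_PER_SCANLINE = 113.6   # CPU clocks per scanline
FULL_UPDATE_BYTES = 5 * 1024  # whole character set plus nametable
NTSC_SCANLINES = 262          # assumption: scanlines in one NTSC TV frame

def bytes_per_vblank(clocks_per_byte):
    """Bytes an unrolled copy can move in one regular blanking period."""
    return VBLANK_CLOCKS // clocks_per_byte

def bytes_per_scanline(clocks_per_byte):
    """Bytes gained for every extra scanline of forced blanking."""
    return int(CLOCKS_PER_SCANLINE // clocks_per_byte)

def update_fps(bytes_per_tv_frame, tv_rate=60):
    """Video frame rate if every full update moves FULL_UPDATE_BYTES."""
    return tv_rate * bytes_per_tv_frame / FULL_UPDATE_BYTES

if __name__ == "__main__":
    for clocks in (8, 7, 6):
        print(clocks, "clocks/byte:", bytes_per_vblank(clocks), "bytes per vblank")
    print(round(update_fps(300), 1), "fps with about 300 bytes per TV frame")
    print(NTSC_SCANLINES * bytes_per_scanline(8), "bytes with the whole frame blanked")
```

The last figure stays under 4096 even with nothing displayed at all, so a full 5K update cannot fit into a single TV frame with the 8-clock copy.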
After many experiments this balance has been found:
256x160 pixels resolution (32x20 characters)
4 colors with a per frame palette
212 unique characters per frame
15 frames per second for NTSC, 12.5 frames per second for PAL
Sample rate of 27360 Hz for NTSC, 25450 Hz for PAL
There were experiments with multi-palette conversion, too, using 4 separate palettes and color attributes to increase the maximum number of colors per frame to 13. However, it proved to be very tricky at the image conversion stage: the low resolution of the color attributes made it impossible to find an approach that would split the image into areas with a smooth transition between sub palettes. As it was not clear how to create such an algorithm, or whether it is possible at all, it was decided to stick to the basic 4 colors and just colorize some of the video sections in sepia.
The full screen update in the resulting player takes four TV frames in both NTSC and PAL. The difference in video frame rate and audio sample rate comes from the difference in the TV frame rates (60/4=15, 50/4=12.5) and the difference in the main CPU clock frequency. The audio sample rate is defined by the fact that a sample is output to the DAC every 64 CPU clocks; this number remains the same in both versions.
The data format
In order to make the seeking (fast forwarding and rewind) through a video stream easier, the data size of a single video frame has been set to a fixed value, 8K per frame. Having this number, it is also easy to calculate how much video content can fit into the available ROM. It takes 120K per second, so a 64MB board can fit about 546 seconds of video, i.e. almost 10 minutes.
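The capacity estimate above boils down to a couple of integer divisions, shown here as a small Python sketch. The constant and function names are invented for the example.

```python
# Capacity estimate for a fixed-size video frame format.
FRAME_BYTES = 8 * 1024          # one video frame
VIDEO_FPS = 15                  # NTSC video frame rate
ROM_BYTES = 64 * 1024 * 1024    # 64MB board

def bytes_per_second():
    return FRAME_BYTES * VIDEO_FPS

def seconds_of_video(rom_bytes=ROM_BYTES):
    return rom_bytes // bytes_per_second()

def frame_offset(frame_number):
    """ROM offset of a frame: fixed-size frames make seeking a multiplication."""
    return frame_number * FRAME_BYTES

if __name__ == "__main__":
    print(bytes_per_second() // 1024, "K per second")
    print(seconds_of_video(), "seconds, about",
          round(seconds_of_video() / 60, 1), "minutes")
```

The `frame_offset` helper illustrates why the fixed frame size helps with seeking: any frame can be located without scanning the stream.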
Each TV frame in a video frame is represented by a 2K packet that consists of eight 256-byte blocks. The data layout in the packets is very tricky, so it is hard to describe even for me now. A number of factors led to such a messed up layout:
The data has to be read from the ROM in the middle of the visible raster, and has to be transferred to the destination both in the bottom part of the current TV frame and in the top part of the next TV frame.
The code has to be very optimized, so the data is located in a way to allow the easiest access to a particular data piece at a given time.
The format was changing and being tweaked all the time. PAL support was introduced late, so it was easier to keep some of the layout of the preceding versions to avoid introducing even more changes into the already debugged code.
Each of the 2048-byte packets contains:
456 bytes of the NTSC audio (456*60=27360 Hz)
509 bytes of the PAL audio (509*50=25450 Hz)
The first three 2048-byte packets also contain:
1024 bytes of the character data (64 characters)
The last packet stores different video data:
320 bytes of the character data (20 characters)
13 bytes of the color palette
640 bytes of the nametable data
48 bytes of the color attributes data
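As a sanity check, the sizes listed above can be added up to confirm that both packet variants fit into 2048 bytes. The dictionary keys below are just labels made up for this example, and the sketch says nothing about the actual order of the data inside a packet.

```python
# Byte budget of the two packet variants, using the sizes listed above.
PACKET_BYTES = 2048

AUDIO = {"ntsc_audio": 456, "pal_audio": 509}   # present in every packet
FIRST_THREE = {"characters": 1024}              # 64 characters
LAST = {"characters": 320,                      # 20 characters
        "palette": 13,
        "nametable": 640,
        "attributes": 48}

def packet_size(video_part):
    """Total payload of a packet carrying the given video data plus audio."""
    return sum(AUDIO.values()) + sum(video_part.values())

if __name__ == "__main__":
    for name, part in (("first three", FIRST_THREE), ("last", LAST)):
        used = packet_size(part)
        print(name, "packets:", used, "bytes used,", PACKET_BYTES - used, "spare")
```

Both variants leave a few dozen bytes unused, which is consistent with a layout that was tuned for access speed rather than packed tightly.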
In order to make the same video data usable in both NTSC and PAL, which is important to avoid duplicating the whole video stream, the PAL mode player skips some of the frames (3 of the 15 per second). Luckily this is rather easy to do without interframe compression. The audio data is stored in two versions, however, as it would complicate the code a lot otherwise. If the audio data were stored in just a single version, the delays between the DAC outputs would have to differ between PAL and NTSC, and that would require huge changes in the code that transfers the data into the video memory.
The video player code can’t simply copy the data from the ROM to the DAC and the video memory. This is complicated by the way the mapper works. The NES architecture provides 32K of address space for the ROM data. The mapper keeps the top 16K of these 32K fixed, i.e. it maps a fixed ROM bank to this area, which contains the reset and NMI vectors and such. The bottom 16K of the 32K is switchable; any of the 4096 ROM banks can be switched in there. As unrolled loops are necessary to achieve the desired performance, the code gets pretty large and has to be stored in the switchable ROM banks. However, this code needs to access data that is also located in the switchable ROM banks, and as both can’t be switched in at once, the data becomes inaccessible. To solve this issue, a buffering pipeline that squeezes the buffered data into the very limited 2K of NES RAM has been employed.
The code can be split into two functional parts, the readers and the pushers. These are called from different parts of the visible raster and in different TV frames from the main update loop. Double buffering is implemented, so the partially updated video frame is hidden from the screen and only gets displayed once fully updated. To implement this, both PPU character sets and nametables are used, which means that the pushers have to transfer data into different locations of the video memory on alternating video frames.
There are two readers, one for NTSC and another for PAL. A reader gets invoked during the active part of the raster; it fetches the data from a ROM bank and stores it in the RAM for further use. It also fetches the audio data and outputs it to the DAC right away, without putting it into a buffer. The buffered data contains the character data, the nametable, and the audio data to be played in the other parts of the raster. As the reader needs to access a ROM bank, its code is located in the fixed ROM page, which has very limited space, so the reader's code is kept very universal and is not fully unrolled (six iterations are unrolled). In order to gain the fastest access to the different locations of the ROM bank, self-modifying code is used, so it is placed into the RAM before execution.
The NTSC reader code is executed during the 160 visible scanlines and uses up about 18176 CPU clocks. It reads 1536 bytes from a ROM bank, putting them into the RAM buffer, and plays 284 audio samples without buffering in the meantime. The PAL reader takes more time, 202 scanlines and 21568 clocks, even though it buffers the exact same amount of data. This is caused by the need to read and play 337 audio samples instead: the TV frame rate is lower, so more samples per frame are needed to maintain a similar sample rate. The extra 42 scanlines are located in the extended blanking period that is a specific feature of the PAL version of the NES.
The pushers' duty is to transfer the buffered data from the RAM into the desired locations as fast as possible. They fetch data from the buffer and stream it to its destinations: the audio data into the APU DAC every 64 clocks, the graphics into the video memory in the remaining time. There are 8 pushers, a pair for each of the four TV frames needed to perform a full video frame update. Each TV frame has a pusher that works in the top forced blanking half of the raster and another that works in the bottom forced blanking half, the areas where video memory access is enabled. The code of the NTSC and PAL pushers is exactly the same.
The first of the top pushers sets up the palette, loads the nametable and color attributes, and sets up a few extra bytes in the nametable to display the OSD icons. All the other pushers transfer different amounts of character data to the video memory: 38 characters in each of the top pushers, 26 characters in each of the bottom pushers but the last, and the last one loads 20 more characters (26+38+26+38+26+38+20, 212 characters total).
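The reader numbers can be cross-checked against the 64-clock sample period and the per-packet audio sizes from the data format section. The sketch below assumes that each 2K packet carries exactly the audio for one TV frame, which is how the text reads; the dictionary layout and function names are made up for the example.

```python
# Cross-check of the reader timings against the 64-clock sample period.
CLOCKS_PER_SAMPLE = 64

READERS = {
    # clocks spent in the reader, audio samples stored per TV frame packet
    "NTSC": {"reader_clocks": 18176, "samples_per_tv_frame": 456},
    "PAL": {"reader_clocks": 21568, "samples_per_tv_frame": 509},
}

def reader_samples(system):
    """Samples played directly from ROM while the reader runs."""
    return READERS[system]["reader_clocks"] // CLOCKS_PER_SAMPLE

def buffered_samples(system):
    """Samples left over for the pushers to play from the RAM buffer."""
    return READERS[system]["samples_per_tv_frame"] - reader_samples(system)

if __name__ == "__main__":
    for system in READERS:
        print(system, reader_samples(system), "played by the reader,",
              buffered_samples(system), "buffered for the pushers")
```

Under that assumption the buffered remainder comes out the same for both systems, which fits with the pusher code being shared between NTSC and PAL.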
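The split of the character data between the pushers can be written down as a small schedule. This is a reading of the description above rather than the real code: it assumes the first top pusher uploads no character data at all, and it takes 16 bytes per character from the 1024 bytes per 64 characters in the packet format.

```python
# Character upload schedule across the 8 pushers, as read from the text above.
TOP_CHARS = 38
BOTTOM_CHARS = 26
LAST_CHARS = 20
BYTES_PER_CHAR = 16   # 1024 bytes / 64 characters in the packet format

def pusher_schedule():
    """List of (tv_frame, half, characters) for one full video frame update."""
    schedule = []
    for tv_frame in range(4):
        # Assumption: the first top pusher handles palette and nametable only.
        top = 0 if tv_frame == 0 else TOP_CHARS
        bottom = LAST_CHARS if tv_frame == 3 else BOTTOM_CHARS
        schedule.append((tv_frame, "top", top))
        schedule.append((tv_frame, "bottom", bottom))
    return schedule

if __name__ == "__main__":
    total = 0
    for tv_frame, half, chars in pusher_schedule():
        total += chars
        print(tv_frame, half, chars, "characters,", chars * BYTES_PER_CHAR, "bytes")
    print(total, "characters total")
```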
A double-edged cart
Besides content that is quite unusual in the NES realm, the new release also features another gimmick, a limited run of double-edged carts with a connector on both sides. I pitched this idea as a joke at first, but as they say, every joke has some truth in it.
The reason is that we initially considered using a 128MB version of the cartridge board, also courtesy of RetroUSB, to include the full album, so the content was created for this larger version. However, for some reason the player code that had been working just fine in the emulators and on the other boards (the older 64MB one and a separate small MMC3 test) just refused to work properly on the new board: it was glitching out, and the video playback just hung before ever starting. As at the moment the 128MB board is only accessible to its creator, and he hasn't been able to figure out the issue so far, we had to decide to use the older, smaller board and limit the release contents to just six songs. This brought up the idea that the whole album would fit on a couple of boards, which then turned into the idea of putting these boards into a single case and making it an extra feature of the project.