Thursday, March 28, 2019

Resource Pools of Game Audio

I gave a presentation a while back at the 2017 Austin Game Conference on the different resource pools available to an audio engine and how to balance their usage during development.

The presentation slides are available here:

...and PDF:

Here are the presentation notes for posterity... these probably read better in conjunction with the slides, but you can probably get the gist:

The goal today is to shed some light on a few of the technical considerations that can have a profound effect on the resulting sound of any game but especially those facing limitations in any of the three major areas of storage, memory, and runtime processing.

If you’re currently in the process of creating a game, app, or experience and haven’t faced these challenges, I hope this talk will surface some considerations that can help increase the quality of audio through an understanding of these valuable resource pools.

Resources, pipeline, and workflow are all fundamental components of game development, and the constraints they place on game audio can lead to either creativity or disaster, depending on whether you can get your head around the resource limitations you’ll be faced with.

Well-understood constraints can enable creative decisions that work within the confines of development, whereas a failure to understand the limits of your capacity can lead to last-minute, hasty choices and increases the potential for overlooking the obvious.

In the multi-platform marketplace, there is rarely a single cross-platform solution to service all scenarios. This is especially true when discussing these fundamental resources.

The important thing is to understand how the three resource pools interact and how the unique qualities of each can be leveraged to create a great audio experience for the player.

I wanted to get some terminology out of the way up-front in the hope that when I start throwing around verbiage it all locks into place for you.

If you’re used to hearing these terms, please bear with me as we bring everyone up to the same level of understanding.

Media - AKA audio files
Seek Time - how long it takes to find the data, whether in RAM or storage
Voices/ Instances - a single sound file being rendered
Render - audio being played back to the output
Synthesize - the act of combining one or more sounds into a single output
(Audio) Event - the currency of communication between the audio engine and the game engine
Encoding - the process of converting/ compressing PCM audio to reduce its size
Decoding - the process of converting a compressed format back to PCM for playback
DSP - Digital Signal Processing; commonly used as part of a plug-in to modify sound either in realtime or as part of encoding
Streaming - allows for the playback of large files using a small portion of RAM
Buffer - a reserved portion of RAM for processing audio streams or the final render

Before a game or app can be played or executed, it must first arrive on your local storage.

Whether by disc or download, total audio file size is a consideration that will spring up time and again across every platform.

The size of your final audio media deliverable for a title is a difficult number to pull out of thin air at the beginning of a project.

I’ve seen some great interactive spreadsheets auto-sum huge collections of theoretical data into precise forecasts of storage needs, only to be rendered moot in the final moments of development by a late-breaking scope change or a decision higher up the food chain.

That doesn’t mean the exercise lacked merit; in fact, it helped establish the audio team's commitment to being solutions-driven, with a great depth of understanding of their (potential) content footprint.

It’s in your best interest to begin discussions about resources for audio as soon as possible. Messaging your findings and re-engaging the wider development team on a regular basis can help remind people about audio’s contribution to resources.

Anyone who has ever shipped a title will tell you that storage space is often at a premium for audio, whether that’s due to the limitation of physical media or the over-the-air-data concerns of mobile development.

Think: the cost of size over cellular, or thousands of WAVs on a Blu-ray.

Sample rate, compression, and variation management play a significant role in the final audio size, and often mean trade-offs and sacrifices in order to get things in shape.
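To make those trade-offs concrete, here's a rough back-of-the-envelope sketch. The `pcm_size_bytes` helper and the numbers are illustrative, not from the talk; they just show how sample rate, bit depth, and channel count drive uncompressed size before any codec is applied.

```python
def pcm_size_bytes(seconds, sample_rate=48000, bit_depth=16, channels=2):
    """Uncompressed PCM size: duration * rate * bytes-per-sample * channels."""
    return int(seconds * sample_rate * (bit_depth // 8) * channels)

# One minute of 48 kHz / 16-bit stereo:
full = pcm_size_bytes(60)                      # 11,520,000 bytes (~11 MB)
# Halving the sample rate halves the footprint before any codec runs:
half = pcm_size_bytes(60, sample_rate=24000)   # 5,760,000 bytes
# A lossy codec at roughly 10:1 shrinks it further still:
encoded = full // 10                           # ~1.1 MB
```

The point is that sample-rate and codec decisions compound: each cut multiplies through the whole content library.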

And even when you’ve got the initial storage size in place for distribution, there’s the additional step of getting the stored data onto the device and ready to be accessed by the app.

There is the question of whether the data now needs to be copied from the physical media or unpacked from its packed format in preparation for launch.
The speed at which this process can be executed is tied directly to the speed of the HDD, SSD, or DVD.

Requires communication to make it work
So again, some storage considerations that are important to keep in mind:

  • Download Size
  • Unpacked Size
  • Final Data Size
  • “Seek” Speed (pretty fast)

  • Amount of time to download over cellular
  • Amount of time to download over WiFi
  • Amount of time to download over broadband
  • Amount of time to copy from physical media to HDD/SSD
  • Amount of time to unpack to HDD/SDD
  • Seek speed of storage media
  • Available storage on physical disk
  • Available storage on device
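Several of the timing items above are simple arithmetic once you fix a payload size and a link speed. A hedged sketch (the `download_seconds` helper and the link speeds are illustrative assumptions, not measurements):

```python
def download_seconds(size_bytes, mbit_per_s):
    """Rough transfer time: bytes converted to bits, divided by link speed in Mbit/s."""
    return (size_bytes * 8) / (mbit_per_s * 1_000_000)

audio_payload = 200 * 1024 * 1024               # a hypothetical 200 MB audio package
cellular = download_seconds(audio_payload, 10)  # ~168 s on a 10 Mbit/s link
wifi = download_seconds(audio_payload, 100)     # ~17 s on a 100 Mbit/s link
```

Even a crude model like this is useful in planning conversations: it turns "how big is audio?" into "how long does the player wait?".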
Keep in mind that on mobile (and potentially on console as well) it’s pretty common behavior to delete the largest app in order to create space for the new hotness.

When the project directors start looking for ways to increase the potential for the app to stick around, whether it’s on device or console, you can bet they’ll go looking for the largest contributors to the media footprint.

Any audio folks in the room will know that…

These are some of the largest contributors to the audio content footprint.

As one of the largest contributors to size on disk, the question of storage, file size, and resulting quality is one of the foremost concerns for most content creators.

While each generation edges closer towards the ability to leave this concern behind, there will always be the need to optimize size considerations for new platforms and playback mechanisms.

Storage size is a concern

Things are better, but there’s always new restricted platforms to consider

Random Access Memory (RAM)

RAM is the interim location of all sound media and also reserves memory for use by the audio engine.

RAM is often the most valuable resource due to its speed and fluidity. It allows for the storage of media that can be played back and processed on-demand with low latency during gameplay.

In addition to storing instances of sound files, RAM is also used to store audio variables as well as some decoding & DSP processing resources that need to be available to the audio engine.

Some of the benefits & uses of RAM include:
  • Faster seek
  • Temporarily store media
  • Streaming buffers
    • Size
    • Seek speed
  • Audio engine
    • Sound playback
    • Audio variables
    • Simultaneous voices
    • Decoding
    • Effects processing (DSP)
The speed of access makes this pool a valuable resource & fundamental to the eventual sound that is rendered by the game. 

The amount of RAM allocated for audio also ultimately determines the maximum number of voices that can be played back by the audio engine.

In short, RAM is comprised of:

  • MEDIA - Instances of audio files
  • VOICES - Maximum number of physical voices
  • DECODING - Compressed audio being decompressed at runtime
  • DSP - Processing executed on audio being rendered
As the interim memory location for both audio files and data concerning the playback of audio, RAM is another critical component of the resources used by audio during gameplay.

While RAM allocation on console has increased to match low-spec PCs, mobile continues to face restrictions that require thoughtful use of what is available.

The Central Processing Unit is the brains of the computer where most calculations take place.

It is the powerhouse of runtime execution responsible for synthesizing the audio and rendering it to the output of a device.

This means everything from applying DSP to calculating the variables that drive playback:

  • Volume, pitch, and position for each voice
  • Keeping virtual voice information available so voices can be smoothly returned to physical voices if necessary
  • Applying DSP across voices
  • Decoding every compressed audio file from RAM in preparation for rendering the final output
  • Streaming media from storage, through buffers, to be synthesized along with the rest of the audio
  • Manipulating all data moving in & out of the other resource pools
The total number of physical voices playing simultaneously is the greatest contributor to increased CPU usage, and it can multiply other aspects that also affect CPU, such as the DSP and decompression required to render each voice at runtime.

The fastest way to reduce CPU usage is often the management and limiting of the physical voices being requested for playback by the application.

It is imperative to provide, early in the project, a robust way to creatively and comprehensively control the number and types of sounds playing in any situation; this allows for good decision-making well in advance of later optimization.
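As a sketch of what such control might look like, here's a minimal priority-based cull, assuming each voice carries a numeric priority (the `cull_voices` name and the data shapes are hypothetical, not any engine's API):

```python
def cull_voices(active, max_voices):
    """Keep only the highest-priority voices; the rest get stopped
    (or virtualized). Each voice is a (priority, name) tuple."""
    ranked = sorted(active, key=lambda v: v[0], reverse=True)
    return ranked[:max_voices], ranked[max_voices:]

keep, drop = cull_voices([(5, "music"), (1, "footstep"), (3, "gunshot")], 2)
# keep -> [(5, "music"), (3, "gunshot")]; drop -> [(1, "footstep")]
```

Real engines layer much more on top (fade-outs, virtual voice tracking), but the core decision is this sort: which sounds matter most right now.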

DSP at runtime provides flexibility and malleability, allowing sound to be extended and manipulated during the playback of linear audio (media or sequenced synthesis).

Hardware vs. Software Decompression

Sound files can be huge

Until we have the data throughput to push around gigabytes of audio data there will continue to be a quality compromise between size and fidelity.

This challenge mirrors the fight for fidelity over MP4, MP3, Vorbis and other “lossy” compressed formats across other media.

The fidelity of a codec should be evaluated in context, i.e. most sounds aren't played alone, so their compression artifacts (if any) may well be masked by other sounds (psycho-acoustics and all that).

This opens an opportunity for cranking up the compression a notch to either save more space or CPU (sometimes both).

In addition to the loading of media into RAM for playback, sounds can also be streamed from physical media (Blu-ray) or across a network.

Streaming is done by allocating a “buffer”, a portion of RAM used to pass through sequential audio data while rendering the sound to the output.

Like the tape-head on a cassette deck, the data is streamed through the buffer and played back by the audio engine.
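The tape-head analogy can be sketched as a loop that pulls fixed-size slices from storage and hands them to the renderer. The `stream_file` helper and the 4 KB buffer size are illustrative, not any specific engine's API:

```python
import io

def stream_file(read_chunk, render, buffer_size=4096):
    """Stream audio through a fixed-size buffer instead of loading it whole:
    repeatedly fill the buffer from storage and hand each slice to the renderer."""
    total = 0
    while True:
        chunk = read_chunk(buffer_size)   # pull the next slice from storage
        if not chunk:                     # end of file
            break
        render(chunk)                     # mix this slice, then reuse the buffer
        total += len(chunk)
    return total

# Stand-in for a file on disk: 10,000 bytes pass through in 4096/4096/1808 slices.
src = io.BytesIO(b"\x00" * 10000)
slices = []
total = stream_file(src.read, slices.append)
```

However large the file, only `buffer_size` bytes of it occupy RAM at any moment, which is the whole point of streaming.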

While the increase in CPU performance has slowed over the past few years, the need to optimize audio’s use is greater than ever. As the centerpiece of every platform the needs of the CPU and processing for audio continue to be fundamental to the execution of data and the rendering of audio.

Now that you have a clearer idea of the role these three resource pools play in the rendering and playback of sound, it’s important to understand how they can be optimized before it becomes a problem for your development.

Here are a few suggestions for ways that you can rein in your resource budgets.

The first area that can have the greatest impact on CPU & RAM is the optimization of voices across the entire project.

Voices can be limited globally, per game object, based on mixer bus associations, or at the Event level.

Additionally, voices can be removed from the processing queue based on their volume or on the quality of the device/ platform.

Voices should be optimized throughout production. Limit early/ limit often. (Mix early/ Mix often)
Old-school NES example, as well as non-verbal communication.

Voices were limited globally due to hardware restrictions, but it illustrates the point.

It’s easy to imagine quickly filling up the number of voices available when working on a game with waves of NPCs, mass destruction, and complex interactions.

But what if your experience needs to play and communicate using audio across both high-end as well as low-end devices?

By detecting the quality of the device, using that variable to scale the maximum voices, and then coupling these values with a way to prioritize which voices are heard on low-end devices, you can create a system that lets the right voices through in order to communicate their intention.
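One way such a system might be sketched, assuming a detected device tier and per-request priorities (the tier names, caps, and `admit_voices` helper are made up for illustration):

```python
# Illustrative tiers; a real project would detect device capability at startup.
MAX_VOICES = {"low": 8, "mid": 24, "high": 64}

def admit_voices(requests, tier):
    """Scale the voice cap by device tier, letting the highest-priority
    requests through so the essential sounds still land on low-end devices."""
    cap = MAX_VOICES[tier]
    return sorted(requests, key=lambda r: r["priority"], reverse=True)[:cap]

requests = [{"name": f"voice_{i}", "priority": i} for i in range(10)]
low_end = admit_voices(requests, "low")    # only the top 8 priorities survive
high_end = admit_voices(requests, "high")  # all 10 fit under the 64-voice cap
```

The same content plays everywhere; only the admission policy changes per device.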

In this example, we’ll first hear the high-end device version of a music piece with a huge number of the voices being utilized.

Second, we’ll hear a low-end device version of what the music would sound like using a very limited number of voices.

Additionally, voices can usually be limited per game object, with behaviors for when limits are reached, in order to achieve the correct effect or sound.

Discard oldest instance to stop the oldest playing instance with the lowest priority.
Discard newest instance to stop the newest playing instance with the lowest priority.
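These two behaviors might be sketched like this, assuming instances arrive oldest-first and each carries a priority (the `enforce_instance_limit` name and data shapes are hypothetical):

```python
def enforce_instance_limit(instances, limit, behavior="discard_oldest"):
    """Per-sound instance limiting. `instances` is ordered oldest-first;
    among the lowest-priority instances, drop the oldest or newest first."""
    while len(instances) > limit:
        lowest = min(inst["priority"] for inst in instances)
        candidates = [i for i in instances if i["priority"] == lowest]
        victim = candidates[0] if behavior == "discard_oldest" else candidates[-1]
        instances.remove(victim)
    return instances

footsteps = [{"id": 1, "priority": 1}, {"id": 2, "priority": 1}, {"id": 3, "priority": 5}]
kept = enforce_instance_limit(list(footsteps), limit=2)  # drops the oldest low-priority instance
```

Which behavior sounds right depends on the sound: for footsteps you usually want the newest to win; for a one-shot stinger, the first to land.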

Ultimately, the limiting of voices can be used creatively in order to shape the sound of the game in an appropriate way.

One technique that proved to be invaluable on mobile (PvZ2) was the loading of compressed audio from storage & decoding it directly into RAM for low latency (uncompressed) playback.

While sound quality was maintained between the compressed and uncompressed versions, this allowed sounds that were played back frequently to pay the cost of decoding only once, when the content was loaded/ copied into memory (instead of each time the sound was requested to play).

For commonly played sounds, this had a direct effect on the amount of CPU used at runtime (lower), while we were able to deliver a (smaller) compressed audio footprint on device.

When decompressed, we did expand the audio footprint roughly 10x into RAM, but the trade-off between CPU & storage/ download made this an acceptable compromise.
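A minimal sketch of the decode-once pattern, using `zlib` as a stand-in for a real audio codec (the `DecodedCache` class and names are illustrative, not the engine PvZ2 actually used):

```python
import zlib

class DecodedCache:
    """Decode a compressed asset once on first play, then reuse the PCM
    from RAM: ~10x RAM expansion traded for zero per-play decode cost."""
    def __init__(self, decode_fn):
        self.decode_fn = decode_fn   # e.g. a codec's decompress routine
        self.cache = {}
        self.decodes = 0
    def play(self, name, compressed):
        if name not in self.cache:
            self.cache[name] = self.decode_fn(compressed)
            self.decodes += 1        # decode cost paid only on the first play
        return self.cache[name]

cache = DecodedCache(zlib.decompress)
payload = zlib.compress(b"pcm" * 1000)
for _ in range(3):
    pcm = cache.play("ui_click", payload)  # decoded once, then served from RAM
```

The counter makes the trade explicit: three plays, one decode, and the storage/ download payload stays compressed.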

It was once common to reserve the streaming of audio for large sound files that would take up too much space in RAM.

As resources have become more plentiful in the way of multiple cores, giant hard drives, and copious amounts of RAM, streaming ALL audio files (or otherwise optimizing the way sound files are accessed by the game at runtime) is evolving to help solve some of the problems associated with manually managing your own soundbanks.

Several AAA titles have had success with streaming their audio into RAM on-demand, keeping it around until it’s no longer needed, and then unloading it when it makes sense.

This helps to keep storage low by only ever having a single version of a sound on disk.

It also helps keep RAM usage low at runtime because only the sound files that are still in-use will be loaded.
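A sketch of such on-demand, reference-counted loading, with hypothetical names and a toy loader standing in for disk I/O:

```python
class StreamedBankManager:
    """Load a sound into RAM on first request, reference-count its users,
    and unload it once nothing still needs it."""
    def __init__(self, load_fn):
        self.load_fn = load_fn   # stands in for reading the single on-disk copy
        self.loaded = {}         # name -> (data, refcount)
    def acquire(self, name):
        data, refs = self.loaded.get(name, (None, 0))
        if data is None:
            data = self.load_fn(name)     # loaded on demand, only once
        self.loaded[name] = (data, refs + 1)
        return data
    def release(self, name):
        data, refs = self.loaded[name]
        if refs <= 1:
            del self.loaded[name]         # no users left: free the RAM
        else:
            self.loaded[name] = (data, refs - 1)

mgr = StreamedBankManager(lambda name: b"pcm-for-" + name.encode())
mgr.acquire("explosion")
mgr.acquire("explosion")   # second user shares the already-loaded copy
mgr.release("explosion")   # still in RAM: one user remains
mgr.release("explosion")   # last user gone: unloaded
```

One copy on disk, one copy in RAM only while referenced: that is the whole appeal over hand-managed soundbanks.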

I remember hearing about the idea of “loose-loading audio files” in 2009, right here in Austin, in a presentation given by David Thall, where he had achieved this media-loading strategy at Insomniac Games. Since then, audio middleware manufacturers have added development tools to help solve the file-duplication problem that can arise from manually managing soundbank associations for media, and to leverage the increasing speed of CPUs in order to manage data more efficiently.

Limiting the number of audio files in your app doesn’t have to mean reducing the number of variations for a given sound.

The ability to perform sound design directly within an audio toolset allows for the combination & dynamic recombination of sound elements.

This “granular” or element-based approach, where elements _of_ a sound are kept as a library within the audio engine and creatively combined at runtime, can net big savings in storage.

Whether it’s creating a library of material sounds that can be dynamically combined depending on surface type, or building instrument soundbanks that can be played back via MIDI files, the creation of sound components that the audio engine can combine dynamically at runtime can offset the need for large, linear sound files and instead leverage the capabilities of today’s full-featured audio engines and toolsets.
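A toy sketch of the element-based idea, assuming small per-layer libraries (all the layer and file names here are made up): picking one element per layer yields many runtime combinations from few stored files.

```python
import random

MATERIAL_LAYERS = {
    "transient": ["knock_a", "knock_b"],
    "body":      ["thud_a", "thud_b", "thud_c"],
    "debris":    ["splinter_a", "splinter_b"],
}

def compose_impact(layers, rng=None):
    """Pick one element per layer and combine them at runtime, instead of
    shipping every pre-rendered permutation (here: 12 combos from 7 files)."""
    rng = rng or random
    return [rng.choice(options) for options in layers.values()]

wood_hit = compose_impact(MATERIAL_LAYERS)
```

Seven stored elements give 2 × 3 × 2 = 12 distinct impacts; pre-rendering the same variety would mean twelve full-length files on disk.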

Additionally, the procedural & synthesis explosion is soon to be upon us, and in some places the “runtime funtime” (™ Jaclyn Shumate) style of sound design is already leading the charge.

With multiple companies pushing towards accessible authoring of modeled and synthesized sound with incredible results, it’s only a matter of time before we’re offsetting our storage needs with realistic sounding approximations for little to no file size.

Replacing media with procedural models or synthesis not only gives you the flexibility of parameterizing aspects of the sound dynamically but also reduces the storage footprint.

As the authoring and quality of these techniques continues to grow, there will be less and less dependency on rendered audio and more focus on generating sound at runtime.

We’ve now gone over the functions and interdependency of the three main resource pools that can affect the audio for your game, application, or experience.

Additionally, we looked at some opportunities to optimize each of these resources toward maximizing what's available.

But the hidden message throughout all of this is audio’s dependency on these resources in order to make great audio: the way audio is in service to the development team, and relies on understanding, communicating, and advocating for the resources needed.

Hopefully this has helped give you a deeper appreciation for the challenge & equips you for the discussions yet to come.

Here are some additional resources to help you go further into optimization: