Not everyone is an artist
For one of my projects I was exploring Reddit to understand how players create characters in video games, what is important to them in this process, and what their preferences are. It turns out that communities sharing their creations or seeking help with specific character designs remain active even for games released years ago. Semi-online action-RPGs like Dark Souls, Elden Ring, and Dragon’s Dogma are particularly lively, as these games offer flexible character customization options and a chance that someone might see your character’s face once. I’ve also noticed a few trends:
Players avoid fine-tuning facial features, preferring to focus on striking makeup and hairstyles.
Players rarely ask for help choosing makeup but often request assistance in making their character resemble a celebrity or another character by providing a photo.
These attempts do not always succeed. The ability to upload an image of a character and get all the necessary settings would be a fantastic feature! So I decided this would be an ideal task for a hobby project, which, at the time, seemed feasible to complete in a few days.
I aimed to create a universal pipeline for training models to tackle this problem. However, for my first attempt, I chose the Dragon’s Dogma 2 character creator for the following reasons:
The most active community and interest in character appearance.
The character creator was released as a lightweight, free demo application.
There was already an open-source Python application for uploading characters for this creator.
A quick inspection of the creator showed that creating an inhuman character is challenging as long as you are not using extreme parameter values.
All parameters and ranges have clear values and do not contain many implicit dependencies.
Existing Solutions
The task of building a character model is similar to face recognition, which already has well-developed loss metrics based on convolutional networks and contrastive learning. There is also the related task of reconstructing a face model and texture from one or more photographs. [articles] Unfortunately, most articles on this topic quickly move into the realm of 3D morphable models (3DMM), where the authors have full control over model and texture construction. That does not quite fit my task, as the only controls I have are the ones already implemented in the game.
In the context of a hobby project, I was interested in the following constraints:
Maximum use of pre-trained models.
Use of publicly available datasets.
To begin, the task needs to be formalised. The character creator can be viewed as a function that takes a set of options and parameters as input and outputs a textured face model, which the graphics engine then renders into a flat image. Our goal is for the face in this rendered image to resemble the face in the provided image.
Face similarity determination is already handled by models like FaceNet or ArcFace. These models use CNNs to map the face image into an n-dimensional space where Euclidean distance is meaningful, and cosine similarity works well for normalised vectors. Since faces in the Dragon’s Dogma creator rarely look non-human, I expected that a pre-trained face recognition model would provide good results, which is not always guaranteed for game series like Dark Souls or Bethesda games.
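To make the link between the two metrics concrete, here is a minimal sketch in plain Python. The 4-dimensional vectors are made-up stand-ins for real embeddings, and the helper names are my own. It shows that for unit-length vectors the squared Euclidean distance equals 2 - 2 * cosine similarity, so either metric gives the same ranking.

```python
import math

def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of any length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 4-dimensional "embeddings" (assumption); real ones are e.g. 512-dimensional.
face_a = [0.9, 0.1, -0.3, 0.5]
face_b = [0.8, 0.2, -0.1, 0.6]

cos = cosine_similarity(face_a, face_b)
dist = euclidean_distance(normalise(face_a), normalise(face_b))

# For unit vectors: squared distance == 2 - 2 * cosine similarity
print(cos, dist ** 2, 2 - 2 * cos)
```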
The rest of the task remains unclear. After exploring arXiv, I found two papers by Tianyang Shi from NetEase that offer solutions to this problem. These papers complement each other.
Face-to-Parameter Translation for Game Character Auto-Creation proposes constructing an end-to-end differentiable model that includes a face simulation model based on parameters and renders it in the game engine. The result is then compared with the target image using a face recognition model. Since all parameters of the face creator are continuous and the entire chain is differentiable, gradient descent leads to the most similar face. To avoid problems with generating non-human faces, the authors added a loss function that checks for segment matching between the generated and target faces, restricting inputs to only frontal photos.
Fast and Robust Face-to-Parameter Translation for Game Character Auto-Creation addresses the obvious problem of the first paper: the computational cost of finding parameters for a single example. Gradient descent through several image processing models is clearly not suitable for production. Therefore, the authors decided to train a translator model that inverts the face recognition model’s representation function into parameters and integrates into the existing computation model.
The ideas in these papers seemed like a good foundation but were either not quite suitable for my task or seemed excessive. While almost all parameters in these papers are continuous, the Dragon’s Dogma creator has many categorical parameters, such as face type or skin texture. The impact of such parameters can be assessed, but they are not differentiable, so they cannot be recovered by gradient descent. The steps for simulating the creator so that differentiable images can be fed to recognition and segmentation models also seemed excessive, as they limit the input data set and increase model size. Additionally, ArcFace, the model I planned to use for face recognition, should be resilient to the factors the paper authors aimed to mitigate.
Experiments with the Dragon’s Dogma creator showed that the generated faces are within the expected range for the recognition model when parameters are kept away from extremes.
As a result, I outlined a plan to address these issues using three models:
An imitation model that directly translates parameters into the face representation space of the recognition model. This is a lightweight model that approximates the creator's end-to-end face generation together with the recognition model's representation of the result.
A forward model to assess how closely a character with given categorical variables can resemble a target face. This task, except for the metric, is almost identical to classification.
A forward model that predicts continuous parameter values which, given the categorical variables, make the character's face closest to the required one.
Each of these tasks can then be solved as a simple regression problem with an L2 loss.
Training the Imitator
Training this model requires a dataset consisting of pairs of “parameter set – resulting recognition model representation.” The model structure and loss functions are straightforward. All categorical variables are one-hot encoded, and all others are normalised. A multilayer neural network then maps the resulting vector to the face representation. The hard part is gathering the pairs of parameters and images, which are then transformed into representations. The key consideration in planning is how quickly each of these steps can be completed.
Extracting a representation from an image is not a very quick operation: about 24 images per second on a GPU, or around one image per second on a CPU. Saving images to disk for debugging and preservation runs at a similar speed. A game running at 60 frames per second polls keys every frame, so reliable automation should be limited to 20-30 menu operations per second. Some menu actions trigger additional work in the game, such as loading, during which it ignores key presses. Certain actions, such as changing face type or skin type, require a few frames to update.
Ultimately, by focusing only on important parameters and excluding makeup and hair details, which players do not need help with, the following figures can be achieved:
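The input encoding described above can be sketched as follows. The parameter names, option counts, and slider ranges are hypothetical placeholders, not the game's actual schema. Each categorical value becomes a one-hot segment, and each continuous value is scaled to the [0, 1] range.

```python
# Hypothetical schema (assumption): names, option counts and ranges are
# illustrative only and do not come from the game.
CATEGORICAL = {"face_type": 40, "skin_type": 38}            # options per parameter
CONTINUOUS = {"eye_size": (0.0, 100.0), "jaw_width": (-50.0, 50.0)}

def encode(params):
    """Turn one parameter set into a flat input vector for the imitator."""
    vector = []
    for name, n_options in CATEGORICAL.items():
        one_hot = [0.0] * n_options
        one_hot[params[name]] = 1.0                          # mark the selected option
        vector.extend(one_hot)
    for name, (low, high) in CONTINUOUS.items():
        vector.append((params[name] - low) / (high - low))   # min-max scaling
    return vector

example = {"face_type": 3, "skin_type": 17, "eye_size": 25.0, "jaw_width": 0.0}
encoded = encode(example)
print(len(encoded))  # 40 + 38 + 2 = 80
```

With the full set of 58 continuous parameters, the same scheme would simply produce a longer vector.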
Moving a single slider to the desired position takes around two seconds.
Setting the most important sliders for a random face takes around two minutes, limiting the process to 720 images per day.
A complete cycle over all 38*40 combinations of skin and face types for one gender takes 5 minutes.
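The throughput figures above follow from simple arithmetic, worked through here as a quick check. The variable names are mine; the numbers come from the timings listed above.

```python
MINUTES_PER_DAY = 24 * 60

# Random-slider strategy: about two minutes per face.
random_faces_per_day = MINUTES_PER_DAY // 2

# Categorical enumeration: 38 * 40 skin/face combinations per gender.
combos_per_gender = 38 * 40
combos_both_genders = 2 * combos_per_gender

# One single-gender sweep takes about five minutes.
sweep_images_per_minute = combos_per_gender / 5
random_images_per_minute = 1 / 2

print(random_faces_per_day)                                 # 720
print(combos_per_gender)                                    # 1520
print(combos_both_genders)                                  # 3040
print(sweep_images_per_minute / random_images_per_minute)   # 608.0
```

Per minute of game time, enumerating categorical combinations produces several hundred times more images than setting every slider from scratch.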
Combination enumeration is the fastest way to obtain images and covers the most important factors. In about 5 minutes, setting all continuous parameters to random positions and then fully enumerating all categorical variables for both genders yields 3040 images. I ran two strategies for image acquisition over a week:
Testing extreme values for each of the 58 continuous parameters while keeping all other parameters default. This took two days.
Generating random sets of continuous parameters. This took the rest of the working week.
The result was a dataset of slightly over two million images, comparable in size to the datasets used to train most face recognition models. However, it’s worth noting that in the 58-dimensional continuous parameter space, this dataset contains only about four hundred distinct value combinations, which is sufficient for a linear approximation but quite minimal for assessing complex nonlinear dependencies.
Based on this dataset, I trained an imitation network that reaches a mean squared error of 9.53 in the 512-dimensional embedding space, where the average vector length is 23. This corresponds to a cosine similarity of about 0.94 between predicted and true embeddings, which is more than acceptable for recognition tasks. Inference with this model is lightning-fast: it takes milliseconds, whereas sampling the ground truth values requires driving the game itself.
Single Pass Parameter Estimation
To find character parameters for a given target face representation, I used a two-step approach. First, I prepared a data batch comprising all 3040 combinations of categorical features, with randomised or default continuous parameters. This batch is passed through the imitator network, yielding approximations of the facial embeddings. Then, with the categorical features fixed and the imitator network frozen, I backpropagate the distance between the resulting embeddings and the target embedding to the continuous parameters. The optimiser therefore adjusts only the continuous parameters to minimise the distance to the target embedding.
This way, it’s already possible to get character parameters by simply picking the best-fitting categorical features together with the continuous parameters that provide that fit. However, this procedure takes a whole 8 seconds to run 100 optimiser iterations for a single image. That is fast enough for evaluation, but clearly not fast enough for any sort of service.
However, this method lets me generate new datasets tailored to arbitrary target facial representations. By reducing the number of iterations and increasing the learning rate, I was able to get half a million targets and their optimised parameters, using my initial dataset and celebrity faces from the IMDB-WIKI dataset as targets. This was enough to train both the categorical parameter network and the continuous parameter network. Boring part aside, this dropped the parameter search time below one second on CPU, matching the processing time for embedding retrieval without a significant decrease in quality.
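The search can be illustrated with a deliberately tiny example. This is only a sketch under strong assumptions: the real imitator is a trained neural network optimised with automatic differentiation, while here it is replaced by a hand-written linear map (W and OFFSETS are made up) so that the gradient can be written out by hand. The structure is the same: the imitator stays frozen, each categorical option is tried, and only the continuous sliders are updated.

```python
# Toy frozen "imitator" (assumption): embedding = OFFSETS[category] + W @ x.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]   # 4-dim embedding, 2 sliders
OFFSETS = [[0.0, 0.0, 0.0, 0.0],                        # one offset per categorical option
           [1.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 2.0, 0.0]]

def imitator(category, x):
    """Predict an embedding from a categorical option and slider values."""
    return [OFFSETS[category][i] + sum(w * xj for w, xj in zip(W[i], x))
            for i in range(len(W))]

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def fit_continuous(category, target, steps=100, lr=0.1):
    """Gradient descent on the sliders only; the imitator is never changed."""
    x = [0.5] * len(W[0])                               # default slider positions
    for _ in range(steps):
        residual = [e - t for e, t in zip(imitator(category, x), target)]
        # d/dx ||W x + offset - target||^2 = 2 * W^T residual
        grad = [2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
                for j in range(len(x))]
        # Step, then clamp to the normalised slider range [0, 1]
        x = [min(1.0, max(0.0, xj - lr * g)) for xj, g in zip(x, grad)]
    return x, squared_distance(imitator(category, x), target)

def best_parameters(target):
    """Try every categorical option and keep the best overall fit."""
    fits = [(c, *fit_continuous(c, target)) for c in range(len(OFFSETS))]
    return min(fits, key=lambda fit: fit[2])

# A target the toy model can reach exactly: option 1 with sliders (0.2, 0.8).
target = imitator(1, [0.2, 0.8])
category, sliders, loss = best_parameters(target)
print(category, [round(s, 3) for s in sliders], loss)
```

In the real setup the loop over categorical options is a single batch of 3040 rows, so all options are optimised in parallel.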
Results
A holdout test on IMDB faces showed that, without embedding preprocessing, the whole system reaches an average cosine similarity of 0.358 between target faces and faces reconstructed in the game engine. The similarities are also tightly concentrated in the 0.3-0.4 range. This is below the threshold of 0.42 that is usually used for 512-dimensional ArcFace embeddings. However, reconstructing faces from the game itself yields an average cosine similarity of 0.91, which suggests that the bottleneck lies in the capabilities of the game's character creator rather than in the applied method. Here are some examples of the model output based on a single image embedding. Averaging embeddings over multiple images of the same target should yield better results, but I have not evaluated it yet.
There are also two other ways to double-check the achievable result range. First, assume that embeddings follow a 512-dimensional standard normal distribution. This distribution is very close to a uniform distribution on a 512-dimensional sphere of radius sqrt(512) = 22.627, so the cosine similarity between two random vectors is approximately normal with mean 0 and standard deviation 1/sqrt(512), and its tail can be estimated with a one-dimensional normal distribution. This leads to the following numbers:
For cosine similarity > 0.2 the probability is approximately 1 in 330,000.
For cosine similarity > 0.3 the probability is approximately 1 in 1.8 × 10^11.
For cosine similarity > 0.358 the probability is approximately 1 in 3.7 × 10^15.
For cosine similarity > 0.4 the probability is approximately 1 in 1.4 × 10^19.
So I would say the result may not represent the target person well enough to be immediately recognisable, but you will definitely struggle to find a better lookalike in the real world.
Another way to estimate the best achievable similarity is to check how well a 512-dimensional unit sphere can be covered by 3040 initial points, each with an immediate neighbourhood parameterised by 58 continuous parameters. This also suggests that a cosine similarity of 0.358 is quite a reasonable result.
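For anyone who wants to reproduce this kind of estimate, here is a small sketch of the one-dimensional normal approximation. It assumes the cosine similarity of two random 512-dimensional directions is distributed as Normal(0, 1/512). The function name is my own. The tail is computed with math.erfc, because computing it as 1 minus the cumulative distribution loses all precision this far out.

```python
import math

DIM = 512  # embedding dimensionality

def random_match_odds(cosine, dim=DIM):
    """Return N such that P(cos > cosine) is about 1 in N for two random
    directions, assuming cos ~ Normal(0, 1/dim)."""
    z = cosine * math.sqrt(dim)                  # threshold in standard deviations
    # erfc stays accurate in the far tail, unlike 1 - cdf(z)
    tail = 0.5 * math.erfc(z / math.sqrt(2.0))
    return 1.0 / tail

for threshold in (0.2, 0.3, 0.358, 0.4):
    print(f"cos > {threshold}: about 1 in {random_match_odds(threshold):.2e}")
```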
Conclusion
So this was a fun free-time project that took a couple of weeks, most of which went into collecting the data. The results seem good enough to lend a helping hand to the gaming community. The main roadblock is the face recognition model: the largest ArcFace model is total overkill, and a much faster model trained on a public dataset is needed to publish this as a service.