Can AI create a rap music video? / Habr

Hello everyone, this is Denis Weber.

One day, I was once again looking for a 3D model on a stock site for my project and came across a neural network that can create high-quality 3D models in just a couple of clicks. I wondered whether it would be possible to create something like a music video using only the capabilities of existing neural networks.

If you prefer the video format, I will leave a link to the video at the end of the post.

A few years ago, people were laughing at the possibilities of AI, creating two-headed dogs and three-toed people in the very first version of Midjourney.

Today, some artists, developers and other people in creative professions seem a little bit scared of how fast artificial intelligence is developing.

People imagined that neural networks would be used by robots and would, for example, shift heavy parts on a factory conveyor. But with such a rapid rate of development, sometimes it is simply impossible to predict what will happen next.

The neural network I had come across on that stock site promised high-quality 3D models in just a couple of clicks. At least, that is what their website claimed. I had seen such neural networks before, and the quality of the models left much to be desired.

I tried creating a model and was really surprised by the result. This time the model was not just a set of polygons that roughly resembled the reference, but something quite recognizable.

Some time ago, I was creating music with another neural network, and that gave me an idea. Would it be possible to create something like a music video using only the capabilities of existing neural networks? I had no idea whether I would be able to make anything at all that I would not be ashamed to show you. But it was supposed to be fun anyway.

The goal was to minimize the amount of my work and delegate it to AI. After creating the music video, I will finally be able to answer the main question: Will AI replace humans?

I wanted to use all the neural networks I know.

ChatGPT for writing the lyrics, location descriptions and other small tasks. Midjourney and Copilot Designer for reference images of locations and objects in the scene. Suno for the music track. Rodin for 3D models. Mixamo for animations. Blender's built-in tools for everything else. And Vocalremover for working with voice and music.

Rap and hip-hop are always among the most popular music genres, so I decided to keep up and chose them. I began with the lyrics. Usually a rapper writes about cool cars, money and a tough fate. I removed everything unnecessary and kept only the tough fate of a man who talks about his cat.

Writing a platinum track is not a five-minute task, so I had to try my best. I wanted to add more hate to the lyrics. The owner of the cat should definitely be angry that his pet did not obey him.

Suno was great for creating AI-generated music. I tried several times to create the track I needed, and it produced something interesting, but the style was still not exactly what I was looking for.

A few more attempts and I finally got what I wanted: great figures of speech, a mood of despair and, at the same time, anger that something like this had happened. All that remained was to create equally beautiful music for this masterpiece.

To my surprise, on the first attempt with the new lyrics, Suno delivered the best that could be composed.

Once I had settled on the track, I wanted to visualize the location, at first only for myself, and asked Midjourney to create images of a low-poly scene for me.

At first I didn't count much on the capabilities of neural networks that work with 3D, so I assumed that I would have to create each model in the scene separately.

I tried to look for inspiration in Microsoft's Copilot, but it didn't generate anything that suited my vision at all. That was a big point in Midjourney's favor.

It was time to decide what the main character of the video would look like. It was difficult to tell from the track what kind of appearance the performer should have, but the voice sounded more like a woman's than a man's.

Since I planned to generate the model using Rodin, it was important to me that the character be in a T-pose.

I tried using the same neural networks, but then I remembered that Rodin itself can create images from a text prompt and immediately generate 3D models from them.

It took some more time to choose and create a model for the main character. I wanted to get some kind of bright appearance along with a stylish look. And of course, she had to look like the owner of the cat. And here she is. A few more seconds to create textures and the main character was ready.

The next step was to create animations. As I said, I used Adobe's Mixamo service. All you need to do is upload the finished model, place the markers for the Auto-Rigger, and you're done.

The second main character of the music video is a cat. Rodin has strange ideas about cats. Just a few attempts and here he is: a ginger bastard who did not want to learn how to go to the toilet properly. If you put the camera at a certain angle, he doesn't look so bad.

If Rodin could generate an entire scene, it would have saved me several hours of work. I tried uploading images with an isometric view, but apparently this neural network is not yet good enough at creating such complex models. So, as I had expected, I had to generate each part of the scene as a separate model.

I wanted my video to look at least approximately like a music video, and for that the main character had to sing. I edited the model a bit and added a mouth. Then I created the lip movement with shape keys, by moving a single slider.
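For readers curious about what that single slider does, here is a minimal sketch of the idea behind a relative shape key: each vertex moves in a straight line from its basis position toward a target position as the slider goes from 0 to 1. This is plain Python rather than Blender code, and the function name and the two-vertex "mouth" are made up for illustration.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def blend_shape_key(basis: List[Vec3], target: List[Vec3], value: float) -> List[Vec3]:
    """Move every vertex from its basis position toward the target by `value`."""
    if len(basis) != len(target):
        raise ValueError("basis and target must have the same vertex count")
    return [
        tuple(b + value * (t - b) for b, t in zip(bv, tv))
        for bv, tv in zip(basis, target)
    ]


# Hypothetical two-vertex "mouth": the upper lip stays put, the lower lip drops.
closed_mouth = [(0.0, 0.0, 1.0), (0.0, 0.0, 0.9)]
open_mouth = [(0.0, 0.0, 1.0), (0.0, 0.0, 0.7)]

# Slider at 0.5: the lower lip ends up halfway between closed and open.
half_open = blend_shape_key(closed_mouth, open_mouth, 0.5)
```

With the slider at 0 the mouth stays closed, and at 1 it is fully open, so animating one number over time is enough to make the character appear to sing.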

To create the lip-sync animation in Blender correctly, I separated the vocals from the music using a neural network from the vocalremover website. It didn't have much effect on the final result, but honestly, I wanted to use as many neural networks as possible in my project.

I uploaded an isolated vocal track and with a few clicks created a lip animation based on the audio track. Maybe it didn't look quite right, but it was quite suitable for the music video.
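I am not describing the exact tool behind those few clicks here, but one simple way to drive a mouth from audio is to measure how loud the vocal is during each video frame and use that as the slider value. The sketch below is an assumption about how such a tool could work, not the method actually used; the function name and the tiny input signal are hypothetical.

```python
import math
from typing import List


def mouth_values(samples: List[float], sample_rate: int, fps: int) -> List[float]:
    """Return one mouth-open value in [0, 1] per video frame, based on RMS loudness."""
    window = sample_rate // fps  # number of audio samples that fall into one frame
    if window <= 0:
        raise ValueError("sample_rate must be at least fps")
    rms = []
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        rms.append(math.sqrt(sum(s * s for s in chunk) / window))
    peak = max(rms, default=0.0)
    if peak == 0.0:
        return [0.0 for _ in rms]  # silence: the mouth stays closed
    return [r / peak for r in rms]  # the loudest frame opens the mouth fully


# Hypothetical signal: three frames of audio (silent, medium, loud).
signal = [0.0] * 10 + [0.5] * 10 + [1.0] * 10
values = mouth_values(signal, sample_rate=240, fps=24)
```

Each resulting value could then be keyframed on the mouth shape key for the matching frame. An isolated vocal track helps with this kind of approach, because the drums and bass would otherwise open the mouth too.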

In 3D, skeletal animation is used to move characters. Mixamo created the skeleton for the main character when I uploaded the model there. Unfortunately, Mixamo is not yet able to create an animal skeleton, so I had to create the cat's skeleton myself.

And then I faced a problem caused by the use of AI. When I tried to bind the skeleton to the model, an error appeared that pointed to geometry problems. So I asked Rodin to generate a new model, and with that one everything was fine.
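I did not dig into which geometry problem it was. One common kind in generated meshes is an edge that is not shared by exactly two faces, which means a hole or an extra face glued onto the surface. As an illustration only, and assuming that kind of problem, here is a small plain-Python check over a list of faces; the function name and the test meshes are made up.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def problem_edges(faces: Sequence[Sequence[int]]) -> List[Tuple[int, int]]:
    """Return edges that are not shared by exactly two faces."""
    counts: Counter = Counter()
    for face in faces:
        for i in range(len(face)):
            a, b = face[i], face[(i + 1) % len(face)]
            counts[(min(a, b), max(a, b))] += 1  # ignore edge direction
    return sorted(edge for edge, n in counts.items() if n != 2)


# A closed tetrahedron: every edge is shared by exactly two triangles.
tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]

# The same mesh with one triangle missing, so it has a hole.
with_hole = tetrahedron[:-1]

clean = problem_edges(tetrahedron)
broken = problem_edges(with_hole)
```

A clean mesh gives an empty list, while the mesh with a hole reports the three edges around the missing triangle.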

Mixamo even has a separate section with dance animations, and that's exactly what I needed. Eight to ten animations were enough for the video.

Creating models for the scene took the most time. Sometimes the AI produced something like this. And it simply refused to create some objects, like piles of garbage. But overall it did its job well. As I said before, it was not possible to create a completely finished scene with Rodin, so I had to try my best.

Based on the reference from the neural network, I created pieces of walls, garbage cans, barrels, tires and many other small objects that added a certain vibe to the scene.

By the way, write in the comments how many objects you think I created for this music video entirely by myself. I think you will be very surprised. Neural networks can't yet arrange 3D objects in a scene, so I had to place them myself.

I am sure that in a couple of years, or maybe even earlier, it will be possible to create such scenes in two clicks in some new version of a neural network.

The idea was that the video is set in a kind of ghetto, and in such places, in addition to the gloomy vibe, you can often find interesting graffiti on the walls. I asked Midjourney to draw me some graffiti, which I then added to the scene.

I had to create a lot of models using AI. The logic was simple: generate a picture that I like, generate a 3D model from it and, if it looked more or less fine, add it to the scene in Blender. I put up windows, added lanterns, gates, an old wrecked car, scattered newspapers, a neon sign in the shape of a cat, a trash can and much more.

I think it makes no sense to list all the objects, because you will see them in the video very soon.

Once I had placed the objects in the scene, I began to adjust the lighting. I added lights to windows and lanterns, turned on neon signs, and added several other light sources. And of course I wanted to add as many interesting objects to the scene as possible.

It's time for the most creative part of the whole video - adding animations and setting up cameras.

By the way, I created no more than 10 polygons by hand for the entire scene: the planes for windows, graffiti and posters. The neural network did the rest of the work for me. It's really impressive.

Unfortunately or fortunately, AI does not have its own personal vision. It can take ready-made works and do something similar based on them, and that's all. That is a great advantage of human beings.

If you have a favorite musician or artist, you like their creativity and their so-called spirit. A neural network has no such spirit, and it shows.

I think that with the help of neural networks it is possible to automate some processes or simplify the work of people, but not completely replace them.

I have already dived too deeply into the philosophy of the confrontation between machines and humans, so I don't want to keep you waiting any longer and I will show you a music video that was created using AI, but of course not without my help.

I'm glad I finished this project. And it was really interesting for me to try to create it only with the help of AI. Write in the comments whether AI was able to surprise you too. And what do you think about replacing humans with AI?
