
Hi, Habr!
Over the past few months, I have been building a system that I internally call an “anime factory”: it takes a source episode as input and produces a ready-to-publish YouTube Short with dynamic reframing, subtitles, post-processing, and metadata.
What makes it interesting is not just the fact that editing can be automated, but that a significant part of this work can be decomposed into engineering stages: transcription, audio and scene analysis, strong-moment discovery, “virtual camera” control, and a feedback loop based on performance metrics.
In this article, I will show how this pipeline is structured, why I chose a modular architecture instead of an end-to-end black box, where the system broke, and which decisions eventually made it actually usable.
Where the idea came from
For a long time, I kept running into the same problem: any digital product without users is effectively dead. You can build backend systems, automation, and pipelines all day long, but if the project has no distribution channel and no audience attention, it barely moves forward.
My first attempts to automate content-related tasks started back in 2020. At that time, they were simpler ideas around TikTok, Telegram, and content promotion. But manual work hits a ceiling very quickly: finding moments, cutting clips, adding subtitles, converting to vertical format, packaging, publishing — all of that takes too much time and barely scales. One person can produce a few videos per day. A system can produce dozens or hundreds.
At some point, I formulated the problem correctly for myself: I did not need an “editing script.” I needed an actual production loop that turns long-form video into a stream of short clips with minimal manual involvement.
That is how the “anime factory” was born.
What problem the system actually solves
In simplified form, the task sounds like this: take a long horizontal episode and automatically turn it into a short vertical video that works as a self-contained Short.
But once you decompose it into engineering subproblems, a whole set of non-obvious requirements appears immediately:
You need to understand where the episode contains potentially strong moments.
You need to select fragments that work as a micro-story, not just as a random chunk torn out of context.
You need to adapt 16:9 into 9:16 without losing the main character, the emotion, or the visual focus of the scene.
You need subtitles that are quick to read and do not kill the image.
You need to assemble all of this into a stable batch pipeline where individual stages can be restarted independently.
You need to teach the system to analyze publishing results and adjust future selection logic.
At that point, it becomes clear that this is no longer “just a little editing script,” but a fairly mature engineering system with its own artifacts, errors, quality degradation modes, fallback mechanisms, and feedback loops.
Why simple automatic clipping does not work
From the outside, it looks like the problem should be easy to solve. For example:
split the video into equal 30-second chunks;
pick the loudest moments;
crop to the center;
overlay auto-generated subtitles.
In practice, that approach almost always produces garbage.
A loud moment is not necessarily an interesting one. An interesting moment does not necessarily have a good visual focus. A line can be strong only in the context of the previous five seconds. A character’s face can drift out of a centered crop. A scene with two characters falls apart completely if you simply keep a static window in the middle.
So the core idea behind my pipeline was this: do not rely on a single signal. Do not select moments only by text. Do not crop only by center. Do not try to make one model guess the entire process end to end. Instead, combine several relatively independent signal sources into a decision-making system.
Architecture: what the “factory” consists of

| Loop | Purpose | Main output |
|---|---|---|
| Production | Generate videos from the source episode | A ready Short |
| R&D / Analytics | Analyze published videos and update heuristics | New weights and trigger dictionaries |
| Community | Automate interaction around the channel | Replies, warm-up, engagement |
At a high level, my system breaks down into three major loops:
Production loop — the main line that generates videos.
R&D / Analytics loop — analysis of already published videos and heuristic updates.
Community / Interaction loop — additional automation around audience interaction.
Let’s go through each of them in more detail.
1. Production loop: from episode to finished Short
This is the heart of the whole system. This is where the source media content goes through all processing stages and becomes a final vertical video.
Stage 1. Getting the source material
To make the pipeline easier to debug, I intentionally avoided the “one giant script that does everything” approach and instead went for explicit intermediate artifacts.
```
episode_001/
  source.mp4
  transcript.json
  audio_features.json
  scene_cuts.json
  faces.json
  candidates.json
  crop_path.json
  subtitles.srt
  metadata.json
  final_short_01.mp4
```
This structure is important not for aesthetics, but because it allows individual stages to be recomputed independently. For example, I can rebuild crop_path without retranscribing the entire episode, or change subtitle logic without rerunning scene analysis.
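This restartability can be sketched as a small stage runner that reuses an artifact when it already exists on disk and recomputes it only on demand. This is a minimal illustration, not the actual implementation; `run_stage` and the artifact names are hypothetical.

```python
import json
from pathlib import Path

def run_stage(episode_dir, artifact_name, compute_fn, force=False):
    """Run one pipeline stage, reusing its cached artifact if present."""
    artifact_path = Path(episode_dir) / artifact_name
    if artifact_path.exists() and not force:
        # Artifact already computed on a previous run: skip the work.
        return json.loads(artifact_path.read_text())
    result = compute_fn()
    artifact_path.write_text(json.dumps(result))
    return result
```

With this shape, rebuilding only `crop_path.json` means calling `run_stage` for that one artifact with `force=True`, leaving `transcript.json` and the rest untouched.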

At the pipeline entrance, an episode arrives. For the system, it is just raw material: a video file that must be parsed, indexed, scored, and turned into several potential short-clip candidates.
Even at this stage, it was important not to build something that simply “downloads the file and moves on,” but to introduce a proper artifact structure. For each episode, the system stores separate intermediate results: metadata, transcripts, timestamped clip candidates, CV analysis results, detected faces, crop parameters, and final renders. That may sound like a boring infrastructure detail, but it is exactly what makes the system maintainable.
If I had to rerun the entire episode from scratch every time, development would have been painful. With this design, I can recompute only dynamic cropping or only subtitle logic without touching the rest of the pipeline.
Stage 2. Transcription and working with speech
The next layer is turning audio into timestamped text. At this point, the system gets not just one continuous transcript, but speech segments tied to time. This matters for two reasons:
First, the text itself already provides a strong signal about scene content.
Second, the same segments are later used for subtitles and for binding semantic fragments back to the video.
But I quickly discovered that “take the transcript and search for interesting lines” is not enough.
Multimedia content has an unpleasant property: the emotional force of a scene is not always in the text. Sometimes the text is neutral, but the scene has powerful music, a tense pause, a camera cut, or a strong facial expression. Sometimes it is the opposite: the line itself is strong, but without visual context it does not work.
So for me, the transcript is one signal — not the single source of truth.
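Because segments carry timestamps, binding them back to a video window is a simple overlap query. The sketch below assumes Whisper-style segment dicts with `start`, `end`, and `text` keys; the function name is illustrative.

```python
def segments_in_window(segments, start, end):
    """Return transcript segments that overlap the [start, end) time window.

    Each segment is assumed to be a dict with "start", "end", and "text",
    as produced by a timestamped transcription step.
    """
    return [
        seg for seg in segments
        if seg["start"] < end and seg["end"] > start
    ]
```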
Stage 3. Audio analysis
In simplified form, one of the internal audio passes looks like this:
```python
def extract_audio_signal(window):
    speech_density = measure_speech_density(window)
    loudness_peak = detect_loudness_peak(window)
    energy_delta = detect_energy_change(window)
    return (
        0.45 * speech_density
        + 0.35 * loudness_peak
        + 0.20 * energy_delta
    )
```
Of course, the real implementation is more complex: it includes normalization, thresholds, protection against false spikes, and combinations with other signals. But the core idea is the same: audio is not used as a standalone oracle, but as another layer in evaluating a moment.

In parallel with text, the system analyzes the audio track itself. I look not only at the presence of speech, but also at the energy structure: loudness peaks, emotional spikes, transitions, sections with pronounced sound dynamics, musical pressure, and so on.
The purpose of this stage is not to blindly choose the loudest chunk, but to add another axis of evaluation. In real videos, what often works is the combination of:
a strong short line,
a pronounced audio transition,
a visual accent in the frame.
If you use only text, you miss these scenes. If you use only audio, you collect meaningless explosions and screams. Together, the signals work much better.
Stage 4. Computer Vision: scenes, faces, and visual events
In simplified form, useful visual signal detection looks something like this:
```python
def analyze_frame(frame):
    faces = detect_faces(frame)
    scene_score = detect_scene_change(frame)
    face_focus_score = estimate_face_focus(faces, frame)
    return {
        "faces": faces,
        "scene_score": scene_score,
        "face_focus_score": face_focus_score,
    }
```
In practice, what matters here is not just the fact that face detection exists, but how that data is used downstream: can we confidently build a vertical crop window, does it make sense to hold on one character, is there a transition between characters, does the composition fall apart?

The next major block is computer vision. Here the system solves several tasks at once:
detects scene changes;
determines whether there is a face in the frame and where it is;
estimates whether the frame is suitable for vertical focus;
extracts visual features that later participate in candidate scoring.
In practice, this turned out to be one of the most useful layers in the whole system. Without faces and scene analysis, vertical adaptation was too crude. A centered crop destroys a large part of the image’s meaning: one character may stand on the left, another on the right, while the center of the frame contains almost nothing interesting.
Once the system started tracking faces and their positions, it became possible to build a “virtual camera” — not just crop the video, but imitate camera work within the original frame.
Stage 5. Finding clip candidates
| Signal | What it evaluates | Why it matters |
|---|---|---|
| Transcript signal | Density and meaningfulness of lines | To understand whether there is a semantic hook |
| Audio signal | Emotional peaks and dynamics | To avoid missing strong audio-driven moments |
| Face signal | Presence of the main character in frame | To determine whether vertical focus is feasible |
| Scene signal | Scene changes and visual density | To avoid empty or visually weak windows |
| Pacing signal | Tempo and internal rhythm of the fragment | To filter out sluggish or overly chaotic parts |
After text, audio, and CV signals are collected, the system forms candidate clips.
This is not one timeline pass with a simple rule like “every 30 seconds take the best fragment.” Instead, the video is decomposed into potential windows, a feature set is computed for each one, and then a final score is calculated.
In simplified form, the logic looks like this:
```python
score = (
    transcript_weight * transcript_signal
    + audio_weight * audio_signal
    + face_weight * face_signal
    + scene_weight * scene_signal
    + pacing_weight * pacing_signal
)
```
Naturally, the real system is messier: it has penalties, thresholds, fallback heuristics, length limits, empty-fragment checks, duplicate filtering, and re-evaluation of neighboring windows. But the core idea is exactly this: do not make the decision from a single feature, but combine several relatively weak signals into one more stable score.
An important nuance: the system is not looking for simply “an interesting 20 seconds,” but for fragments that have a chance to feel like a complete micro-episode. That strongly affects output quality. A Shorts viewer does not need to know the context of the full episode, so the clip should still hold together on its own.
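The "decompose into windows, score each, keep the best non-overlapping ones" idea can be sketched in a few lines. This is a simplified illustration with hypothetical names, not the real selection logic (which also applies penalties, thresholds, and re-evaluation of neighbors).

```python
def generate_candidate_windows(duration, win_len=20.0, stride=5.0):
    """Enumerate overlapping candidate windows over the timeline."""
    windows = []
    t = 0.0
    while t + win_len <= duration:
        windows.append((t, t + win_len))
        t += stride
    return windows

def dedup_candidates(scored, min_gap=10.0):
    """Keep the best-scoring windows, dropping near-duplicate starts.

    `scored` is a list of ((start, end), score) pairs.
    """
    kept = []
    for window, score in sorted(scored, key=lambda x: -x[1]):
        # Accept a window only if it starts far enough from every kept one.
        if all(abs(window[0] - k[0][0]) >= min_gap for k in kept):
            kept.append((window, score))
    return kept
```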
Stage 6. Dynamic reframing — the “virtual camera”
Internally, this is closer to a constrained state machine than to magic:
```python
def update_crop_window(prev_window, target_focus, dt):
    desired_window = build_window_around_focus(target_focus)
    smoothed_window = smooth_transition(prev_window, desired_window, dt)
    limited_window = limit_shift_speed(smoothed_window, prev_window, dt)
    return clamp_to_frame(limited_window)
```
Three things are fundamentally important here:
the system must not twitch because of noisy detections;
the window must not move faster than a visually comfortable speed;
when the face disappears, a fallback must activate instead of a chaotic jump.
This is probably the most interesting and also the most temperamental part of the whole system.
If we have a horizontal video and want to turn it into a vertical Short, there are several options:
do a dumb centered crop;
choose one static focus area;
try to control the crop window dynamically.
The first two approaches quickly showed their limitations, so I moved to the third.
The “virtual camera” logic is roughly as follows:
if there is one obvious character in frame, the camera tries to keep them in focus;
if there are multiple faces, it chooses a strategy somewhere between holding the main object and smoothly shifting between characters;
if faces disappear temporarily, fallback logic kicks in so that the camera does not jerk around;
all movement is smoothed to avoid the feel of broken auto-tracking.
From an engineering perspective, this turned out to be much closer to state control than to “magical AI.” Inertia, stabilization, shift-speed limits, protection against shaky detections, and proper handling of object disappearance are all crucial.
The most annoying part of this module is that a formally “correct” solution does not always look good visually. The camera can mathematically follow the face perfectly and still make the clip unpleasant to watch. So I had to balance tracking precision against visual smoothness.
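To make the "tracking precision vs. visual smoothness" trade-off concrete, here is a one-dimensional sketch of the smoothing-plus-speed-limit idea: exponential smoothing pulls the crop center toward the face, and a hard per-frame cap prevents jerky motion. The numbers and the function name are illustrative, not the production values.

```python
def smooth_center(prev_x, target_x, alpha=0.2, max_step=8.0):
    """Move the crop-window center toward the target focus point.

    alpha    -- exponential-smoothing factor (higher = snappier tracking)
    max_step -- hard cap on movement per frame, in pixels
    """
    desired = prev_x + alpha * (target_x - prev_x)          # smoothing
    step = max(-max_step, min(max_step, desired - prev_x))  # speed limit
    return prev_x + step
```

Tuning `alpha` down and `max_step` down makes the camera calmer but slower to catch a character who moves across the frame; that balance had to be found by watching output clips, not by math alone.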
Stage 7. Subtitles and post-processing
For subtitles, it is important not only what is written, but how the text is split into lines and timed. In simplified form, the packing logic looks like this:
```python
def build_subtitle_lines(segment, max_chars=24):
    words = segment["text"].split()
    lines = wrap_words(words, max_chars=max_chars)
    return highlight_keywords(lines)
```
In reality, this layer also accounts for line breaks, line length, readability on a mobile screen, synchronization with speech, and visual emphasis of key words.
| Poor version | Better version |
|---|---|
| Long lines taking up half the screen | Short, readable lines |
| Random line breaks | Meaningful breaks by phrase |
| Tiny text | Phone-readable size |
| Uniform presentation | Highlighting key words |

After selecting the moment and building the virtual camera trajectory, the system moves into final packaging.
At this stage, more familiar steps kick in:
subtitle rendering by timecode;
line-length limits and line-break control;
visual emphasis of important words;
loudness normalization;
watermarking;
speed correction and additional video effects;
final export.
Subtitles, by the way, turned out not to be a decorative detail, but a part of the attention-retention mechanics. Poorly typeset auto-subtitles kill perception very quickly. Well-assembled ones, on the contrary, hold the viewer’s gaze even when the person is watching without sound or only half-paying attention.
That is why this layer is not just “burn the transcript onto the video.” It has its own composition and presentation logic.
Stage 8. Metadata and release
After rendering, the clip receives packaging data: title, description, set of tags, and auxiliary fields for publishing and notifications. An important detail here is that video production does not end with the mp4 file. For a normal content pipeline, you also need a packaging and delivery layer that moves the result further through the system.
That is why the pipeline has separate steps for preparing metadata and notifications, so that the process does not get stuck at a manual “I’ll title and upload it later.”
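The packaging step itself is mundane: it assembles a payload that travels with the rendered file through the rest of the system. A minimal sketch, with a hypothetical function name and field set:

```python
def build_release_metadata(episode_id, clip_index, title, tags):
    """Assemble the packaging payload stored next to the rendered mp4."""
    return {
        "episode_id": episode_id,
        "clip_index": clip_index,
        "title": title,
        "tags": sorted(set(tags)),       # dedupe tags for the upload step
        "status": "ready_for_upload",    # picked up by the publishing worker
    }
```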
Why I chose a modular architecture instead of one big ML model
Whenever you describe a project like this, the question comes up almost immediately: why not make it end-to-end? For example, feed the video to a model and ask it to output a ready-made Short.
The answer is very practical: because from an engineering standpoint, that would be much less convenient.
A modular architecture provides several critically important advantages:
each stage can be debugged independently;
a weak module can be replaced quickly without rewriting everything else;
intermediate artifacts can be stored and reused;
it becomes much easier to understand why the system made a particular decision;
fallback scenarios and fail-soft behavior become possible.
If face detection performs poorly, I improve the CV layer. If the selected moments are weak, I change scoring. If the videos are jerky, I refine the virtual camera. It is a very engineering-driven approach: less magic, more observability and control.
For a production system, this path turned out to be much more practical than one opaque black box.
Architectural principles without which this would quickly collapse
Over the course of building the system, I developed several principles without which a pipeline like this turns into an uncontrollable monolith very quickly.
1. Independent stages
Each stage should be able to work as an independent pipeline step. This allows me to rerun only the needed part of processing instead of wasting resources on the whole loop.

2. Artifact persistence
Transcripts, detected faces, candidate windows, crop trajectories, final timecodes — all of that must be persisted between steps. Without that, any debugging process becomes torture.
3. Fail-soft instead of fail-fast
For an experimental pipeline, it is not enough to "fail cleanly" — it must be able to degrade into an acceptable result. If no face is found, use a fallback crop. If tracking is jerky, smooth it and limit the speed. If a confident signal disappears, reduce its weight and continue.
4. Simple heuristics are often more useful than “complex magic”
In many places, the most stable results did not come from heavy models, but from a combination of sane constraints, good thresholds, repeatable rules, and careful scoring.
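The fail-soft principle in one picture: a crop-selection helper that centers on a detected face when one exists and silently falls back to a static centered crop when detection returns nothing. Names and the detection interface are hypothetical; widths are in pixels.

```python
def crop_left_edge(detect_fn, frame, frame_width, crop_width):
    """Pick the left edge of a vertical crop window, fail-soft.

    detect_fn(frame) is assumed to return a list of face dicts with a
    "cx" horizontal center, ordered by confidence; an empty list means
    no face was found.
    """
    faces = detect_fn(frame)
    if faces:
        center = faces[0]["cx"]      # follow the most confident face
    else:
        center = frame_width / 2     # fallback: static centered crop
    # Clamp so the crop window never leaves the source frame.
    left = min(max(center - crop_width / 2, 0), frame_width - crop_width)
    return left
```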
2. Analytics loop: how the system learns from its own publications
If the story ended there, this would simply be a good clip generator. But for me, it was important to go further and build a loop that not only produces content, but also gradually adapts its heuristics based on what actually performs well.
That is why I introduced a separate analytics worker.
Its job is to periodically traverse published videos, collect data from the strongest-performing ones, and extract patterns that can later be used when forming the next batch of candidates.
In practice, this layer solves tasks like these:
collecting successful videos from the channel;
analyzing subtitle length, structure, and vocabulary;
looking at which characters, words, scene types, and pacing patterns appear most often in successful publications;
updating internal weights and trigger dictionaries;
feeding those updates back into the production-loop scoring logic.
It is important to emphasize here: this is not “full self-learning” in the academic sense. It is closer to an engineering feedback loop that allows the system to become less static.
For example, if successful videos repeatedly feature specific characters, types of lines, or pacing styles, the system starts weighing those signals more heavily during the next selection cycle.
```python
def update_trigger_weights(top_videos):
    trigger_stats = collect_trigger_stats(top_videos)
    return normalize_weights(trigger_stats)
```
This is not “training a neural network from scratch,” but an engineering mechanism for adjusting weights based on the observed behavior of already published videos.
In essence, the loop looks like this:
production -> publishing -> metrics -> analytics -> heuristic updates -> production
For me personally, this became one of the most interesting parts of the project, because this is exactly where an “editing script” turns into a system that accumulates applied knowledge about its domain.
How scoring works and why it changes all the time
Candidate scoring cannot be fixed once and then forgotten. On paper, you can always invent a beautiful formula, but the real viewer does not watch the formula — they watch the clip. Audience behavior quickly shows which hypotheses worked and which did not.
That is why my scoring layer was designed from the start to support continuous tuning.
What changes there:
weights of different signals;
penalties for weak or empty fragments;
priorities for specific trigger dictionaries;
length limits;
selection rules and deduplication conditions for similar moments.
This is very different from the feeling of “write the pipeline once and forget it.” In reality, a system like this is a living mechanism that constantly requires heuristic revision.
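One small mechanism that makes this tuning safe is to keep the signal weights normalized after every adjustment, so boosting one signal implicitly dampens the others. A minimal sketch under that assumption (the function name and the floor value are illustrative):

```python
def retune_weights(weights, adjustments, floor=0.05):
    """Apply multiplicative adjustments to signal weights, clamp to a
    floor so no signal is silenced entirely, and renormalize to sum 1.
    """
    raw = {k: max(w * adjustments.get(k, 1.0), floor)
           for k, w in weights.items()}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}
```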
3. Interaction loop: automation around comments and audience warm-up
As a separate direction, I experimented with an audience-interaction module. The idea was that publishing videos is not the only activity around a channel. For new accounts — and in general for engagement growth — behavior in comments also matters.
So I built a separate layer that can generate replies in a more natural style rather than like a typical soulless bot.
For that, I used real conversation logs as a style dataset. The goal was not to “deceive the user,” but to avoid the typical bot-like spam tone and make responses closer to a natural human pattern of short interaction.
This is not the system’s main module yet, but as an engineering experiment it turned out to be quite useful: it showed that additional automation gradually starts growing around content production — not only for the video itself, but also for accompanying audience touchpoints.
Technologies used
From a stack perspective, this is not one monolithic “AI product,” but a composition of several practical tools, each solving its own part of the pipeline.
Python — the main orchestration language for the whole pipeline. It ties together video analysis, transcription, post-processing, subtitle generation, and supporting integrations.
MoviePy + imageio[ffmpeg] — clip assembly, work with video fragments, concatenation, export, and basic post-processing. At the low level, the whole story obviously rests on FFmpeg.
Whisper — audio transcription and timestamped speech segments. This layer is later used both for semantic analysis and subtitle rendering.
OpenCV + MediaPipe — frame analysis, face detection, scene-change handling, and signals for dynamic reframing. This layer helps the system understand where the main character is and how best to adapt a horizontal frame to a vertical format.
Pillow + Pilmoji — subtitle rendering, text styling over video, emoji handling, and visual packaging of the final clip.
NumPy — base computations, array handling, and numerical operations for signal analysis and intermediate processing.
PyYAML + python-dotenv — pipeline configuration, processing parameters, and environment management.
requests + lxml — obtaining and parsing source data, working with external sources, and automating the content-ingestion stage.
google-api-python-client + google-auth + google-auth-oauthlib — integrations with external Google services for surrounding automation around the pipeline.
Playwright — browser automation for cases where an interface-driven scenario is more convenient than an API.
inference-sdk + OpenAI API — separate AI layers and auxiliary inference tasks related to analysis and decision-making in the pipeline.
tqdm — a small operational detail, but useful: progress tracking for long batch jobs and easier debugging of long runs.
Why this stack specifically? Because this system is orchestration-heavy by nature. There is a lot of “glue” code, intermediate artifacts, batch processing, research iterations, and quick logic changes. For that kind of mode, Python turned out to be a natural choice.
If the task were reduced to one narrow, high-load media service, some components might make more sense in a lower-level implementation. But at the active R&D stage, speed of evolution, observability, and the ability to quickly change individual pipeline stages were more important to me than academic “stack purity.”
Which problems cost me the most time
From the outside, projects like this look flashy: “the system edits video by itself.” But the real work is largely a fight against edge cases.
Problem 1. A strong text moment is not always a strong visual moment
Transcription alone was not enough. The system regularly found strong lines in scenes that worked poorly as clips without visual context. The solution was to combine text with audio and CV signals.
Problem 2. A face is detected, but the frame still looks bad
The presence of face detection does not automatically mean the clip will look good in 9:16. Sometimes the object is found, but the composition still falls apart. I had to introduce additional constraints and a fallback strategy.
Problem 3. An overactive virtual camera becomes annoying
A naive implementation of dynamic cropping starts twitching very quickly and looks like broken auto-tracking. This required a lot of work on smoothing, inertia, and crop-window speed limits.
Problem 4. The “engineering-best” clip is not always the best by metrics
This was probably the most sobering moment. Sometimes a clip that feels more polished, coherent, and “higher quality” from a technical perspective performs worse than a simpler and rougher version. That is exactly why the analytics loop became necessary: it watches real results instead of my internal sense of pipeline beauty.
Problem 5. Any fully automated system must know how to degrade gracefully
There is no perfect detection, perfect tracking, or perfect moment selection. The question is not how to avoid ever making mistakes, but how to ensure the mistake does not destroy the entire release. That is why a large part of the system’s robustness is not accuracy in a vacuum, but competent fallback scenarios.
What was most surprising to me in this system
Probably the main unexpected conclusion was this: a significant part of “creative” content actually consists of repeatable, formalizable operations.
That does not mean taste, visual literacy, and a sense of rhythm are not important. On the contrary, they are. But it turned out that part of them can be transferred into a system of rules, constraints, priorities, scoring, and analytical feedback.
In other words, the task stops being magic and becomes engineering quality control under noisy data.
Where the system is still limited
It would be dishonest to pretend that a pipeline like this can already do everything.
It has clear weak spots:
complex scenes with fast action and chaotic motion;
moments where the meaning depends on long context rather than a short fragment;
scenes without pronounced facial focus;
cases where “virality” is determined by a very subtle cultural context rather than formal signals;
the risk of overfitting heuristics to one type of content or one audience.
And in my opinion, that is normal. A system should not pretend to be all-powerful. It is much more useful to understand where it works confidently and where it still needs improvement.
Where this architecture can be applied beyond anime
Although the project grew out of anime episodes, the architectural idea itself is not tied to that domain.
Essentially, it is a general template for any scenario where you have long-form source video and want to automatically produce short vertical clips:
streams and gaming broadcasts;
podcasts and interviews;
educational videos and lectures;
music content;
UGC platforms and media archives;
internal clip factories for content teams.
So the value here is not only in one specific channel, but in the approach itself: build not “one script for one video,” but a reproducible content-production line.
What came out at the end
At its current stage, the system already works as an autonomous loop capable of going through the main production steps without manual editing: from episode analysis to final clip render.
Some of the generated clips collected tens and even hundreds of thousands of views.

For me, that became an important validation not only of the product hypothesis, but also of the engineering one: a well-designed automated pipeline really can compete with manual production if it contains proper decision-making logic instead of random timeline slicing.
Even more importantly, this pipeline can be improved iteratively — not by intuition in a vacuum, but through measurable changes to individual modules.
Why I think systems like this will become a separate engineering direction
If you look more broadly, there is more and more long-form video around us, and the demand for short-form packaging keeps growing. At the same time, manual editing remains expensive, slow, and poorly scalable.
Against that backdrop, systems that can automatically:
analyze source media,
isolate potentially strong moments,
adapt framing to the required format,
package the result,
and close the loop through metrics,
will increasingly become an applied engineering problem rather than just a curious hobby.
In other words, this is no longer only about “content generation,” but about building automated media pipelines with observability, an R&D cycle, and quality control.
Conclusion
This project started as an experiment: is it even possible to partially automate a task that is usually considered almost entirely manual and creative?
Over time, it turned into something much more interesting — a system where content production is decomposed into engineering stages, and quality grows not only out of code, but also out of a feedback loop.
The main takeaway for me is this: automation in media is not just about saving time. It is a way to turn scattered creative operations into a reproducible production line that can be scaled, measured, debugged, and improved.
That is exactly the moment when a “channel with videos” stops being a set of random publications and becomes a system.
If there is interest, in the next article I can separately break down the technical details of one of the hardest modules — dynamic reframing / the “virtual camera”: how the focus area is selected, how movement is smoothed, which fallback modes are used, and where such algorithms most often break.
Video demonstration
If you would rather first see the system in action and only then go through the architecture layer by layer, I recorded a separate demo showing the entire pipeline: from episode processing to the final vertical clip.