Lingtrain Aligner. How to make parallel books for language learning. Part 1. Python and Colab version / Habr

title

If you're interested in learning new languages or teaching them, then you probably know such a way as parallel reading. It helps to immerse yourself in the context, increases the vocabulary, and allows you to enjoy the learning process. When it comes to reading, you most likely want to choose your favorite author, theme, or something familiar and this is often impossible if no one has published such a variant of a parallel book. It's becoming even worse when you're learning some cool language like Hungarian or Japanese.

Today we are taking a big step forward toward breaking this situation.

We will use the lingtrain_aligner tool. It's an open-source project on Python which aims to help all the people eager to learn foreign languages. It's a part of the Lingtrain project, you can follow us on Telegram, Facebook and Instagram. Let's start!

Find the texts

At first, we should find two texts we want to align. Let's take two editions of "To Kill a Mockingbird" by Harper Lee, in Russian and the original one.

The first lines of the found texts look like this:

TO KILL A MOCKINGBIRD
by Harper Lee

DEDICATION
for Mr. Lee and Alice
in consideration of Love & Affection
Lawyers, I suppose, were children once.

Charles Lamb

PART ONE
1
When he was nearly thirteen, my brother Jem got his arm badly
broken at the elbow. When it healed, and Jem’s fears of never being
able to play football were assuaged, he was seldom self-conscious about
his injury. His left arm was somewhat shorter than his right; when he
stood or walked, the back of his hand was at right angles to his body,
his thumb parallel to his thigh. He couldn’t have cared less, so long as
he could pass and punt.

...

Харпер Ли

Убить пересмешника

Юристы, наверно, тоже когда-то были детьми.
    Чарлз Лэм

ЧАСТЬ ПЕРВАЯ

1

Незадолго до того, как моему брату Джиму исполнилось тринадцать, у него

была сломана рука. Когда рука зажила и Джим перестал бояться, что не сможет
играть в футбол, он ее почти не стеснялся. Левая рука стала немного короче правой;
когда Джим стоял или ходил, ладонь была повернута к боку ребром. Но ему это
было все равно — лишь бы не мешало бегать и гонять мяч.

...

Extract parallel corpora

The first step is to make a parallel corpus from our texts. Is a serious task, mainly because of the following reasons:

Professional translators are kind of artists. They can translate one sentence as several and vice versa. They feel the language and can be very creative in the desire to convey the meaning.
Some parts of the translated text can be missing.
During extraction we need to save the paragraph structure somehow. Without it, we will not be able to create a solid and decorated book.

We will use the python lingtrain_aligner library which I'm developing. Under the hood, it uses machine learning models (sentence-transformers, LaBSE, and others). Such models will transform sentences into dense vectors or embeddings. Embeddings are a very interesting way to catch a sense contained in a sentence. We can calculate a cosine distance between the vectors and interpret it as semantic similarity. The most powerful (and huge) model is LaBSE by Google. It supports over 100 languages.

Before we feed our texts into the tool we need to prepare them.

Prepare the texts

Add the markup

I've made a simple markup language to extract the book structure right before the alignment. It's just a special kind of token that you need to add to the end of the sentence.

Token	Purpose	Mode
%%%%%title.	Title	Manual
%%%%%author.	Author	Manual
%%%%%h1. %%%%%h2. %%%%%h3. %%%%%h4. %%%%%h5.	Headings	Manual
%%%%%qtext.	Quote	Manual
%%%%%qname.	Text under the quote	Manual
%%%%%image.	Image	Manual
%%%%%translator.	Переводчик	Manual
%%%%%divider.	Divider	Manual
%%%%%.	New paragraph	Auto

New paragraph token

This kind of token will be placed automatically following the rule:

if the line ends with [.,:,!?] character and EOF (end of the line, '/n' char) we treat it as the end of the paragraph.

Text preprocessing

Delete unnecessary lines (publisher information, page numbers, notes, etc.).
Put labels for the author and the title.
Label the headings (H1 is the largest, H5 is the smallest). If the headers aren't needed, then just delete them.
Make sure that there are no lines in the text that end with a [.,:,!?] and aren't the end of a paragraph (otherwise the whole paragraph will be split into two parts).

Place the labels in accordance with the aforementioned rules. Empty lines do not play any role. You should get documents similar to these:

TO KILL A MOCKINGBIRD%%%%%title.
by Harper Lee%%%%%author.

Lawyers, I suppose, were children once.%%%%%qtext.
Charles Lamb%%%%%qname.

PART ONE%%%%%h1.
1%%%%%h2.

When he was nearly thirteen, my brother Jem got his arm badly
broken at the elbow. When it healed, and Jem’s fears of never being
able to play football were assuaged, he was seldom self-conscious about
his injury. His left arm was somewhat shorter than his right; when he
stood or walked, the back of his hand was at right angles to his body,
his thumb parallel to his thigh. He couldn’t have cared less, so long as
he could pass and punt.

...

Харпер Ли%%%%%author.
Убить пересмешника%%%%%title.

Юристы, наверно, тоже когда-то были детьми.%%%%%qtext.
Чарлз Лэм%%%%%qname.

ЧАСТЬ ПЕРВАЯ%%%%%h1.
1%%%%%h2.

Незадолго до того, как моему брату Джиму исполнилось тринадцать,
у него была сломана рука. Когда рука зажила и Джим перестал бояться,
что не сможет играть в футбол, он ее почти не стеснялся. Левая рука стала
немного короче правой; когда Джим стоял или ходил, ладонь была повернута к
боку ребром. Но ему это было все равно - лишь бы не мешало бегать и
гонять мяч.

...

Marked lines will be automatically extracted from the texts before alignment. They will be used when we will make a book.

Align texts

Colab

We will do the whole process in the Google Colab notebook, so everyone can do the same for free without installing anything on his or her machine. I've prepared the Colab, it contains all the instructions.

Colab version

Meanwhile, we will observe the alignment process in more detail.

Details

After installing the tool with this command:

pip install lingtrain-aligner

We are loading our texts, adding the paragraph tokens and splitting them into the sentences:

from lingtrain_aligner import preprocessor, splitter, aligner, resolver, reader, vis_helper

text1_input = "harper_lee_ru.txt"
text2_input = "harper_lee_en.txt"

with open(text1_input, "r", encoding="utf8") as input1:
text1 = input1.readlines()

with open(text2_input, "r", encoding="utf8") as input2:
text2 = input2.readlines()

db_path = "book.db"

lang_from = "ru"
lang_to = "en"

models = ["sentence_transformer_multilingual", "sentence_transformer_multilingual_labse"]
model_name = models[0]

text1_prepared = preprocessor.mark_paragraphs(text1)
text2_prepared = preprocessor.mark_paragraphs(text2)

splitted_from = splitter.split_by_sentences_wrapper(text1_prepared , lang_from, leave_marks=True)
splitted_to = splitter.split_by_sentences_wrapper(text2_prepared , lang_to, leave_marks=True)

_dbpath is the heart of the alignment. It's an SQLite database that will hold information about the alignment and document structure. Also, note that we provided the language codes ("en" and "ru"). This means that some language-specific rules will be applied during the splitting. You can find all supported languages with this command:

splitter.get_supported_languages()

If your language is not here make an issue on GitHub or write in our group in telegram. You can also use the "xx" code to use some base rules for your text.

Now, when texts are split, let's load them into the database:

aligner.fill_db(db_path, splitted_from, splitted_to)

Primary alignment

Now we will align the texts. It's a batched process. Parts of texts size of batch_size with some extra lines size of window will align together with the primary alignment algorithm.

batch_ids = [0,1,2,3]

aligner.align_db(db_path,
                model_name,
                batch_size=100,
                window=30,
                batch_ids=batch_ids,
                save_pic=False,
                embed_batch_size=50,
                normalize_embeddings=True,
                show_progress_bar=True
                )

Let's see the result of the primary alignment. vis_helper helps us to plot the alignment structure:

vis_helper.visualize_alignment_by_db(db_path,
                                    output_path="alignment_vis.png",
                                    lang_name_from=lang_from,
                                    lang_name_to=lang_to,
                                    batch_size=400,
                                    size=(800,800),
                                    plt_show=True
                                    )

Not bad. But there are a lot of conflicts. Why? Consider the following reasons:

Model has too many good variants. If the line is short (some kind of a chat or just a name) the model can find another similar line in the window and take it as a suitable choice.
The right line is not in the search interval. Texts have different counts of sentences and the "alignment axis" can go beyond the window.

To handle the second problem you can use the shift parameter.

And to handle conflicts there is a special module called resolver.

Resolve the conflicts

We can observe all the found conflicts using the following command:

conflicts_to_solve, rest = resolver.get_all_conflicts(db_path, min_chain_length=2, max_conflicts_len=6)

conflicts to solve: 46
total conflicts: 47

And some statistics:

resolver.get_statistics(conflicts_to_solve)
resolver.get_statistics(rest)

The most frequent conflicts are the size of '2:3' and '3:2'. It means that one of the sentences here was translated as two or vise versa.

resolver.show_conflict(db_path, conflicts_to_solve[10])

124 Дом Рэдли стоял в том месте, где улица к югу от нас описывает крутую дугу.
125 Если идти в ту сторону, кажется, вот—вот упрешься в их крыльцо.
126 Но тут тротуар поворачивает и огибает их участок.

122 The Radley Place jutted into a sharp curve beyond our house.
123 Walking south, one faced its porch; the sidewalk turned and ran beside the lot.

The most successful strategy that came into my mind is to resolve the conflicts iteratively. From smaller to bigger.

steps = 3
batch_id = -1  #all batches

for i in range(steps):
    conflicts, rest = resolver.get_all_conflicts(db_path, min_chain_length=2+i, max_conflicts_len=6*(i+1), batch_id=batch_id)

    resolver.resolve_all_conflicts(db_path, conflicts, model_name, show_logs=False)

    vis_helper.visualize_alignment_by_db(db_path, output_path="img_test1.png", batch_size=400, size=(800,800), plt_show=True)

    if len(rest) == 0:
        break

Visualization after the first step:

And after the second:

Great! Now our book.db file holds the aligned texts along with the structure of the book (thanks to markup).

Create a book

The module called reader will help us to create a book.

from lingtrain_aligner import reader

paragraphs_from, paragraphs_to, meta = reader.get_paragraphs(db_path, direction="from")

With the direction parameter ["from", "to"] you can choose which paragraph structure is needed (the first text or the second).

Let's create it:

reader.create_book(paragraphs_from,
                    paragraphs_to,
                    meta,
                    output_path = "lingtrain.html"
                    )

And will see this as input:

It's a simple styled HTML page. I've added some styles to make it even useful for language learners! It's a template parameter.

reader.create_book(paragraphs_from, 
                    paragraphs_to,
                    meta,
                    output_path = "lingtrain.html",
                    template="pastel_fill"
                    )

reader.create_book(paragraphs_from,
                    paragraphs_to,
                    meta,
                    output_path = f"lingtrain.html",
                    template="pastel_start"
                    )

Custom styles

You can even use your own style. For example, let's highlight all even sentences in the book:

my_style = [
    '{}',
    '{"background": "#fafad2"}',
    ]

reader.create_book(paragraphs_from,
                    paragraphs_to,
                    meta,
                    output_path = f"lingtrain.html",
                    template="custom",
                    styles=my_style
                    )

You can use any applicable to span CSS styles:

my_style = [
    '{"background": "linear-gradient(90deg, #FDEB71 0px, #fff 150px)", "border-radius": "15px"}',
    '{"background": "linear-gradient(90deg, #ABDCFF 0px, #fff 150px)", "border-radius": "15px"}',
    '{"background": "linear-gradient(90deg, #FEB692 0px, #fff 150px)", "border-radius": "15px"}',
    '{"background": "linear-gradient(90deg, #CE9FFC 0px, #fff 150px)", "border-radius": "15px"}',
    '{"background": "linear-gradient(90deg, #81FBB8 0px, #fff 150px)", "border-radius": "15px"}'
    ]

reader.create_book(paragraphs_from,
                    paragraphs_to,
                    meta,
                    output_path = f"lingtrain.html",
                    template="custom",
                    styles=my_style
                    )

I hope it will be helpful for all who love languages. Have fun!

Next time we will discuss multilingual books creation and use the UI tool which I'm working on. Stay tuned.

To be continued

It is an open-source project. You can take a part in it and find the code on ours github page. Today's Colab is here.

You can also support the project by making a donation.

Lingtrain Aligner. How to make parallel books for language learning. Part 1. Python and Colab version