
Python

Interpreted high-level programming language for general-purpose programming


How we tackled document recognition issues for autonomous and automatic payments using OCR and NER

Reading time: 5 min
Views: 1.3K

In this article, I would like to describe how we tackled the named entity recognition (NER) problem at Sber with the help of advanced AI techniques. NER is one of many natural language processing (NLP) tasks; it lets you automatically extract data from unstructured text, such as monetary values, dates, names, surnames and positions.

Just imagine the countless textual documents that even a medium-sized organisation deals with on a daily basis, let alone huge corporations. Take Sber, for example: it is the largest financial institution in Russia and Central and Eastern Europe, with about 16,500 offices, over 250,000 employees, and 137 million retail and 1.1 million corporate clients in 22 countries. As you can imagine, at such an enormous scale the company collaborates with hundreds of suppliers, contractors and other counterparties, which implies thousands of contracts. For instance, the estimated number of legal documents to be processed in 2022 was over 65,000, each consisting of 30 pages on average. During its lifecycle, a contract is usually updated with 3 to 5 additional agreements. On top of this, a contract is accompanied by various source documents describing transactions. And all of it in PDF format, too.

Previously, the processing duty fell to our service centre's employees, who checked whether the payment details in a bill matched those in the contract and then sent the document to the Accounting Department, where an accountant double-checked everything. This is quite a long journey to a payment, right?
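To make the extraction and matching task above concrete, here is a minimal sketch. It is a rule-based regex baseline, not the NER model the article describes, and every name in it (`MONEY_RE`, `DATE_RE`, `extract_entities`, `amounts_match`), the currency codes and the sample texts are assumptions made for illustration.

```python
import re

# Hypothetical patterns: a rule-based baseline, not the model from the article.
MONEY_RE = re.compile(r"\b\d{1,3}(?:,\d{3})*(?:\.\d{2})?\s?(?:RUB|USD|EUR)\b")
DATE_RE = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")


def extract_entities(text: str) -> dict[str, list[str]]:
    """Return the monetary values and dates found in free text."""
    return {
        "money": MONEY_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }


def amounts_match(bill_text: str, contract_text: str) -> bool:
    """True when every amount in the bill also appears in the contract."""
    bill = set(extract_entities(bill_text)["money"])
    contract = set(extract_entities(contract_text)["money"])
    return bool(bill) and bill <= contract


# Invented sample documents, for illustration only.
contract = "Contract dated 15.03.2022: total 1,250,000.00 RUB, advance 300,000 RUB."
bill = "Please pay the advance of 300,000 RUB by 01.04.2022."

contract_entities = extract_entities(contract)
bill_entities = extract_entities(bill)
is_match = amounts_match(bill, contract)
print(contract_entities, bill_entities, is_match)
```

Fixed patterns like these break as soon as the wording or number format changes, which is one reason a learned NER model is attractive for real contracts.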

Read more

Django admin dynamic Inline positioning

Reading time: 5 min
Views: 12K

Recently I've received an interesting request from a client about one of our Django projects.
He asked if it would be possible to show an inline component above other fields in the Django admin panel.

At first I thought there shouldn't be any issue with that.
However, there was no easy solution other than installing another battery into the project. My gut feeling told me there was another way around the problem.

Read more about ModelAdmin with Inlines

Stop losing clients! Or how a developer can test a website, by the example of PVS-Studio. Part 1

Reading time: 15 min
Views: 1.1K

A website with bugs can be a real pain in the neck for a business. Just one 404 or 500 error can end up costing the company an obscene amount of money and hurt a good reputation. But there is a way to avoid this: website testing. That's what this article is about. After reading it, you will know how to test code in Django, create your "own website tester" and much more. Welcome to the article.

Read more

Easy concurrency with Python Shared Object

Reading time: 23 min
Views: 9K

Project repository.
A year-old article about the general concepts of the project.


So you want to build a multitasking system using Python? But you hesitate, because you know you'll have to use either the multiprocessing module, which is slow and/or somewhat inconvenient, or a more powerful external tool like Redis or RabbitMQ, or even a large DBMS like MongoDB or PostgreSQL, which require some glue (i.e. code very far from native Python) and impose their own restrictions on what you can do with your data. If you think «why do I need so much hassle if I just want to run a few worker threads in Python using the data structures I already have in my program and the functions I've already written? I just want to run this code in threads! Oh, I wish there was no GIL in Python», then welcome to the club.


Of course many of us can build from scratch a decent tool that would make use of multiple cores. However, having already existing working software (Pandas, Tensorflow, SciPy, etc) is always cheaper than any development of new software. But the status quo in CPython tells us one thing: you cannot remove GIL because everything is based on GIL. Although making shit into gold could require much work, the ability to alleviate the transition from slow single-threaded shit to a slow not-so-single-threaded gold-looking shit might be worth it, so you won't have to rewrite your whole system from scratch.
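The wish described above, running existing functions in a few worker threads over ordinary Python data structures, can be sketched with the standard library alone. This is not the API of the project the article presents; the `count_words` helper and the sample data are assumptions for illustration. In a standard CPython build the GIL still serializes bytecode execution, so a sketch like this shares data conveniently but does not speed up CPU-bound work.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# An ordinary Python dict shared by all worker threads.
word_counts: dict[str, int] = {}
lock = threading.Lock()


def count_words(line: str) -> None:
    """Count the words in one line and merge the result into the shared dict."""
    local: dict[str, int] = {}
    for word in line.split():
        local[word] = local.get(word, 0) + 1
    # Guard the shared dict so concurrent read-modify-write updates do not interleave.
    with lock:
        for word, n in local.items():
            word_counts[word] = word_counts.get(word, 0) + n


lines = ["a b a", "b c", "a c c"]
with ThreadPoolExecutor(max_workers=3) as pool:
    # list() waits for all tasks and re-raises any exception from a worker.
    list(pool.map(count_words, lines))

print(word_counts)
```

Each worker counts into a private dict first and takes the lock only for the short merge step, which keeps contention on the shared structure low.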


Read more →

We have published a model for text repunctuation and recapitalization for four languages

Reading time: 7 min
Views: 7.5K




Working with speech recognition models, we often encounter misconceptions among potential customers and users (mostly because people have a hard time distinguishing substance from form). People also tend to believe that punctuation marks and spaces are somehow obviously present in spoken speech, when in fact real spoken speech and written speech are entirely different beasts.


Of course you can just start each sentence with a capital letter and put a full stop at the end. But it is preferable to have some relatively simple and universal solution for "restoring" punctuation marks and capital letters in sentences that our speech recognition system generates. And it would be really nice if such a system worked with any texts in general.


For this reason, we would like to share a system that:


  • Inserts capital letters and basic punctuation marks (dot, comma, hyphen, question mark, exclamation mark, dash for Russian);
  • Works for 4 languages (Russian, English, German, Spanish) and can be extended;
  • By design is domain agnostic and is not based on any hard-coded rules;
  • Has non-trivial metrics and succeeds in the task of improving text readability;

To reiterate — the purpose of such a system is only to improve the readability of the text. It does not add information to the text that did not originally exist.
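For contrast, the naive baseline mentioned above (capitalize the first letter, put a full stop at the end) fits in a few lines. This is only that rule of thumb, not the published model; the function name `naive_repunctuate` is an assumption for illustration.

```python
def naive_repunctuate(text: str) -> str:
    """Capitalize the first letter and end the text with a full stop."""
    words = text.split()
    if not words:
        return ""
    # Collapse runs of whitespace, as raw recognizer output may contain them.
    sentence = " ".join(words)
    sentence = sentence[0].upper() + sentence[1:]
    # Add a full stop only if the text does not already end a sentence.
    if sentence[-1] not in ".?!":
        sentence += "."
    return sentence


restored = naive_repunctuate("how are you doing today")
print(restored)
```

The result here is "How are you doing today." with a full stop where a question mark belongs, and the rule cannot place commas or find sentence boundaries inside a longer text. Those are the gaps a trained model is meant to close.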

Read more →

Mode on: Comparing the two best colorization AIs

Reading time: 11 min
Views: 4.2K

This article continues a series of notes about colorization. During today's experiment, we’ll be comparing a recent neural network with the good old Deoldify to gauge the rate at which the future is approaching.

This is a practical project, so we won't pay extra attention to the underlying philosophy of the Transformer architecture. Besides, any attempt to explain its operating principles to a wide audience in hand-waving terms would be misleading.

A lecturer: Mr. Petrov! How does a transformer work?
Petrov with a bass voice: Hum-m-m-m.


Google Colorizing Transformer vs Deoldify

Read more →

Data Phoenix Digest — 01.07.2021

Reading time: 5 min
Views: 2K

We at Data Science Digest have always strived to ignite the fire of knowledge in the AI community. We're proud to have helped thousands of people learn something new and to have given you the tools to push ahead. And we've not been standing still, either.

Please meet Data Phoenix, a Data Science Digest rebranded and risen anew from our own flame. Our mission is to help everyone interested in Data Science and AI/ML to expand the frontiers of knowledge. More news, more updates, and webinars(!) are coming. Stay tuned!

The new issue of the new Data Phoenix Digest is here! AI that helps write code, EU’s ban on biometric surveillance, genetic algorithms for NLP, multivariate probabilistic regression with NGBoosting, alias-free GAN, MLOps toys, and more…

If you're more used to getting updates every day, subscribe to our Telegram channel or follow us on social media: Twitter, Facebook.

Read more

DataScience Digest — 24.06.21

Reading time: 5 min
Views: 2K

The new issue of DataScienceDigest is here!

The impact of NLP and the growing budgets to drive AI transformations. How Airbnb standardized metric computation at scale. Cross-Validation, MASA-SR, AgileGAN, EfficientNetV2, and more.

If you’re more used to getting updates every day, subscribe to our Telegram channel or follow us on social media: Twitter, LinkedIn, Facebook.

Read more

You are standing at a red light at an empty intersection. How to make traffic lights smarter?

Reading time: 14 min
Views: 2.3K

Types of smart traffic lights: adaptive and neural networks

Adaptive control works at relatively simple intersections, where the rules and options for switching phases are fairly obvious. It is only applicable where the load is not constant in all directions; otherwise there is simply nothing to adapt to, because there are no free time windows. The first adaptively controlled intersections appeared in the United States in the early 1970s. Unfortunately, they have reached Russia only now, and by some estimates there are no more than 3,000 of them in the country.

Neural networks are a higher level of traffic regulation. They take into account many factors at once, not all of which are obvious. Their result is based on self-learning: the computer receives live data on throughput and searches across all available algorithms for the maximum, so that in total as many vehicles as possible pass from all directions per unit of time in a comfortable mode. Asked how this is done, programmers usually answer: we do not know, the neural network is a black box. But we will reveal the basic principles to you…

Adaptive traffic lights, at least those from the leading companies in Russia, use rather outdated technology for counting vehicles at intersections: physical sensors or a video background detector. A capacitive sensor or an induction loop sees a vehicle only at the installation site, within a few meters, unless of course you spend millions laying them along the entire length of the roadway. The video background detector shows only how much of the roadway is filled with vehicles. The camera must see this area clearly, which is quite difficult at long distances because of perspective, and it is highly susceptible to atmospheric interference: even a light snowstorm will be registered as traffic, since the background video detector does not distinguish what it has detected.
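The point that adaptive control has nothing to adapt to under constant load in all directions can be illustrated with a toy model. The proportional rule below is a hypothetical simplification, not the algorithm of any real controller; the function name, the 60-second cycle and the 10-second minimum green are assumptions.

```python
def split_green_time(
    queues: dict[str, int], cycle_s: int = 60, min_green_s: int = 10
) -> dict[str, float]:
    """Share one signal cycle among approaches in proportion to queued vehicles."""
    reserved = min_green_s * len(queues)
    if reserved > cycle_s:
        raise ValueError("cycle is too short for the minimum green times")
    spare = cycle_s - reserved  # seconds left to hand out adaptively
    total = sum(queues.values())
    if total == 0:
        # No detected demand: fall back to an equal split.
        return {name: cycle_s / len(queues) for name in queues}
    return {name: min_green_s + spare * q / total for name, q in queues.items()}


# Uneven demand: the busier approach gets the spare time.
uneven = split_green_time({"north_south": 12, "east_west": 4})
# Equal, saturated demand: the result is the same as a fixed equal split.
saturated = split_green_time({"north_south": 20, "east_west": 20})
print(uneven, saturated)
```

With uneven queues the busier approach receives 40 of the 60 seconds. With equal saturated queues the split is 30/30, identical to a fixed-time plan, so the adaptive logic contributes nothing.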

Read more

Data Science Digest — 21.04.21

Reading time: 3 min
Views: 1.1K

Hi All,

I'm pleased to invite you all to enroll in the Lviv Data Science Summer School, to delve into advanced methods and tools of Data Science and Machine Learning, including such domains as CV, NLP, Healthcare, Social Network Analysis, and Urban Data Science. The courses are practice-oriented and are geared towards undergraduates, Ph.D. students, and young professionals (intermediate level). The school runs July 19–30 and will be hosted online. Make sure to apply: spots are filling up fast!

If you’re more used to getting updates every day, follow us on social media:

Telegram
Twitter
LinkedIn
Facebook

Regards,
Dmitry Spodarets.

Read more

Neural network Telegram bot with StyleGAN and GPT-2

Reading time: 3 min
Views: 5.5K

The Beginning


So we have already played with different neural networks. Cursed image generation using GANs, deep texts from GPT-2 — we have seen it all.


This time I wanted to create a neural entity that would act like a beauty blogger. This meant it would have to post pictures like Instagram influencers do and generate the same kind of narcissistic texts.


Initially I planned to post the neural content on Instagram, but using the Facebook Graph API, which is needed to go beyond read-only access, was too painful for me. So I reverted to Telegram, which is one of my favorite social products overall.


The name of the entity/channel (Aida Enelpi) is a bad neural-oriented pun mostly generated by the bot itself.


One of the first posts generated by Aida

Read more →

Data Science Digest — We Are Back

Reading time: 5 min
Views: 1.2K

Hi All,

I have some good news for you…

Data Science Digest is back! We've been "offline" for a while, but no worries: you'll receive regular digest updates with top news and resources on AI/ML/DS every Wednesday, starting today.

If you’re more used to getting updates every day, follow us on social media:

Telegram - https://t.me/DataScienceDigest
Twitter - https://twitter.com/Data_Digest
LinkedIn - https://www.linkedin.com/company/data-science-digest/
Facebook - https://www.facebook.com/DataScienceDigest/

And finally, your feedback is very much appreciated. Feel free to share any ideas with me and the team, and we’ll do our best to make Data Science Digest a better place for all.

Regards,
Dmitry Spodarets.

Read more

HDB++ TANGO Archiving System

Reading time: 3 min
Views: 1.3K

What is HDB++?


HDB++ is a TANGO archiving system that lets you save data received from devices in the TANGO system.


This article describes working on Linux (TangoBox 9.3, based on Ubuntu 18.04), a ready-made system where everything is already configured.


What is the article about?


  • System architecture.
  • How to set up archiving.

It took me about 2 weeks to understand the architecture and write my own Python scripts for this case.


What is it for?


It lets you store the history of your equipment's readings.


  • You don't need to think about how to store data in the database.
  • You just need to specify which attributes to archive from which equipment.
Read more →
