The Rules for Data Processing Pipeline Builders / Comments / Habr

VlK Dec 31 2020 at 13:36

Yes, this is very similar to the core ideas of functional programming. Thinking about it now, I would probably also add transformation purity as a rule.

If the move is executed only as the last step of a transformation, it will be fine. Relying on the atomicity of mv and on intermediate temporary files is a very popular pattern in POSIX-compatible OSes (Linux, BSD, etc.). It also works in HDFS. I know that there are many other definitions of atomicity (like atomic types and functions in ISO C11). Anyway, the point is that the final mv (the underlying syscall is rename) is part of the transformation. Besides, I wasn't talking about a particular trick for imitating transactions; it was more about how a transformation should work.
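To make the temp-file-then-rename pattern concrete, here is a minimal Python sketch. It is only an illustration, not our actual code: the function name atomic_write and the ".tmp" suffix are made up for this example. It assumes the temporary file is created in the same directory as the target, because rename is only atomic within one filesystem.

```python
import os
import tempfile


def atomic_write(final_path, chunks):
    """Write chunks to a temp file, then rename it over final_path.

    Hypothetical helper for illustration; readers of final_path see
    either the old complete file or the new complete file.
    """
    target_dir = os.path.dirname(os.path.abspath(final_path))
    # Create the temp file next to the target so both are on the same
    # filesystem; a rename across filesystems is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
            # Push buffered data to disk before publishing the file.
            tmp.flush()
            os.fsync(tmp.fileno())
        # Last step of the transformation: publish the result.
        os.replace(tmp_path, final_path)
    except BaseException:
        # On any failure, leave final_path untouched and clean up.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the producer of chunks raises halfway through, the exception propagates and the target is never replaced, so a later step cannot pick up a partial file under the final name. This only covers failures the transformation itself notices; it does not tell the step whether its input was complete.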
Our transformations typically look like a thin shell wrapper around a program (Python, Java/Spark, shell scripts calling into databases or HDFS) that is executed by a driver machine. Every transformation can fail, succeed or abort the transformation chain. We compose these transformations using our custom workflow manager (somewhat similar to Apache Airflow) that manages transformation attempts, restarts, etc.
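A rough sketch of the fail / succeed / abort idea, in Python. This is not our workflow manager, just a toy under stated assumptions: each step is a command line, exit code 0 means success, a hypothetical exit code 75 means "abort the chain", and any other code means a failure that is worth retrying. The names Outcome, ABORT_EXIT_CODE, run_step and run_chain are invented for this example.

```python
import subprocess
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"  # worth another attempt
    ABORT = "abort"      # stop the whole chain, do not retry


# Hypothetical convention for this sketch only.
ABORT_EXIT_CODE = 75


def run_step(argv):
    """Run one transformation and map its exit code to an Outcome."""
    code = subprocess.run(argv).returncode
    if code == 0:
        return Outcome.SUCCESS
    if code == ABORT_EXIT_CODE:
        return Outcome.ABORT
    return Outcome.FAILURE


def run_chain(steps, max_attempts=3):
    """Run steps in order; return True only if every step succeeded."""
    for argv in steps:
        outcome = Outcome.FAILURE
        for _ in range(max_attempts):
            outcome = run_step(argv)
            if outcome is not Outcome.FAILURE:
                break  # success or abort: stop retrying this step
        if outcome is not Outcome.SUCCESS:
            return False  # aborted, or attempts exhausted
    return True
```

Retrying a step like this is only safe if the step is idempotent and publishes its output atomically, which is why those properties matter to whatever runs the chain.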
Easy. Purity, atomicity and idempotence are just nice properties to have. But similar to how people like Haskell at first and then realise that some simple things are just too hard in it, in data processing pipelines there are many cases where we just can't have them. Examples: buffer flushing, putting some kind of state aside, all kinds of hidden global state...

Comments 2

ynikitenko Dec 30 2020 at 21:18

Thanks for the post. The idea of many small steps is well known and is at the core of functional programming. I have several questions though.
1) In the Atomic steps section you give an example where «the data will only be partially transformed, and further pipeline steps will have no way of knowing that. At the end of the pipe, you’ll only get partial data.» mv doesn't help in this case, because it doesn't know whether the tmp file is complete or not. I think atomicity is something different, and it is difficult to achieve if you accept any series of data (5 or 500 comments are both fine) and your scripts are really independent (they don't know about each other).
2) Is your whole pipeline really a set of scripts and mv, ln? Why don't you do that completely in Python?
3) «does help to identify their applicability limits, and to step over them if necessary.» — why don't you expand on that? :)