Welcome to pipeline-flow!¶

pipeline-flow allows you to concetrate on the logic and structure of your data pipelines, abstracting away the complexities of data orchestration and management.

It is a lightweight, scalable and extensible platform designed for managing ELT, ETL and ETLT data pipelines. Being platform agnostic, it can run anywhere with a Python environment, offering a simple and yet flexible solution for managing data workflows.

Ideal for small to medium-size data workflows without the overhead of full-scale orchestration tools, pipeline-flow makes building data pipelines simple and accessible.

With its YAML-based configuration and plugin-based architecture, you can easily define and run data pipelines with minimal effort and maximum flexibility, as well as extend its functionality as needed. Whether you’re using Spark, Polaris or any other data processing engine, you can easily integrate it with pipeline-flow.

This documentation will guide you through the process of setting up and using pipeline-flow. If you have any questions or need help, please don’t hesitate to reach out to us.

Why¶

Managing many complex data workflows is a common challenge in data engineering. Although, traditional tools like Apache Luigi and Airflow are powerful, they often require complex setups, heavyweight infrastructure, and extensive maintenance paired with a steep learning curve.

What if you could have a simple, but yet flexible solution? That’s where pipeline-flow comes in. It follows a straightforward philisophy: define your pipelines and trigger them with minimal overhead using a simple YAML config!! Simple as that.

As the community grows, so does the number of plugins available, making it easy to use existing solutions or build your own!!

Key Features¶

Easy Configuration: Define pipelines with a simple YAML file (or in Python Code if you prefer).
Asynchronous Execution: Execute multiple tasks concurrently to reduce processing time by switching between tasks blocked by I/O.
Dynamic Plugin System: Extend the functionality by creating custom plugins or using existing ones.
Resilient and Fault-Tolerant: Automatically retries failed tasks and ensures data integrity. (Still in development)
Scalable Design: Can handle data workflows of any size, from small batch jobs to large-scale pipelines as long the right data processing engine is used.
Pipeline Dependency: Manage pipeline dependencies and their execution order.

Use Cases¶

pipeline-flow is ideal for:

Data Pipelines: Extract, transform, and load data into warehouses or lakes.
Workflow Automation: Automating data transformation and loading processes.
Data Lineage: Tracking data lineage across different phases of the pipeline, as well as the interactions between different pipelines.

How It Works¶

At a high level, pipeline-flow works as follows:

Define Pipelines: Specify the steps (e.g., extract, transform, load) using plugins.
Configure Settings: Customize behaviour with a YAML configuration file.
Run Asynchronously: Execute tasks in parallel to maximize efficiency.
Monitor and Adjust: Track progress, debug errors, and make runtime adjustments.

Getting Started¶

Ready to dive in? Check out the Quick Start guide to set up your first pipeline in just a few minutes!

Contents¶

Introduction

Advanced

Architecture

Examples

Changelog

Version history

Roadmap

Roadmap