Airbyte - Data Ingestion

Self-hosted Airbyte on AWS for extracting data from Pixlr (MongoDB) and Designs.ai (MySQL) into ClickHouse.

Overview

We use self-hosted Airbyte instead of the hosted version to avoid volume-based pricing and to retain full control over our data pipeline. This document covers setup and day-to-day management of that deployment.

Prerequisites

  • AWS account with appropriate permissions
  • Domain name: airbyte.pixlr.to
  • Docker and Docker Compose installed
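
A quick way to confirm the prerequisites are in place on the EC2 host (a sketch; dig ships with bind-utils on Amazon Linux):

# Verify Docker and Docker Compose
docker --version
docker compose version

# Verify the AWS credentials/permissions in use
aws sts get-caller-identity

# Verify the Airbyte domain resolves
dig +short airbyte.pixlr.to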

Deployment Options

Option 1: EC2 with abctl (Current Setup)

For our current production setup, we run Airbyte on an EC2 instance managed with abctl, with automated instance scheduling to optimize costs.

Key Features:

  • Automated instance scheduling
  • Cost-effective solution
  • Simplified management with abctl

For detailed setup instructions, see EC2 with abctl (Current Setup).
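
The scheduling itself can be as simple as stop/start calls against the instance, triggered by EventBridge Scheduler or a cron job on another host. A minimal sketch with a placeholder instance ID:

# Placeholder instance ID; replace with the Airbyte EC2 instance
INSTANCE_ID=i-0123456789abcdef0

# Stop the instance outside working hours to save cost
aws ec2 stop-instances --instance-ids "$INSTANCE_ID"

# Start it again before the first scheduled sync
aws ec2 start-instances --instance-ids "$INSTANCE_ID"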

Option 2: ECS Fargate

  • Pros: Auto-scaling, managed infrastructure
  • Cons: More complex setup, higher cost

Airbyte Configuration

1. Docker Setup

# Install Docker
sudo yum install -y docker

# Add ec2-user to docker group
sudo usermod -aG docker ec2-user

# Start and enable Docker
sudo systemctl start docker
sudo systemctl enable docker

# Log out and log back in for group changes to take effect
exit
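
After reconnecting, confirm the group change took effect and Docker runs without sudo:

# Both should succeed without sudo
docker ps
docker info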

Reference: Using an EC2 Instance with abctl | Airbyte Docs

For more information about deploying Airbyte, see the official Airbyte documentation.

2. Install Airbyte using abctl

Initial Installation

# Install abctl (Airbyte CLI)
curl -LsfS https://get.airbyte.com | bash -

export PATH=$HOME/.airbyte/abctl:$PATH
echo 'export PATH=$HOME/.airbyte/abctl:$PATH' >> ~/.bashrc
source ~/.bashrc

# Verify installation
abctl --version

Configure and Install Airbyte

# Set hostname
export AIRBYTE_HOSTNAME=$(curl -s http://169.254.169.254/latest/meta-data/public-hostname)

# Install with insecure cookies (for HTTP)
abctl local install --host $AIRBYTE_HOSTNAME --insecure-cookies

Verify Installation

# Check Kubernetes pods
export KUBECONFIG=$HOME/.airbyte/abctl/abctl.kubeconfig
kubectl get pods -n airbyte-abctl

# Check services
kubectl get svc -n airbyte-abctl

Start Port Forwarding

# Run in a screen/tmux session for persistence
kubectl port-forward --address 0.0.0.0 svc/airbyte-abctl-airbyte-server-svc 8000:8001 -n airbyte-abctl
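
For example, to keep the forward running in a detached screen session (assuming screen is installed; reattach later with screen -r airbyte-forward):

# Run the port-forward detached; it inherits KUBECONFIG from this shell
sudo yum install -y screen
screen -dmS airbyte-forward kubectl port-forward --address 0.0.0.0 \
  svc/airbyte-abctl-airbyte-server-svc 8000:8001 -n airbyte-abctl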

3. Configure and Install Airbyte (Custom Domain)

# Create data directory with permissions
mkdir -p ~/.airbyte
sudo chown -R 1000:1000 ~/.airbyte
sudo chmod -R 755 ~/.airbyte

# Install with the domain name
abctl local install --host airbyte.pixlr.to

# Get login credentials
abctl local credentials

4. Access Airbyte Web UI

Current Setup (via airbyte.pixlr.to)

  • Access Method: Port forwarding via SSH
  • URL: http://<my-ec2-ip>:8000
  • Security:
    • Port 8000 exposed only to your IP in the EC2 security group (see the example below)
    • Basic authentication only
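
Restricting the port to a single source IP can be done with the AWS CLI; a sketch with placeholder group ID and IP:

# Allow only one source IP to reach the Airbyte UI port
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 8000 \
  --cidr 203.0.113.10/32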

Cloudflare Tunnel Setup

For Cloudflare Tunnel configuration, follow the instructions in airbyte-cloudflare-setup.md.
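
At a high level, the tunnel maps the public hostname to the local Airbyte port. A minimal sketch with cloudflared (the tunnel name and local port are assumptions; the linked doc is authoritative):

# Authenticate and create a named tunnel
cloudflared tunnel login
cloudflared tunnel create airbyte

# Point the domain at the tunnel
cloudflared tunnel route dns airbyte airbyte.pixlr.to

# Run the tunnel, forwarding traffic to the local Airbyte port
cloudflared tunnel run --url http://localhost:8000 airbyte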

Authentication & Access

Initial Login Credentials

After installation, retrieve credentials using:

abctl local credentials

Example output:

Email: [not set]
Password: cTG45kaMtgl2yLL5NUhmEYKvjSJxXqgU
Client-Id: 63f35ea2-84e4-4863-ab32-7d78a2bebed5
Client-Secret: KgCD0IgRq7tl3qq8sEQYyukqaDK8IxMj
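
The Client-Id/Client-Secret pair can be exchanged for a short-lived API access token; a sketch based on Airbyte's API-access docs (the endpoint path may differ between versions):

# Request an access token using the application credentials
curl -s -X POST "http://airbyte.pixlr.to/api/v1/applications/token" \
  -H "Content-Type: application/json" \
  -d '{"client_id": "<Client-Id>", "client_secret": "<Client-Secret>"}'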

Workspace Configuration

Environment Variables

For our Airbyte setup, we've configured the following environment variables:

# In ~/.bashrc or ~/.bash_profile
export AIRBYTE_HOST=airbyte.pixlr.to
export AIRBYTE_PORT=8000

Workspace Settings

  1. Workspace Name: Pixlr Analytics
  2. Default Geography: AWS (North Virginia us-east-1)
  3. Email Notifications: Enabled for pipeline failures

User Management

Admin Users:

  • admin@pixlr.to (Owner)
  • itadmin@pixlr.com (Admin)

Data Sources

Pixlr (MongoDB)

  • Collections: users, sessions, images, subscriptions, events
  • Sync Frequency: Every 15 minutes
  • Sync Type: Incremental
  • Connection Details:
    Host: [MongoDB Host]
    Port: 27017
    Database: [Database Name]
    Authentication: Username/Password
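
Before configuring the source in Airbyte, connectivity can be verified from the EC2 host (a sketch; requires mongosh and the real values for the placeholders):

# Ping MongoDB with the same credentials Airbyte will use
mongosh "mongodb://<username>:<password>@<mongodb-host>:27017/<database>" \
  --eval "db.runCommand({ ping: 1 })"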

Designs.ai (MySQL)

  • Tables: users, projects, subscriptions, usage_logs, billing
  • Sync Frequency: Every 30 minutes
  • Sync Type: Incremental
  • Connection Details:
    Host: [MySQL Host]
    Port: 3306
    Database: [Database Name]
    Authentication: Username/Password
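
The same kind of pre-flight check works for MySQL (placeholders as above):

# Confirm the MySQL server accepts the Airbyte credentials
mysql -h <mysql-host> -P 3306 -u <username> -p -e "SELECT 1;"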

Destination

  • ClickHouse Hosted Service
  • Database: analytics
  • Connection: Secure connection from AWS to ClickHouse
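
Connectivity to the destination can be checked over ClickHouse's HTTP interface (a sketch; assumes the hosted service's standard HTTPS port 8443 and placeholder credentials):

# A successful response returns 1
curl --user "default:<password>" \
  "https://<clickhouse-host>:8443/?query=SELECT%201"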

Common Issues & Resolutions

Port Conflicts

# Check for processes using port 8000
sudo lsof -i :8000

# If needed, use alternative port
kubectl port-forward --address 0.0.0.0 svc/airbyte-abctl-airbyte-server-svc 8080:8001 -n airbyte-abctl

Reinstallation

# Uninstall existing installation
abctl local uninstall --persisted

# Clean up
sudo rm -rf ~/.airbyte
sudo rm -rf ~/.kube

# Reinstall
abctl local install --host $AIRBYTE_HOSTNAME --insecure-cookies

Pod Issues

# Check pod status
kubectl get pods -n airbyte-abctl

# View logs for a specific pod
kubectl logs -f <pod-name> -n airbyte-abctl

# Delete and restart a problematic pod
kubectl delete pod <pod-name> -n airbyte-abctl

Maintenance & Troubleshooting

Common Issues

  1. Permission Denied Errors

    # Fix permissions
    sudo chown -R 1000:1000 ~/.airbyte
    sudo chmod -R 755 ~/.airbyte
  2. Reset Admin Password

    # Get container ID
    docker ps | grep airbyte-server

    # Reset password
    docker exec -it [CONTAINER_ID] reset-password

Backup & Restore

To backup Airbyte configuration and data:

# Backup volumes
docker run --rm -v airbyte_workspace:/source -v $(pwd):/backup alpine tar czf /backup/airbyte_backup_$(date +%Y%m%d).tar.gz -C /source .

# Restore from backup
docker run --rm -v airbyte_workspace:/target -v $(pwd):/backup alpine sh -c "cd /target && tar xzf /backup/airbyte_backup_20230930.tar.gz"
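
Backups are more useful off the instance; a sketch that ships the archive to a hypothetical S3 bucket:

# Copy the latest archive to S3 (bucket name is a placeholder)
aws s3 cp "airbyte_backup_$(date +%Y%m%d).tar.gz" \
  s3://pixlr-airbyte-backups/airbyte/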

Key Benefits

  • Cost Control: No per-data pricing
  • Data Control: Full control over data processing
  • Flexibility: Custom connectors and transformations
  • Security: Data stays within our infrastructure

Next Steps