Airbyte - Data Ingestion
Self-hosted Airbyte on AWS for extracting data from Pixlr (MongoDB) and Designs.ai (MySQL) into ClickHouse.
Overview
We use self-hosted Airbyte instead of the hosted version to avoid per-data pricing and have full control over our data pipeline. This document provides a comprehensive guide to setting up and managing Airbyte.
Prerequisites
- AWS account with appropriate permissions
- Domain name: airbyte.pixlr.to
- Docker and Docker Compose installed
Deployment Options
Option 1: EC2 with abctl (Current Setup)
For our current production setup, we're using an EC2 instance managed by abctl with automated scheduling to optimize costs.
Key Features:
- Automated instance scheduling (see the sketch below)
- Cost-effective solution
- Simplified management with abctl
For detailed setup instructions, see EC2 with abctl (Current Setup).
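The automated instance scheduling can be driven by anything that calls the AWS API on a timer. The snippet below is only a sketch, assuming the AWS CLI is available with permission to stop and start the instance; the instance ID and times are placeholders, and the authoritative schedule lives in the setup guide linked above.
# Stop the Airbyte instance after hours and start it again in the morning
# (placeholder instance ID; run from cron on another host or an EventBridge Scheduler target)
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 start-instances --instance-ids i-0123456789abcdef0
# Example crontab entries (UTC) on a host with suitable IAM credentials:
# 0 20 * * 1-5  aws ec2 stop-instances  --instance-ids i-0123456789abcdef0
# 0 6 * * 1-5   aws ec2 start-instances --instance-ids i-0123456789abcdef0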
Option 2: ECS Fargate
- Pros: Auto-scaling, managed infrastructure
- Cons: More complex setup, higher cost
Airbyte Configuration
1. Docker Setup
# Install Docker
sudo yum install -y docker
# Add ec2-user to docker group
sudo usermod -aG docker ec2-user
# Start and enable Docker
sudo systemctl start docker
sudo systemctl enable docker
# Log out and log back in for group changes to take effect
exit
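After logging back in, a quick sanity check (a sketch) confirms the daemon is running and the group change took effect:
# Should print the Docker server version without sudo
docker info --format '{{.ServerVersion}}'
# Optional end-to-end test
docker run --rm hello-world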
For more information about deploying Airbyte, see the official Airbyte documentation.
2. Install Airbyte using abctl
Initial Installation
# Install abctl (Airbyte CLI)
curl -LsfS https://get.airbyte.com | bash -
export PATH=$HOME/.airbyte/abctl:$PATH
echo 'export PATH=$HOME/.airbyte/abctl:$PATH' >> ~/.bashrc
source ~/.bashrc
# Verify installation
abctl --version
Configure and Install Airbyte
# Set hostname
export AIRBYTE_HOSTNAME=$(curl -s http://169.254.169.254/latest/meta-data/public-hostname)
# Install with insecure cookies (for HTTP)
abctl local install --host $AIRBYTE_HOSTNAME --insecure-cookies
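Note: if the instance enforces IMDSv2, the plain metadata call above returns an empty hostname. A token-based variant (a sketch) looks like this:
# Fetch an IMDSv2 session token, then request the public hostname with it
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
export AIRBYTE_HOSTNAME=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/public-hostname)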
Verify Installation
# Check Kubernetes pods
export KUBECONFIG=$HOME/.airbyte/abctl/abctl.kubeconfig
kubectl get pods -n airbyte-abctl
# Check services
kubectl get svc -n airbyte-abctl
Start Port Forwarding
# Run in a screen/tmux session for persistence
kubectl port-forward --address 0.0.0.0 svc/airbyte-abctl-airbyte-server-svc 8080:8001 -n airbyte-abctl
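One way to keep the port forward alive after the SSH session ends (a sketch, assuming screen is installed):
# Start the port forward in a detached screen session
screen -dmS airbyte-portforward \
  kubectl --kubeconfig $HOME/.airbyte/abctl/abctl.kubeconfig \
  port-forward --address 0.0.0.0 svc/airbyte-abctl-airbyte-server-svc 8080:8001 -n airbyte-abctl
# Reattach later with: screen -r airbyte-portforward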
3. Configure and Install Airbyte
# Create data directory with permissions
mkdir -p ~/.airbyte
sudo chown -R 1000:1000 ~/.airbyte
sudo chmod -R 755 ~/.airbyte
# Install with the domain name
abctl local install --host airbyte.pixlr.to
# Get the login credentials
abctl local credentials
4. Access Airbyte Web UI
Current Setup (airbyte.pixlr.to)
- Access Method: Port forwarding via SSH
- URL: http://<my-ec2-ip>:8000
- Security:
  - Port 8000 exposed to your IP in the EC2 security group
  - Basic authentication only
Cloudflare Tunnel Setup
For the Cloudflare Tunnel configuration, follow the instructions in airbyte-cloudflare-setup.md.
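The authoritative steps live in that document; as a rough sketch, the tunnel config maps the public hostname to the local Airbyte port (the tunnel ID, credentials path, and local port below are placeholders/assumptions):
# ~/.cloudflared/config.yml (placeholders; see airbyte-cloudflare-setup.md for the real values)
cat > ~/.cloudflared/config.yml <<'EOF'
tunnel: <TUNNEL_ID>
credentials-file: /home/ec2-user/.cloudflared/<TUNNEL_ID>.json
ingress:
  - hostname: airbyte.pixlr.to
    service: http://localhost:8000
  - service: http_status:404
EOF
# Run the tunnel (e.g. under screen or a systemd service)
cloudflared tunnel run <TUNNEL_NAME>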
Authentication & Access
Initial Login Credentials
After installation, retrieve credentials using:
abctl local credentials
Example output:
Email: [not set]
Password: cTG45kaMtgl2yLL5NUhmEYKvjSJxXqgU
Client-Id: 63f35ea2-84e4-4863-ab32-7d78a2bebed5
Client-Secret: KgCD0IgRq7tl3qq8sEQYyukqaDK8IxMj
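The Client-Id and Client-Secret are for API access rather than the UI login. A sketch of exchanging them for an access token (the endpoint path follows the Airbyte public API convention and may differ between versions; host and credentials are placeholders):
# Request a short-lived access token for the Airbyte API
curl -s -X POST "http://airbyte.pixlr.to:8000/api/public/v1/applications/token" \
  -H "Content-Type: application/json" \
  -d '{"client_id": "<CLIENT_ID>", "client_secret": "<CLIENT_SECRET>"}'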
Workspace Configuration
Environment Variables
For our Airbyte setup, we've configured the following environment variables:
# In ~/.bashrc or ~/.bash_profile
export AIRBYTE_HOST=airbyte.pixlr.to
export AIRBYTE_PORT=8000
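These variables are mainly a convenience for scripts, for example a quick health check against the server (the /api/v1/health path is an assumption and may vary by version):
# Should return a small JSON health payload when the server is up
curl -s "http://$AIRBYTE_HOST:$AIRBYTE_PORT/api/v1/health"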
Workspace Settings
- Workspace Name: Pixlr Analytics
- Default Geography: AWS (North Virginia us-east-1)
- Email Notifications: Enabled for pipeline failures
User Management
Admin Users:
- admin@pixlr.to (Owner)
- itadmin@pixlr.com (Admin)
Data Sources
Pixlr (MongoDB)
- Collections: users, sessions, images, subscriptions, events
- Sync Frequency: Every 15 minutes
- Sync Type: Incremental
- Connection Details:
  - Host: [MongoDB Host]
  - Port: 27017
  - Database: [Database Name]
  - Authentication: Username/Password
Designs.ai (MySQL)
- Tables: users, projects, subscriptions, usage_logs, billing
- Sync Frequency: Every 30 minutes
- Sync Type: Incremental
- Connection Details:
  - Host: [MySQL Host]
  - Port: 3306
  - Database: [Database Name]
  - Authentication: Username/Password
Destination
- ClickHouse Hosted Service
- Database: analytics
- Connection: Secure connection from AWS to ClickHouse
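Before configuring the connectors in the UI, it is worth confirming the Airbyte host can actually reach each system. A minimal sketch (hostnames are placeholders; the ClickHouse HTTPS port 8443 is an assumption for the hosted service):
# Check network reachability from the EC2 instance (requires nc / nmap-ncat)
nc -zv <mongodb-host> 27017
nc -zv <mysql-host> 3306
nc -zv <clickhouse-host> 8443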
Common Issues & Resolutions
Port Conflicts
# Check for processes using port 8000
sudo lsof -i :8000
# If needed, use alternative port
kubectl port-forward --address 0.0.0.0 svc/airbyte-abctl-airbyte-server-svc 8080:8001 -n airbyte-abctl
Reinstallation
# Uninstall existing installation
abctl local uninstall --persisted
# Clean up
sudo rm -rf ~/.airbyte
sudo rm -rf ~/.kube
# Reinstall
abctl local install --host $AIRBYTE_HOSTNAME --insecure-cookies
Pod Issues
# Check pod status
kubectl get pods -n airbyte-abctl
# View logs for a specific pod
kubectl logs -f <pod-name> -n airbyte-abctl
# Delete and restart a problematic pod
kubectl delete pod <pod-name> -n airbyte-abctl
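If the logs alone do not explain a crash loop, describing the pod and checking recent events usually does (a sketch):
# Show container state, restart counts, and scheduling problems
kubectl describe pod <pod-name> -n airbyte-abctl
# Recent events in the namespace, newest last
kubectl get events -n airbyte-abctl --sort-by='.lastTimestamp'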
Maintenance & Troubleshooting
Common Issues
Permission Denied Errors
# Fix permissions
sudo chown -R 1000:1000 ~/.airbyte
sudo chmod -R 755 ~/.airbyte
Reset Admin Password
# Get container ID
docker ps | grep airbyte-server
# Reset password
docker exec -it [CONTAINER_ID] reset-password
Backup & Restore
To backup Airbyte configuration and data:
# Backup volumes
docker run --rm -v airbyte_workspace:/source -v $(pwd):/backup alpine tar czf /backup/airbyte_backup_$(date +%Y%m%d).tar.gz -C /source .
# Restore from backup
docker run --rm -v airbyte_workspace:/target -v $(pwd):/backup alpine sh -c "cd /target && tar xzf /backup/airbyte_backup_20230930.tar.gz"
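To make the backup routine, the same command can be scheduled from cron (a sketch; the destination path and schedule are placeholders, and % must be escaped in crontab entries):
# m h dom mon dow  command
0 2 * * * docker run --rm -v airbyte_workspace:/source -v /home/ec2-user/backups:/backup alpine tar czf /backup/airbyte_backup_$(date +\%Y\%m\%d).tar.gz -C /source .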
Key Benefits
- Cost Control: No per-data pricing
- Data Control: Full control over data processing
- Flexibility: Custom connectors and transformations
- Security: Data stays within our infrastructure
Next Steps
- ClickHouse - Hosted data warehouse
- SQLMesh - CI/CD data transformation