Version: 0.4 (Latest)

AWS Glue Publishing

LakeXpress creates AWS Glue Data Catalog tables from exported Parquet files, enabling queries via Athena, Redshift Spectrum, and EMR.

Prerequisites
Authentication Modes
Configuration Options
Dynamic Naming Patterns
Usage Examples

Prerequisites

1. AWS Glue Permissions

Required IAM permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:GetDatabase",
                "glue:DeleteDatabase",
                "glue:CreateTable",
                "glue:GetTable",
                "glue:DeleteTable",
                "glue:UpdateTable",
                "glue:GetTables",
                "glue:BatchCreatePartition",
                "glue:GetPartitions"
            ],
            "Resource": "*"
        }
    ]
}
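
To pre-check an existing policy document against the list above, the comparison can be scripted. The following is a minimal sketch using only the Python standard library; the function name `missing_glue_actions` is hypothetical and not part of LakeXpress. It looks only at `Action` entries in `Allow` statements and ignores `Deny` statements, `NotAction`, resources, and conditions, so it is a rough check rather than a full IAM evaluation.

```python
import json
from fnmatch import fnmatchcase

# Actions listed in the policy above.
REQUIRED_ACTIONS = [
    "glue:CreateDatabase", "glue:GetDatabase", "glue:DeleteDatabase",
    "glue:CreateTable", "glue:GetTable", "glue:DeleteTable",
    "glue:UpdateTable", "glue:GetTables",
    "glue:BatchCreatePartition", "glue:GetPartitions",
]


def missing_glue_actions(policy_json: str) -> list:
    """Return the required actions that no Allow statement grants."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    allowed = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # "Action" may be a string or a list
            actions = [actions]
        allowed.extend(a.lower() for a in actions)

    # Compare in lower case and honor wildcards such as "glue:Get*".
    return [
        req for req in REQUIRED_ACTIONS
        if not any(fnmatchcase(req.lower(), pattern) for pattern in allowed)
    ]
```

An empty result means every action in the list is covered by at least one `Allow` statement of that policy document.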

2. S3 Storage Setup

Both S3 and Glue credentials are required in credentials.json:

{
    "aws_s3_datalake": {
        "ds_type": "s3",
        "auth_mode": "profile",
        "info": {
            "directory": "s3://my-datalake-bucket/lakexpress/",
            "profile": "my-aws-profile"
        }
    },
    "glue_catalog": {
        "ds_type": "aws_glue",
        "auth_mode": "profile",
        "info": {
            "profile": "my-aws-profile",
            "region": "us-east-1"
        }
    }
}

Tip: Both "aws_glue" and "glue" are accepted as the ds_type value.

Critical: The S3 bucket must be accessible from the Glue Data Catalog in the specified region.
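
Before a run, it can help to confirm that credentials.json really contains both kinds of entry. The sketch below is an illustration only: `find_entries` is a hypothetical helper, not a LakeXpress function, and it assumes nothing beyond the `ds_type` values shown above.

```python
import json

# Both spellings are accepted for the Glue entry.
GLUE_TYPES = {"aws_glue", "glue"}


def find_entries(credentials: dict) -> tuple:
    """Return (s3_ids, glue_ids) found in a parsed credentials.json."""
    s3_ids = [cid for cid, entry in credentials.items()
              if entry.get("ds_type") == "s3"]
    glue_ids = [cid for cid, entry in credentials.items()
                if entry.get("ds_type") in GLUE_TYPES]
    return s3_ids, glue_ids


example = json.loads("""
{
    "aws_s3_datalake": {
        "ds_type": "s3",
        "auth_mode": "profile",
        "info": {"directory": "s3://my-datalake-bucket/lakexpress/",
                 "profile": "my-aws-profile"}
    },
    "glue_catalog": {
        "ds_type": "aws_glue",
        "auth_mode": "profile",
        "info": {"profile": "my-aws-profile", "region": "us-east-1"}
    }
}
""")

s3_ids, glue_ids = find_entries(example)
if not s3_ids or not glue_ids:
    raise SystemExit("credentials.json needs both an S3 and a Glue entry")
print("S3 entries:", s3_ids, "Glue entries:", glue_ids)
```

The IDs it reports are the values to pass to --target_storage_id and --publish_target respectively.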

Authentication Modes

Profile Authentication (Recommended)

Uses AWS CLI credentials from ~/.aws/credentials:

{
    "glue_catalog": {
        "ds_type": "aws_glue",
        "auth_mode": "profile",
        "info": {
            "profile": "my-aws-profile",
            "region": "us-east-1"
        }
    }
}

Advantages: no secrets in config files, support for MFA and SSO profiles, and easy credential rotation.

Keys Authentication

Uses explicit AWS access keys:

{
    "glue_catalog": {
        "ds_type": "aws_glue",
        "auth_mode": "keys",
        "info": {
            "aws_access_key_id": "AKIAIOSFODNN7EXAMPLE",
            "aws_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
            "region": "us-east-1"
        }
    }
}

Suited for CI/CD pipelines, containers, or cross-account access without AWS CLI.

Role Authentication

Assumes an IAM role via STS:

{
    "glue_catalog": {
        "ds_type": "aws_glue",
        "auth_mode": "role",
        "info": {
            "role_arn": "arn:aws:iam::123456789012:role/GluePublishRole",
            "external_id": "optional-external-id",
            "region": "us-east-1"
        }
    }
}

Uses temporary credentials. Ideal for cross-account publishing and fine-grained access control.
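
The three modes differ only in the keys placed under "info". As a rough pre-flight check, the sketch below compares an entry against the keys used in the examples above. The required-key sets are inferred from those examples and are an assumption, not LakeXpress's actual validation rules; `check_glue_entry` is a hypothetical name.

```python
# Keys inferred from the three examples above (assumption).
REQUIRED_INFO_KEYS = {
    "profile": {"profile", "region"},
    "keys": {"aws_access_key_id", "aws_secret_access_key", "region"},
    "role": {"role_arn", "region"},  # external_id is shown as optional
}


def check_glue_entry(entry: dict) -> list:
    """Return a list of problems found in one Glue credential entry."""
    mode = entry.get("auth_mode")
    if mode not in REQUIRED_INFO_KEYS:
        return [f"unknown auth_mode: {mode!r}"]
    missing = REQUIRED_INFO_KEYS[mode] - set(entry.get("info", {}))
    return [f"missing info.{key}" for key in sorted(missing)]


role_entry = {
    "ds_type": "aws_glue",
    "auth_mode": "role",
    "info": {
        "role_arn": "arn:aws:iam::123456789012:role/GluePublishRole",
        "region": "us-east-1",
    },
}
print(check_glue_entry(role_entry))
```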

Understanding Glue Database Naming

No target database setting needed

Unlike traditional databases, the AWS Glue Data Catalog has no server or database instance to connect to. What Glue calls a "database" is simply a namespace — a container for organizing tables within the catalog. The catalog itself (scoped by catalog_id, which defaults to your AWS account) is the top-level container.

This is why there is no --publish_database_name option for Glue. Instead, --publish_schema_pattern directly controls the Glue database name that will be created. For example, --publish_schema_pattern "datalake_{schema}" creates a Glue database called datalake_tpch_1.

Configuration Options

Option	Description	Default
`--publish_target ID`	Credential ID for Glue target (required)	-
`--publish_schema_pattern PATTERN`	Glue database naming pattern	`{schema}`
`--publish_table_pattern PATTERN`	Table naming pattern	`{table}`
`--glue_skip_existing`	Skip existing tables instead of recreating	`false`
`--n_jobs N`	Parallel workers for table creation	`1`

Dynamic Naming Patterns

Database and table names support token-based patterns.

Token restrictions

The {table} token can only be used in --publish_table_pattern, not in --publish_schema_pattern. Since the schema pattern controls the Glue database name, using {table} there would attempt to create a separate database per table, which is not supported. LakeXpress validates this before starting the export and will report an error.
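
The restriction above can be illustrated with a short check. This is a sketch of the idea, not LakeXpress's own validation code; the function names are hypothetical, and it relies on Python's `string.Formatter` to extract the `{token}` names from a pattern.

```python
import string


def tokens_in(pattern: str) -> set:
    """Return the {token} names used in a naming pattern."""
    return {name for _, name, _, _ in string.Formatter().parse(pattern) if name}


def check_schema_pattern(pattern: str) -> None:
    """Raise ValueError if the schema pattern uses the {table} token."""
    if "table" in tokens_in(pattern):
        raise ValueError("{table} is not allowed in --publish_schema_pattern")


check_schema_pattern("lx_{schema}_{date}")  # passes silently

try:
    check_schema_pattern("{schema}_{table}")
except ValueError as err:
    print("rejected:", err)
```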

Supported Tokens

Token	Description	Example Output
`{schema}`	Source schema name	`tpch_1`
`{table}`	Source table name	`customer`
`{database}`	Source database name	`tpch`
`{date}`	Current date (YYYYMMDD)	`20251210`
`{timestamp}`	Current timestamp (YYYYMMDD_HHMMSS)	`20251210_143022`
`{uuid}`	UUID4 (consistent per run)	`a1b2c3d4-...`
`{subpath}`	CLI `--sub_path` value	`staging`
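
To preview what a pattern would produce, the token table can be mimicked in a few lines of Python. The sketch below is an approximation for experimenting with patterns, not the LakeXpress implementation: `make_context` and `expand` are hypothetical names, and the date and timestamp formats are taken from the table above.

```python
import uuid
from datetime import datetime


def make_context(schema, table, database, sub_path="", now=None, run_id=None):
    """Build the token values for one table of one run."""
    now = now or datetime.now()
    return {
        "schema": schema,
        "table": table,
        "database": database,
        "date": now.strftime("%Y%m%d"),
        "timestamp": now.strftime("%Y%m%d_%H%M%S"),
        # Pass the same run_id for every table so the value stays
        # consistent within a run.
        "uuid": run_id or str(uuid.uuid4()),
        "subpath": sub_path,
    }


def expand(pattern: str, context: dict) -> str:
    """Substitute {token} placeholders in a naming pattern."""
    return pattern.format(**context)


ctx = make_context("tpch_1", "customer", "tpch", sub_path="staging",
                   now=datetime(2025, 12, 10, 14, 30, 22))
print(expand("lx_{schema}_{date}", ctx))  # lx_tpch_1_20251210
print(expand("{schema}_{table}", ctx))    # tpch_1_customer
```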

Common Patterns

Date-Partitioned Databases

lakexpress \
  --publish_schema_pattern "lx_{schema}_{date}" \
  --publish_table_pattern "{table}" \
  --publish_target glue_catalog \
  ...

# Results:
# Database: lx_tpch_1_20251210
# Tables: customer, orders, lineitem

Prefixed Databases

lakexpress \
  --publish_schema_pattern "datalake_{schema}" \
  --publish_table_pattern "{table}" \
  --publish_target glue_catalog \
  ...

# Results:
# Database: datalake_tpch_1
# Tables: customer, orders, lineitem

Consolidated Multi-Schema

lakexpress \
  --source_schema_name schema1,schema2 \
  --publish_schema_pattern "consolidated" \
  --publish_table_pattern "{schema}_{table}" \
  --publish_target glue_catalog \
  ...

# Results:
# Database: consolidated
# Tables: schema1_customer, schema2_customer
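
When several schemas share one Glue database, the table pattern has to keep same-named tables apart, which is why the example above prefixes each table with its schema. The sketch below shows the idea with a hypothetical helper, `find_collisions`; how LakeXpress itself reacts to two tables mapping to the same name is not described here, so this is only a way to spot the problem in advance.

```python
from collections import defaultdict


def find_collisions(tables, table_pattern):
    """Map each published name to its sources, keeping only duplicates.

    `tables` is an iterable of (schema, table) pairs.
    """
    by_name = defaultdict(list)
    for schema, table in tables:
        name = table_pattern.format(schema=schema, table=table)
        by_name[name].append((schema, table))
    return {name: srcs for name, srcs in by_name.items() if len(srcs) > 1}


tables = [("schema1", "customer"), ("schema2", "customer")]
print(find_collisions(tables, "{table}"))           # both map to "customer"
print(find_collisions(tables, "{schema}_{table}"))  # {}
```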

Usage Examples

Example 1: Basic Export to Glue

lakexpress -a credentials.json \
    --lxdb_auth_id lxdb \
    --source_db_auth_id postgres_prod \
    --source_schema_name public \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target glue_catalog

Creates database public with all tables from the public schema.

Example 2: Custom Naming with Parallel Execution

lakexpress -a credentials.json \
    --lxdb_auth_id lxdb \
    --source_db_auth_id postgres_prod \
    --source_schema_name tpch_1 \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target glue_catalog \
    --publish_schema_pattern "lx_{schema}" \
    --publish_table_pattern "{table}" \
    --n_jobs 4

Creates database lx_tpch_1 with tables built in parallel.

Example 3: Skip Existing Tables

lakexpress -a credentials.json \
    --lxdb_auth_id lxdb \
    --source_db_auth_id postgres_prod \
    --source_schema_name tpch_1 \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target glue_catalog \
    --glue_skip_existing

Tables that already exist in the target Glue database are skipped instead of being recreated.

Example 4: Date-Based Snapshots

lakexpress -a credentials.json \
    --lxdb_auth_id lxdb \
    --source_db_auth_id postgres_prod \
    --source_schema_name sales \
    --target_storage_id aws_s3_datalake \
    --fastbcp_dir_path /path/to/FastBCP \
    --publish_target glue_catalog \
    --publish_schema_pattern "sales_{date}" \
    --sub_path "daily/$(date +%Y%m%d)"

When run on 2025-12-10, creates database sales_20251210; the data is stored at s3://bucket/lakexpress/daily/20251210/.
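
For scheduled jobs, the commands above are often assembled by a script rather than typed by hand. The following is a minimal sketch of such a wrapper: `build_command` is a hypothetical helper, the credential IDs and paths are simply the placeholder values from the examples, and only flags that appear on this page are used.

```python
import shlex
from typing import Optional


def build_command(
    source_schema: str,
    schema_pattern: Optional[str] = None,
    table_pattern: Optional[str] = None,
    n_jobs: Optional[int] = None,
    skip_existing: bool = False,
) -> list:
    """Build a lakexpress argument list mirroring the examples above."""
    cmd = [
        "lakexpress", "-a", "credentials.json",
        "--lxdb_auth_id", "lxdb",
        "--source_db_auth_id", "postgres_prod",
        "--source_schema_name", source_schema,
        "--target_storage_id", "aws_s3_datalake",
        "--fastbcp_dir_path", "/path/to/FastBCP",
        "--publish_target", "glue_catalog",
    ]
    if schema_pattern is not None:
        cmd += ["--publish_schema_pattern", schema_pattern]
    if table_pattern is not None:
        cmd += ["--publish_table_pattern", table_pattern]
    if n_jobs is not None:
        cmd += ["--n_jobs", str(n_jobs)]
    if skip_existing:
        cmd.append("--glue_skip_existing")
    return cmd


# Equivalent of Example 2.
cmd = build_command("tpch_1", schema_pattern="lx_{schema}",
                    table_pattern="{table}", n_jobs=4)
print(shlex.join(cmd))  # shell-quoted form, for logging or copy-paste
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting issues with the braces in the patterns.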
