AWS Glue Publishing
LakeXpress creates AWS Glue Data Catalog tables from exported Parquet files, enabling queries via Athena, Redshift Spectrum, and EMR.
Table of Contents
Prerequisites
1. AWS Glue Permissions
Required IAM permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:CreateDatabase",
"glue:GetDatabase",
"glue:DeleteDatabase",
"glue:CreateTable",
"glue:GetTable",
"glue:DeleteTable",
"glue:UpdateTable",
"glue:GetTables",
"glue:BatchCreatePartition",
"glue:GetPartitions"
],
"Resource": "*"
}
]
}
2. S3 Storage Setup
Both S3 and Glue credentials are required in credentials.json:
{
"aws_s3_datalake": {
"ds_type": "s3",
"auth_mode": "profile",
"info": {
"directory": "s3://my-datalake-bucket/lakexpress/",
"profile": "my-aws-profile"
}
},
"glue_catalog": {
"ds_type": "aws_glue",
"auth_mode": "profile",
"info": {
"profile": "my-aws-profile",
"region": "us-east-1"
}
}
}
Critical: The S3 bucket must be accessible from the Glue Data Catalog in the specified region.
Authentication Modes
Profile Authentication (Recommended)
Uses AWS CLI credentials from ~/.aws/credentials:
{
"glue_catalog": {
"ds_type": "aws_glue",
"auth_mode": "profile",
"info": {
"profile": "my-aws-profile",
"region": "us-east-1"
}
}
}
- No secrets in config files
- Supports MFA and SSO profiles
- Easy credential rotation
Keys Authentication
Uses explicit AWS access keys:
{
"glue_catalog": {
"ds_type": "aws_glue",
"auth_mode": "keys",
"info": {
"aws_access_key_id": "AKIAIOSFODNN7EXAMPLE",
"aws_secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
"region": "us-east-1"
}
}
}
Suited for CI/CD pipelines, containers, or cross-account access without AWS CLI.
Role Authentication
Assumes an IAM role via STS:
{
"glue_catalog": {
"ds_type": "aws_glue",
"auth_mode": "role",
"info": {
"role_arn": "arn:aws:iam::123456789012:role/GluePublishRole",
"external_id": "optional-external-id",
"region": "us-east-1"
}
}
}
Uses temporary credentials. Ideal for cross-account publishing and fine-grained access control.
Configuration Options
| Option | Description | Default |
|---|---|---|
--publish_target ID | Credential ID for Glue target (required) | - |
--publish_schema_pattern PATTERN | Glue database naming pattern | {schema} |
--publish_table_pattern PATTERN | Table naming pattern | {table} |
--glue_skip_existing | Skip existing tables instead of recreating | false |
--n_jobs N | Parallel workers for table creation | 1 |
Dynamic Naming Patterns
Database and table names support token-based patterns.
Supported Tokens
| Token | Description | Example Output |
|---|---|---|
{schema} | Source schema name | tpch_1 |
{table} | Source table name | customer |
{database} | Source database name | tpch |
{date} | Current date (YYYYMMDD) | 20251210 |
{timestamp} | Current timestamp (YYYYMMDD_HHMMSS) | 20251210_143022 |
{uuid} | UUID4 (consistent per run) | a1b2c3d4-... |
{subpath} | CLI --sub_path value | staging |
Common Patterns
Date-Partitioned Databases
lakexpress \
--publish_schema_pattern "lx_{schema}_{date}" \
--publish_table_pattern "{table}" \
--publish_target glue_catalog \
...
# Results:
# Database: lx_tpch_1_20251210
# Tables: customer, orders, lineitem
Prefixed Databases
lakexpress \
--publish_schema_pattern "datalake_{schema}" \
--publish_table_pattern "{table}" \
--publish_target glue_catalog \
...
# Results:
# Database: datalake_tpch_1
# Tables: customer, orders, lineitem
Consolidated Multi-Schema
lakexpress \
--source_schema_name schema1,schema2 \
--publish_schema_pattern "consolidated" \
--publish_table_pattern "{schema}_{table}" \
--publish_target glue_catalog \
...
# Results:
# Database: consolidated
# Tables: schema1_customer, schema2_customer
Usage Examples
Example 1: Basic Export to Glue
lakexpress -a credentials.json \
--lxdb_auth_id lxdb \
--source_db_auth_id postgres_prod \
--source_schema_name public \
--target_storage_id aws_s3_datalake \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target glue_catalog
Creates database public with all tables from the public schema.
Example 2: Custom Naming with Parallel Execution
lakexpress -a credentials.json \
--lxdb_auth_id lxdb \
--source_db_auth_id postgres_prod \
--source_schema_name tpch_1 \
--target_storage_id aws_s3_datalake \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target glue_catalog \
--publish_schema_pattern "lx_{schema}" \
--publish_table_pattern "{table}" \
--n_jobs 4
Creates database lx_tpch_1 with tables built in parallel.
Example 3: Skip Existing Tables
lakexpress -a credentials.json \
--lxdb_auth_id lxdb \
--source_db_auth_id postgres_prod \
--source_schema_name tpch_1 \
--target_storage_id aws_s3_datalake \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target glue_catalog \
--glue_skip_existing
Example 4: Date-Based Snapshots
lakexpress -a credentials.json \
--lxdb_auth_id lxdb \
--source_db_auth_id postgres_prod \
--source_schema_name sales \
--target_storage_id aws_s3_datalake \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target glue_catalog \
--publish_schema_pattern "sales_{date}" \
--sub_path "daily/$(date +%Y%m%d)"
Creates database sales_20251210, data stored at s3://bucket/lakexpress/daily/20251210/.
See Also
- AWS Glue Reference - Data type mappings, CLI arguments, querying
- CLI Reference - All command-line options
- Examples & Recipes - Working command examples