
BigQuery Publishing

LakeXpress can automatically create BigQuery tables from Parquet files stored in GCS.


Prerequisites

1. Google Cloud Permissions

Required IAM roles:

  • BigQuery Data Editor (roles/bigquery.dataEditor) - Create datasets and tables
  • BigQuery Job User (roles/bigquery.jobUser) - Run load jobs (native tables)
  • Storage Object Viewer (roles/storage.objectViewer) - Read GCS files (external tables)

Or a custom role with these permissions:

bigquery.datasets.create
bigquery.datasets.get
bigquery.tables.create
bigquery.tables.get
bigquery.tables.delete
bigquery.tables.updateData
bigquery.jobs.create
storage.objects.get
storage.objects.list

2. GCS Storage Setup

Your credentials.json needs both GCS and BigQuery entries:

{
  "gcs_datalake": {
    "ds_type": "gcs",
    "auth_mode": "profile",
    "info": {
      "directory": "gs://my-datalake-bucket/lakexpress/",
      "profile": "/path/to/service-account.json"
    }
  },
  "bigquery_prod": {
    "ds_type": "bigquery",
    "auth_mode": "profile",
    "info": {
      "project_id": "my-gcp-project",
      "location": "US",
      "profile": "/path/to/service-account.json"
    }
  }
}

The same service account JSON file can be used for both GCS and BigQuery.
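Misconfigured credentials are easiest to catch before a run. The sketch below is illustrative only (LakeXpress performs its own validation); it checks a credentials.json payload against the fields shown in the examples above.

```python
import json

# Fields each entry type must carry, per the credential examples above.
REQUIRED_INFO = {
    "gcs": ["directory", "profile"],
    "bigquery": ["project_id", "profile"],  # "location" is optional
}

def check_credentials(text: str) -> list[str]:
    """Return a list of problems found in a credentials.json payload."""
    problems = []
    creds = json.loads(text)
    for name, entry in creds.items():
        ds_type = entry.get("ds_type")
        if ds_type not in REQUIRED_INFO:
            problems.append(f"{name}: unknown ds_type {ds_type!r}")
            continue
        if entry.get("auth_mode") != "profile":
            problems.append(f"{name}: auth_mode must be 'profile'")
        info = entry.get("info", {})
        for field in REQUIRED_INFO[ds_type]:
            if field not in info:
                problems.append(f"{name}: missing info.{field}")
    return problems
```

An empty result means every entry carries the fields this page documents; anything else names the entry and the missing or invalid field.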

Authentication

LakeXpress uses service account authentication for BigQuery.

{
  "bigquery_prod": {
    "ds_type": "bigquery",
    "auth_mode": "profile",
    "info": {
      "project_id": "my-gcp-project",
      "location": "US",
      "profile": "/path/to/service-account.json"
    }
  }
}

| Field | Required | Description |
|---|---|---|
| ds_type | Yes | Must be "bigquery" |
| auth_mode | Yes | Must be "profile" |
| project_id | Yes | GCP project ID where datasets will be created |
| location | No | BigQuery dataset location (default: "US") |
| profile | Yes | Path to service account JSON key file |

Common locations: US, EU, us-central1, europe-west1.

Creating a Service Account

  1. Go to Google Cloud Console > IAM & Admin > Service Accounts
  2. Create a service account with BigQuery and Storage permissions
  3. Create and download a JSON key
  4. Reference the key file path in your credentials

Configuration Options

| Option | Description | Default |
|---|---|---|
| --publish_target ID | Credential ID for BigQuery target (required) | - |
| --publish_schema_pattern PATTERN | BigQuery dataset naming pattern | {schema} |
| --publish_table_pattern PATTERN | Table naming pattern | {table} |
| --publish_method TYPE | Table type: external or internal | external |
| --n_jobs N | Parallel workers for table creation | 1 |

Table Types

External Tables (Default)

Queries data directly from GCS Parquet files without loading into BigQuery storage.

lakexpress ... --publish_target bigquery_prod --publish_method external

Advantages:

  • Fast publishing, no data copying
  • No BigQuery storage costs
  • Always reflects the latest GCS data
  • Best for large datasets with infrequent queries

Trade-offs:

  • Query speed depends on GCS read performance
  • No clustering or partitioning support

Native Tables

Loads data from GCS into BigQuery's columnar storage.

lakexpress ... --publish_target bigquery_prod --publish_method internal

Advantages:

  • Faster queries via optimized columnar storage
  • Full BigQuery features (clustering, partitioning, materialized views)
  • Best for frequently queried data

Trade-offs:

  • Slower publishing due to data loading
  • Incurs BigQuery storage costs
  • Data is a point-in-time snapshot

Dynamic Naming Patterns

Dataset and table names support token-based patterns.

Supported Tokens

| Token | Description | Example Output |
|---|---|---|
| {schema} | Source schema name | tpch_1 |
| {table} | Source table name | customer |
| {database} | Source database name | tpch |
| {date} | Current date (YYYYMMDD) | 20251210 |
| {timestamp} | Current timestamp (YYYYMMDD_HHMMSS) | 20251210_143022 |
| {uuid} | UUID4 identifier (consistent per run) | a1b2c3d4 |
| {subpath} | CLI --sub_path parameter | staging |

Common Patterns

Prefixed Datasets

lakexpress \
--publish_schema_pattern "lx_{schema}" \
--publish_table_pattern "{table}" \
--publish_target bigquery_prod \
...

# Results:
# Dataset: lx_tpch_1
# Tables: customer, orders, lineitem

Date-Partitioned Datasets

lakexpress \
--publish_schema_pattern "{schema}_{date}" \
--publish_table_pattern "{table}" \
--publish_target bigquery_prod \
...

# Results:
# Dataset: tpch_1_20251210
# Tables: customer, orders, lineitem

Consolidated Multi-Schema

lakexpress \
--source_schema_name schema1,schema2 \
--publish_schema_pattern "consolidated" \
--publish_table_pattern "{schema}_{table}" \
--publish_target bigquery_prod \
...

# Results:
# Dataset: consolidated
# Tables: schema1_customer, schema2_customer
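When several schemas land in one dataset, the table pattern must keep published names unique: a pattern without {schema} would collide for same-named tables such as customer. A small illustrative check (an assumed helper, not part of LakeXpress):

```python
from collections import Counter

def find_collisions(table_pattern: str,
                    tables: list[tuple[str, str]]) -> list[str]:
    """Return published names that more than one (schema, table) maps to."""
    names = Counter(
        table_pattern.format(schema=schema, table=table)
        for schema, table in tables
    )
    return sorted(name for name, count in names.items() if count > 1)
```

With the {schema}_{table} pattern above, schema1.customer and schema2.customer publish as distinct tables; with a bare {table} pattern they would collide.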

Usage Examples

Example 1: External Tables (Default)

lakexpress -a credentials.json \
--lxdb_auth_id lxdb \
--source_db_auth_id postgres_prod \
--source_schema_name public \
--target_storage_id gcs_datalake \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target bigquery_prod

Creates dataset public with external tables pointing to GCS Parquet files.

Example 2: Native Tables with Parallel Execution

lakexpress -a credentials.json \
--lxdb_auth_id lxdb \
--source_db_auth_id postgres_prod \
--source_schema_name tpch_1 \
--target_storage_id gcs_datalake \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target bigquery_prod \
--publish_method internal \
--publish_schema_pattern "lx_{schema}" \
--n_jobs 4

Creates dataset lx_tpch_1 with native tables loaded from GCS.

Example 3: Daily Snapshots

lakexpress -a credentials.json \
--lxdb_auth_id lxdb \
--source_db_auth_id postgres_prod \
--source_schema_name sales \
--target_storage_id gcs_datalake \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target bigquery_prod \
--publish_schema_pattern "sales_{date}" \
--sub_path "daily/$(date +%Y%m%d)"

Creates dataset sales_20251210, with data stored in gs://bucket/lakexpress/daily/20251210/.

Example 4: EU Data Residency

{
  "bigquery_eu": {
    "ds_type": "bigquery",
    "auth_mode": "profile",
    "info": {
      "project_id": "my-eu-project",
      "location": "EU",
      "profile": "/path/to/service-account.json"
    }
  }
}

lakexpress -a credentials.json \
--lxdb_auth_id lxdb \
--source_db_auth_id postgres_prod \
--source_schema_name gdpr_data \
--target_storage_id gcs_eu \
--fastbcp_dir_path /path/to/FastBCP \
--publish_target bigquery_eu


Copyright © 2026 Architecture & Performance.