BigQuery - Check if table already exists

I have a dataset in BigQuery. This dataset contains multiple tables.

I am doing the following steps programmatically using the BigQuery API:

  1. Querying the tables in the dataset. Since my response is too large, I am enabling the allowLargeResults parameter and diverting the response to a destination table.

  2. I am then exporting the data from the destination table to a GCS bucket.
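
For reference, step 2 roughly looks like this on my side (a sketch only; the project, dataset, bucket, and export format are placeholders, and bq is my authorized BigQuery v2 service object):

# Sketch of step 2: export the destination table to a GCS bucket.
# All names are placeholders; 'bq' is an authorized bigquery v2 service object.
extract_job = {
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': 'my_project',
                'datasetId': 'my_dataset',
                'tableId': 'xyz',
            },
            'destinationUris': ['gs://my_bucket/xyz-*.json'],
            'destinationFormat': 'NEWLINE_DELIMITED_JSON',
        }
    }
}
bq.jobs().insert(projectId='my_project', body=extract_job).execute()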

Requirements:

  • If my process fails at Step 2, I would like to re-run this step.

  • But before I re-run, I would like to check/verify that the specific destination table named 'xyz' already exists in the dataset.

  • If it exists, I would like to re-run step 2.

  • If it does not exist, I would like to do foo.

How can I do this?

Thanks in advance.

Answers


Here is a Python snippet that will tell whether a table exists (deleting it in the process--careful!):

def doesTableExist(project_id, dataset_id, table_id):
  bq.tables().delete(
      projectId=project_id, 
      datasetId=dataset_id,
      tableId=table_id).execute()
  return False

Alternately, if you'd prefer not to delete the table in the process, you could try:

from googleapiclient.errors import HttpError  # older samples import this from apiclient.errors

def doesTableExist(project_id, dataset_id, table_id):
  try:
    bq.tables().get(
        projectId=project_id,
        datasetId=dataset_id,
        tableId=table_id).execute()
    return True
  except HttpError as err:
    if err.resp.status != 404:
      raise
    return False

If you want to know where bq came from, you can call build_bq_client from here: http://code.google.com/p/bigquery-e2e/source/browse/samples/ch12/auth.py
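
In case that link goes away, here is a minimal sketch of building an equivalent bq service object with the google-api-python-client, assuming application-default credentials are configured:

# Minimal sketch: build a BigQuery v2 service object.
# Relies on application-default credentials (gcloud auth application-default login).
from googleapiclient.discovery import build

bq = build('bigquery', 'v2')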

In general, if you're using this to test whether you should run a job that will modify the table, it can be a good idea to just do the job anyway, and use WRITE_TRUNCATE as a write disposition.
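
For example, with the google-cloud-bigquery client a query job can be pointed at the destination table with WRITE_TRUNCATE, so re-running it simply overwrites whatever is already there (a sketch; the dataset, table, and query are placeholders):

# Sketch: a re-runnable query job that overwrites its destination table.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig()
job_config.destination = client.dataset('my_dataset').table('xyz')
job_config.allow_large_results = True
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

query_job = client.query('SELECT 1 AS col', job_config=job_config)
query_job.result()  # wait for completion; safe to run again after a failure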

Another approach is to create a predictable job ID and retry the job with that ID. If the job already exists, the job already ran (you might want to double-check to make sure the job didn't fail, however).
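
A sketch of that approach with the google-cloud-bigquery client (the job id and GCS URI below are made up; a Conflict means a job with that id was already submitted):

# Sketch: deterministic job id, so a retry reuses the original job.
from google.cloud import bigquery
from google.api_core.exceptions import Conflict

client = bigquery.Client()
job_id = 'export-xyz-20180101'  # deterministic id for this logical step

try:
    job = client.extract_table(
        client.dataset('my_dataset').table('xyz'),
        'gs://my_bucket/xyz-*.json',
        job_id=job_id)
except Conflict:
    # A job with this id already exists; fetch it instead of re-submitting.
    job = client.get_job(job_id)

job.result()  # raises if the job failed, so failures are still surfaced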


Alex F's solution works on v0.27, but will not work on later versions. To migrate to v0.28+, the solution below will work:

from google.cloud import bigquery
from google.cloud.exceptions import NotFound

project_nm = 'gc_project_nm'
dataset_nm = 'ds_nm'
table_nm = 'tbl_nm'

client = bigquery.Client(project=project_nm)
dataset = client.dataset(dataset_nm)
table_ref = dataset.table(table_nm)

def if_tbl_exists(client, table_ref):
    try:
        client.get_table(table_ref)  # raises NotFound if the table is missing
        return True
    except NotFound:
        return False

if_tbl_exists(client, table_ref)

Enjoy:

def doesTableExist(bigquery, project_id, dataset_id, table_id):
    try:
        bigquery.tables().get(
            projectId=project_id, 
            datasetId=dataset_id,
            tableId=table_id).execute()
        return True
    except Exception as err:
        if err.resp.status != 404:
            raise
        return False

The edit relative to the earlier snippet is in the except clause: it catches a generic Exception instead of HttpError, and the service object is passed in as the bigquery argument.


With my_bigquery being an instance of class google.cloud.bigquery.Client (already authenticated and associated with a project):

my_bigquery.dataset(dataset_name).table(table_name).exists()  # returns boolean

It makes an API call to test for the existence of the table via a GET request.

Source: https://googlecloudplatform.github.io/google-cloud-python/0.24.0/bigquery-table.html#google.cloud.bigquery.table.Table.exists

It works for me using version 0.27 of the Google BigQuery Python module.
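
A fuller usage sketch, assuming google-cloud-bigquery 0.27 or earlier as noted above (project, dataset, and table names are placeholders):

# Only valid for google-cloud-bigquery <= 0.27, where Table still has .exists().
from google.cloud import bigquery

my_bigquery = bigquery.Client(project='my_project')
table = my_bigquery.dataset('my_dataset').table('my_table')
print(table.exists())  # True or False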


Inline SQL Alternative

tarheel's answer is probably the most correct at this point in time, but I was considering the comment from Ivan above that "404 could also mean the resource is not there for a bunch of reasons", so here is a solution that should always successfully run a metadata query and return a result.

It's not the fastest approach, because it always has to run a query, and BigQuery has overhead for small queries.

A trick I've seen previously is to query information_schema for a (table) object, and union that to a fake query that ensures a record is always returned even if the object doesn't exist. There's also a LIMIT 1 and an ordering to ensure the single record returned represents the table, if it does exist. See the SQL in the code below.

  • In spite of doc claims that BigQuery standard SQL is ISO compliant, it doesn't support information_schema, but it does have __TABLES_SUMMARY__
  • A dataset is required because you can't query __TABLES_SUMMARY__ without specifying a dataset
  • The dataset is not a parameter in the SQL because you can't parameterize object names without SQL injection issues (apart from with the magical _TABLE_SUFFIX, see https://cloud.google.com/bigquery/docs/querying-wildcard-tables )

#!/usr/bin/env python
"""
Inline SQL way to check a table exists in Bigquery

e.g.
print(table_exists(dataset_name='<dataset_goes_here>', table_name='<real_table_name>'))
True

print(table_exists(dataset_name='<dataset_goes_here>', table_name='imaginary_table_name'))
False
"""
from __future__ import print_function
from google.cloud import bigquery


def table_exists(dataset_name, table_name):
    client = bigquery.Client()
    query = """
        SELECT table_exists FROM
        (
          SELECT true as table_exists, 1 as ordering
          FROM __TABLES_SUMMARY__ WHERE table_id = @table_name
          UNION ALL
          SELECT false as table_exists, 2 as ordering
        ) ORDER BY ordering LIMIT 1"""
    query_params = [bigquery.ScalarQueryParameter('table_name', 'STRING', table_name)]
    job_config = bigquery.QueryJobConfig()
    job_config.query_parameters = query_params

    if dataset_name is not None:
        dataset_ref = client.dataset(dataset_name)
        job_config.default_dataset = dataset_ref

    query_job = client.query(
        query,
        job_config=job_config
    )

    results = query_job.result()
    for row in results:
        # There is only one row because LIMIT 1 in the SQL
        return row.table_exists
