BigQuery - Check if table already exists
I have a dataset in BigQuery. This dataset contains multiple tables.
I am doing the following steps programmatically using the BigQuery API:
1. Querying the tables in the dataset - since my response is too large, I am enabling the allowLargeResults parameter and diverting my response to a destination table.
2. Exporting the data from the destination table to a GCS bucket.
Suppose my process fails at Step 2; I would like to re-run that step.
But before I re-run, I would like to check/verify that the specific destination table named 'xyz' already exists in the dataset.
If it exists, I would like to re-run step 2.
If it does not exist, I would like to do foo.
How can I do this?
Thanks in advance.
Here is a Python snippet that will tell whether a table exists (deleting it in the process, so be careful!):

```python
def doesTableExist(project_id, dataset_id, table_id):
    bq.tables().delete(
        projectId=project_id,
        datasetId=dataset_id,
        tableId=table_id).execute()
    return False
```
Alternatively, if you'd prefer not to delete the table in the process, you could try:

```python
def doesTableExist(project_id, dataset_id, table_id):
    try:
        bq.tables().get(
            projectId=project_id,
            datasetId=dataset_id,
            tableId=table_id).execute()
        return True
    except HttpError as err:
        if err.resp.status != 404:
            raise
        return False
```
If you want to know where bq came from, you can call build_bq_client from here: http://code.google.com/p/bigquery-e2e/source/browse/samples/ch12/auth.py
In general, if you're using this to test whether you should run a job that will modify the table, it can be a good idea to just run the job anyway and use WRITE_TRUNCATE as the write disposition.
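As a sketch of that idea, here is what the retry-safe job configuration could look like as a jobs.insert request body for the BigQuery REST API (the project, dataset, table, and query here are placeholders, not values from the question):

```python
# Sketch of a jobs.insert request body that overwrites the destination table
# on every run, so retrying the query step is safe without an existence check.
# All names and the query string are placeholders.
job_body = {
    "configuration": {
        "query": {
            "query": "SELECT * FROM my_dataset.my_source_table",
            "allowLargeResults": True,
            "destinationTable": {
                "projectId": "my_project",
                "datasetId": "my_dataset",
                "tableId": "xyz",
            },
            # WRITE_TRUNCATE replaces any existing contents of the
            # destination table, so a re-run never needs to know whether
            # the table already exists.
            "writeDisposition": "WRITE_TRUNCATE",
        }
    }
}
```

You would pass this body to `bq.jobs().insert(projectId=..., body=job_body).execute()` in the same style as the snippets above.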
Another approach is to create a predictable job id and retry the job with that id. If the job already exists, then it has already run (though you may want to double-check that it didn't fail).
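A minimal sketch of the predictable-job-id idea (the id scheme below is made up for illustration; any deterministic function of the job's inputs works):

```python
import hashlib

def predictable_job_id(project_id, dataset_id, table_id, run_date):
    # Same inputs -> same job id, so a retried jobs.insert with this id is
    # rejected as a duplicate instead of running the query a second time.
    raw = "export-{}-{}-{}-{}".format(project_id, dataset_id, table_id, run_date)
    return "job_" + hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
```

On retry, you submit the job with the same id and, if the insert fails because the job exists, look up the existing job's status instead.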
Alex F's solution works on v0.27, but will not work on later versions. In order to migrate to v0.28+, the below solution will work.
```python
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

project_nm = 'gc_project_nm'
dataset_nm = 'ds_nm'
table_nm = 'tbl_nm'

client = bigquery.Client(project_nm)
dataset = client.dataset(dataset_nm)
table_ref = dataset.table(table_nm)

def if_tbl_exists(client, table_ref):
    try:
        client.get_table(table_ref)
        return True
    except NotFound:
        return False

if_tbl_exists(client, table_ref)
```
```python
def doesTableExist(bigquery, project_id, dataset_id, table_id):
    try:
        bigquery.tables().get(
            projectId=project_id,
            datasetId=dataset_id,
            tableId=table_id).execute()
        return True
    except Exception as err:
        if err.resp.status != 404:
            raise
        return False
```

The difference from the earlier answer is in the exception handling.
With my_bigquery being an instance of class google.cloud.bigquery.Client (already authenticated and associated with a project):

```python
my_bigquery.dataset(dataset_name).table(table_name).exists()  # returns boolean
```

It makes an API call to test for the existence of the table, via a GET request.
It works for me using version 0.27 of the Google BigQuery Python module.
Inline SQL Alternative
tarheel's answer is probably the most correct at this point in time, but I was considering the comment from Ivan above that "404 could also mean the resource is not there for a bunch of reasons", so here is a solution that should always successfully run a metadata query and return a result.
It's not the fastest approach, because it always has to run a query, and BigQuery has overhead for small queries.
A trick I've seen previously is to query information_schema for a (table) object and union that with a fake query that ensures a record is always returned, even if the object doesn't exist. There's also a LIMIT 1 and an ordering to ensure that the single record returned represents the table, if it does exist. See the SQL in the code below.
- In spite of doc claims that BigQuery standard SQL is ISO compliant, it doesn't support information_schema, but it does have __TABLES_SUMMARY__
- dataset is required because you can't query __TABLES_SUMMARY__ without specifying a dataset
- dataset is not a parameter in the SQL because you can't parameterize object names without SQL injection issues (apart from the magical _TABLE_SUFFIX, see https://cloud.google.com/bigquery/docs/querying-wildcard-tables )
```python
#!/usr/bin/env python
"""
Inline SQL way to check a table exists in Bigquery
e.g.
print(table_exists(dataset_name='<dataset_goes_here>', table_name='<real_table_name>'))
True
print(table_exists(dataset_name='<dataset_goes_here>', table_name='imaginary_table_name'))
False
"""
from __future__ import print_function
from google.cloud import bigquery


def table_exists(dataset_name, table_name):
    client = bigquery.Client()
    query = """
        SELECT table_exists
        FROM (
            SELECT true as table_exists, 1 as ordering
            FROM __TABLES_SUMMARY__
            WHERE table_id = @table_name
            UNION ALL
            SELECT false as table_exists, 2 as ordering
        )
        ORDER BY ordering
        LIMIT 1"""
    query_params = [bigquery.ScalarQueryParameter('table_name', 'STRING', table_name)]
    job_config = bigquery.QueryJobConfig()
    job_config.query_parameters = query_params
    if dataset_name is not None:
        dataset_ref = client.dataset(dataset_name)
        job_config.default_dataset = dataset_ref
    query_job = client.query(query, job_config=job_config)
    results = query_job.result()
    for row in results:
        # There is only one row because of LIMIT 1 in the SQL
        return row.table_exists
```