Regular expression to find all table names in a query

I am not that hot at regular expressions and it has made my little mind melt some what.

I am trying to find all the tables names in a query. So say I have the query:

SELECT one, two, three FROM table1, table2 WHERE X=Y

I would like to pull out "table1, table2" or "table1" and "table2"

But what if there is no where statement. It could be the end of the file, or there could be a group by or an order by etc. I know "most" of the time this will not be an issue but I don't like the idea of coding for "most" situations and knowing I have left a hole that could cause things to go wrong later.

Is this a doable Regex expression? Am I being a Regex pleb?

(P.S. this will be done in C# but presume that doesn't matter much).

Answers


RegEx isn't very good at this, as it's a lot more complicated than it appears:

  • What if they use LEFT/RIGHT INNER/OUTER/CROSS/MERGE/NATURAL joins instead of the a,b syntax? The a,b syntax should be avoided anyway.
  • What about nested queries?
  • What if there is no table (selecting a constant)
  • What about line breaks and other whitespace formatting?
  • Alias names?

I could go on.

What you can do is look for an sql parser, and run your query through that.


Everything said about the usefulness of such a regex in the SQL context. If you insist on a regex and your SQL statements always look like the one you showed (that means no subqueries, joins, and so on), you could use

FROM\s+([^ ,]+)(?:\s*,\s*([^ ,]+))*\s+ 

I'm pretty late to the party however I thought I would share a regex I am currently using to analyse all our database objects and I disagree with the sentiment that it is not possible to do this using one.

The regex has a few assumptions

1) You are not using the A,B join syntax style

2) Whatever regex parser you are using supports ignore case.

3) You're analyzing, selects, joins, updates, deletes and truncates. It doesn't support the aforementioned MERGE/NATURAL because we don't use them, however I'm sure further support wouldn't be difficult to add.

I am keen to know what type of transaction the table is part of so I have included Named Capture groups to tell me.

Now I've not used regex for a long time so there are probably improvements that can be made however so far in all my testing this is accurate.

\bjoin\s+(?<Retrieve>[a-zA-Z\._\d]+)\b|\bfrom\s+(?<Retrieve>[a-zA-Z\._\d]+)\b|\bupdate\s+(?<Update>[a-zA-Z\._\d]+)\b|\binsert\s+(?:\binto\b)?\s+(?<Insert>[a-zA-Z\._\d]+)\b|\btruncate\s+table\s+(?<Delete>[a-zA-Z\._\d]+)\b|\bdelete\s+(?:\bfrom\b)?\s+(?<Delete>[a-zA-Z\._\d]+)\b

I found this site that has a GREAT parser!

http://www.sqlparser.com/

well worth it. Works a treat.


It's definitely not easy.

Consider subqueries.

select
  *
from
  A
  join (
    select
       top 5 *
    from
      B)
    on B.ID = A.ID
where
  A.ID in (
    select
      ID
    from
      C
    where C.DOB = A.DOB)

There are three tables used in this query.


I think it would be easier to tokenize the string and look for SQL keywords that could bound the table names. You know the names will follow FROM, but they could be followed by WHERE, GROUP BY, HAVING, or no keyword at all if they're at the end of the query.


One workaround is to implement a naming convention on tables and views. Then the SQL statement can be parsed on the naming prefix.

For example:

SELECT tbltable1.one, tbltable1.two, tbltable2.three
FROM tbltable1
    INNER JOIN  tbltable2
        ON tbltable1.one = tbltable2.three

Split whitespace to array:

("SELECT","tbltable1.one,","tbltable1.two,","tbltable2.three","FROM","tbltable1","INNER","JOIN","tbltable2","ON","tbltable1.one","=","tbltable2.three")

Get left of elements to period:

("SELECT","tbltable1","tbltable1","tbltable2","FROM","tbltable1","INNER","JOIN","tbltable2","ON","tbltable1","=","tbltable2")

Remove elements with symbols:

("SELECT","tbltable1","tbltable1","tbltable2","FROM","tbltable1","INNER","JOIN","tbltable2","ON","tbltable1","tbltable2")

Reduce to unique values:

("SELECT","tbltable1","tbltable2","FROM","INNER","JOIN","ON")

Filter on Left 3 characters = "tbl"

("tbltable1","tbltable2")


Constructing a regular expression is going to be the least of your problems. Depending on the flavor of SQL you expect to support with this code, the number of ways you can reference a table in a SQL statement is staggering.

PLUS, if the query includes a reference to a view or UDF, the information about what underlying tables won't even be in the string at all making it completely impractical to get that information by parsing it. Also, you'd need to be smart about detecting temporary tables and excluding them from your results.

If you must do this, a better approach would be to make use of the APIs to the particular database engine that the SQL was intended for. For example you could create a view based on the query and then use the DB Server api to detect dependencies for that view. The DB engine is going to be able to parse it much more reliably than you ever will without an enormous effort to reverse engineer the query engine.

If, by chance, you are working with SQL Server, here is an article about detecting dependencies on that platform: Finding Dependencies in SQL Server 2005


This will pull out a table name on an insert Into query:

(?<=(INTO)\s)[^\s]*(?=\(())

The Following will do the same but with a select including joins

(?<=(from|join)\s)[^\s]*(?=\s(on|join|where))

Finally going back to an insert if you want to return just the values that are held in an insert query use the following Regex

(?i)(?<=VALUES[ ]*\().*(?=\))

I know this is an old thread but it may assist someone else looking around

Enjoy


I tried all the above but none worked since I use a wide variety of queries. I'm working with PHP though and used a PEAR library called SQL_Parser, but hope my solution helps. Also, I was having trouble with apostrophes and MySQL reserved sencences so I decided to strip off all the fields section from the query before parsing it.

function getQueryTable ($query) {
    require_once "SQL/Parser.php";
    $parser = new SQL_Parser();
    $parser->setDialect('MySQL');

    // Stripping fields section
    $queryType = substr(strtoupper($query),0,6);            
    if($queryType == 'SELECT') { $query  = "SELECT * ".stristr($query, "FROM"); }
    if ($havingPos = stripos($query, 'HAVING')) { $query = substr($query, 0, $havingPos); }


    $struct = $parser->parse($query);

    $tableReferences = $struct[0]['from']['table_references']['table_factors'];

    foreach ((Array) $tableReferences as $ref) {
        $tables[] = ($ref['database'] ? $ref['database'].'.' : $ref['database']).$ref['table'];
    }

    return $tables;

}

In PHP, I use this function, it returns an array with the table names used in a sql statement:

function sql_query_get_tables($statement){
    preg_match_all("/(from|into|update|join) [\\'\\´]?([a-zA-Z0-9_-]+)[\\'\\´]?/i",
            $statement, $matches);
    if(!empty($matches)){
        return array_unique($matches[2]);
    }else return array();
}

Notice that it does not work with a,b joins or schema.tablename naming

I hope it works for you


I used this code as an Excel macro to parse the select and extract table names.

My parsing assumes that the syntax select from a , b , c is not used.

Just run it against your SQL query and if you are not satisfied with the result you should be only a few lines of codes away from the result you expect. Just debug and modify the code accordingly.

Sub get_tables()
    sql_query = Cells(5, 1).Value
    tables = ""

    'get all tables after from
    sql_from = sql_query

    While InStr(1, UCase(sql_from), UCase("from")) > 0

        i = InStr(1, UCase(sql_from), UCase("from"))
        sql_from = Mid(sql_from, i + 5, Len(sql_from) - i - 5)
        i = InStr(1, UCase(sql_from), UCase(" "))

        While i = 1

            sql_from = Mid(sql_from, 2, Len(sql_from) - 1)
            i = InStr(1, UCase(sql_from), UCase(" "))

        end

        i = InStr(1, sql_join, Chr(9))

        While i = 1

            sql_join = Mid(sql_join, 2, Len(sql_join) - 1)
            i = InStr(1, sql_join, Chr(9))

        end

        a = InStr(1, UCase(sql_from), UCase(" "))
        b = InStr(1, sql_from, Chr(10))
        c = InStr(1, sql_from, Chr(13))
        d = InStr(1, sql_from, Chr(9))

        MinC = a

        If MinC > b And b > 0 Then MinC = b
        If MinC > c And c > 0 Then MinC = c
        If MinC > d And d > 0 Then MinC = d

        tables = tables + "[" + Mid(sql_from, 1, MinC - 1) + "]"

    end

    'get all tables after join
    sql_join = sql_query

    While InStr(1, UCase(sql_join), UCase("join")) > 0

        i = InStr(1, UCase(sql_join), UCase("join"))
        sql_join = Mid(sql_join, i + 5, Len(sql_join) - i - 5)
        i = InStr(1, UCase(sql_join), UCase(" "))

        While i = 1

            sql_join = Mid(sql_join, 2, Len(sql_join) - 1)
            i = InStr(1, UCase(sql_join), UCase(" "))

        end

        i = InStr(1, sql_join, Chr(9))

        While i = 1

            sql_join = Mid(sql_join, 2, Len(sql_join) - 1)
            i = InStr(1, sql_join, Chr(9))

        end

        a = InStr(1, UCase(sql_join), UCase(" "))
        b = InStr(1, sql_join, Chr(10))
        c = InStr(1, sql_join, Chr(13))
        d = InStr(1, sql_join, Chr(9))

        MinC = a

        If MinC > b And b > 0 Then MinC = b
        If MinC > c And c > 0 Then MinC = c
        If MinC > d And d > 0 Then MinC = d

        tables = tables + "[" + Mid(sql_join, 1, MinC - 1) + "]"

    end

    tables = Replace(tables, ")", "")
    tables = Replace(tables, "(", "")
    tables = Replace(tables, " ", "")
    tables = Replace(tables, Chr(10), "")
    tables = Replace(tables, Chr(13), "")
    tables = Replace(tables, Chr(9), "")
    tables = Replace(tables, "[]", "")

End Sub

Need Your Help

php, user-uploaded files, version control, and website deployment

php svn version-control architecture deployment

I have a website that I regularly update the code to. I keep it in version control. When I want to deploy a new version of the site, I do an export and then symlink the served directory name to the

Programmatically setting Absolute URL inside image controls for .NET Facebook Application

.net url facebook asp.net absolute

The goal is to allow a function to be called from within an ASPX imagebutton control that is used as part of a form. The need for an absolute URL is that Facebook Canvas applications require them. ...