Need approach for working with small subsets of a large dataset

I am facing a conceptual problem that I am having a hard time overcoming. I am hoping the SO folks can help me overcome it with a nudge in the right direction.

I am in the process of doing some ETL work with the source data being very similar and very large. I am loading it into a table that is intended for replication and I only want the most basic of information in this target table.

My source table looks something like this:

I need my target table to reflect it as such:

As you can see I didn't duplicate the InTransit status where it was duplicated in the source table. The steps I am trying to figure out how to achieve are

  1. Get any new distinct rows entered since the last time the query ran. (Easy)
  2. For each TrackingId I need to check if each new status is already the most recent status in the target and if so disregard otherwise go ahead and insert it. Which this means I have to also start at the earliest of the new statuses and go from there. (I have no *(!#in clue how I'll do this)
  3. Do this every 15 minutes so that statuses are kept very recent so step #2 must be performant.

My source table could easily consist of 100k+ rows but having the need to run this every 15 minutes requires me to make sure this is very performant thus why I am really trying to avoid cursors.

Right now the only way I can see to do this is using a CLR sproc but I think there may be better ways thus I am hoping you guys can nudge me in the right direction.

I am sure I am probably leaving something out that you may need so please let me know what info you may need and I'll happily provide.

Thank you in advance!

EDIT: Ok I wasn't explicit enough in my question. My source table is going to contain multiple tracking Ids. It may be up to 100k+ rows containing mulitple TrackingId's and multiple statuses for each trackingId. I have to update the target table as above for each individual tracking Id but my source will be an amalgam of trackingId's.

Answers


Here you go. I'll let you clean it up and do optimizations. one of the sub queries can go into a view and the messy date comparison can be cleaned up. If you're using SQL 2008 R2 then use CAST as DATE instead.

    declare @tbl1 table(
id int, Trackingid int, Status varchar(50), StatusDate datetime
)

declare @tbl2 table(
id int, Trackingid int, Status varchar(50), StatusDate datetime
)

----Source data
insert into @tbl1 (id, trackingid, status, statusdate) values(1,1,'PickedUp','10/01/10  1:00') --
insert into @tbl1 (id, trackingid, status, statusdate) values(2,1,'InTransit','10/02/10 1:00') --
insert into @tbl1 (id, trackingid, status, statusdate) values(8,1,'InTransit','10/02/10  3:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(4,1,'Delayed','10/03/10 1:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(5,1,'InTransit','10/03/10 1:01')
insert into @tbl1 (id, trackingid, status, statusdate) values(6,1,'AtDest','10/03/10 2:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(7,1,'Deliv','10/03/10 3:00') --
insert into @tbl1 (id, trackingid, status, statusdate) values(3,2,'InTransit','10/03/10 1:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(9,2,'AtDest','10/04/10 1:00')
insert into @tbl1 (id, trackingid, status, statusdate) values(10,2,'Deliv','10/04/10 1:05')
insert into @tbl1 (id, trackingid, status, statusdate) values(11,1,'Delayed','10/02/10 2:05')

----Target data
insert into @tbl2 (id, trackingid, status, statusdate) values(1,1,'PickedUp','10/01/10  1:00')
insert into @tbl2 (id, trackingid, status, statusdate) values(2,1,'InTransit','10/02/10 1:00')
insert into @tbl2 (id, trackingid, status, statusdate) values(3,1,'Deliv','10/03/10 3:00')


select d.* from
(
    select 
    * ,
    ROW_NUMBER() OVER(PARTITION BY trackingid, CAST((STR( YEAR( statusdate ) ) + '/' +STR( MONTH(statusdate ) ) + '/' +STR( DAY( statusdate ) )) AS DATETIME) ORDER BY statusdate) AS 'RN'
    from @tbl1
) d

where 
not exists
(
    select RN from
    (
        select 
        * ,
        ROW_NUMBER() OVER(PARTITION BY trackingid, CAST((STR( YEAR( statusdate ) ) + '/' +STR( MONTH(statusdate ) ) + '/' +STR( DAY( statusdate ) )) AS DATETIME) ORDER BY statusdate) AS 'RN'
        from @tbl1
    )f where f.RN = d.RN + 1 and d.status = f.status and f.trackingid = d.trackingid and 
    CAST((STR( YEAR( f.statusdate ) ) + '/' +STR( MONTH(f.statusdate ) ) + '/' +STR( DAY( f.statusdate ) )) AS DATETIME) =
            CAST((STR( YEAR( d.statusdate ) ) + '/' +STR( MONTH(d.statusdate ) ) + '/' +STR( DAY( d.statusdate ) )) AS DATETIME)
)

and
not exists 
(
    select 1 from @tbl2 t2
    where (t2.trackingid = d.trackingid
    and t2.statusdate = d.statusdate
    and t2.status = d.status)
)
and (
    not exists
    (
        select 1 from
        (
            select top 1 * from @tbl2 t2 
            where t2.trackingid = d.trackingid
            order by t2.statusdate desc
        ) g
        where g.status = d.status
    )
    or not exists
    (
        select 1 from
        (
            select top 1 * from @tbl2 t2 
            where t2.trackingid = d.trackingid
            and t2.statusdate <= d.statusdate
            order by t2.statusdate desc
        ) g
        where g.status = d.status
    )
)
order by trackingid,statusdate

Here's a solution without self-joins:

WITH    q AS
        (
        SELECT  *,
                ROW_NUMBER() OVER (ORDER BY statusDate) AS rn,
                ROW_NUMBER() OVER (PARTITION BY status ORDER BY statusDate) AS rns
        FROM    tracking
        WHERE   tackingId = @id
        ),
        qs AS
        (
        SELECT  *,
                ROW_NUMBER() OVER (PARTITION BY rn - rns ORDER BY statusDate) AS rnn
        FROM    q
        )
SELECT  *
FROM    qs
WHERE   rnn = 1
ORDER BY
        statusDate

Here's a script to check:

DECLARE @tracking TABLE
        (
        id INT NOT NULL PRIMARY KEY,
        trackingId INT NOT NULL,
        status INT,
        statusDate DATETIME
        )

INSERT
INTO    @tracking
SELECT  1, 1, 1, DATEADD(d, 1, '2010-01-01')
UNION ALL
SELECT  2, 1, 2, DATEADD(d, 2, '2010-01-01')
UNION ALL
SELECT  3, 1, 2, DATEADD(d, 3, '2010-01-01')
UNION ALL
SELECT  4, 1, 2, DATEADD(d, 4, '2010-01-01')
UNION ALL
SELECT  5, 1, 3, DATEADD(d, 5, '2010-01-01')
UNION ALL
SELECT  6, 1, 3, DATEADD(d, 6, '2010-01-01')
UNION ALL
SELECT  7, 1, 4, DATEADD(d, 7, '2010-01-01')
UNION ALL
SELECT  8, 1, 2, DATEADD(d, 8, '2010-01-01')
UNION ALL
SELECT  9, 1, 2, DATEADD(d, 9, '2010-01-01')
UNION ALL
SELECT  10, 1, 1, DATEADD(d, 10, '2010-01-01')
;
WITH    q AS
        (
        SELECT  *,
                ROW_NUMBER() OVER (ORDER BY statusDate) AS rn,
                ROW_NUMBER() OVER (PARTITION BY status ORDER BY statusDate) AS rns
        FROM    @tracking
        ),
        qs AS
        (
        SELECT  *,
                ROW_NUMBER() OVER (PARTITION BY rn - rns ORDER BY statusDate) AS rnn
        FROM    q
        )
SELECT  *
FROM    qs
WHERE   rnn = 1
ORDER BY
        statusDate

How well this performs will depend on indexes, and particularly if you are targeting a single TrackingID at a time, but this is one way to use a CTE and self-join to obtain the desired results:

CREATE TABLE #foo
(
    TrackingID INT,
    [Status] VARCHAR(32),
    StatusDate SMALLDATETIME
);

INSERT #foo SELECT 1, 'PickedUp',  '2010-10-01 08:15';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 03:07';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 10:28';
INSERT #foo SELECT 1, 'Delayed',   '2010-10-03 09:52';
INSERT #foo SELECT 1, 'InTransit', '2010-10-03 20:09';
INSERT #foo SELECT 1, 'AtDest',    '2010-10-04 13:42';
INSERT #foo SELECT 1, 'Deliv',     '2010-10-04 17:05';

WITH src AS
(
    SELECT 
        TrackingID,
        [Status],
        StatusDate, 
        ab = ROW_NUMBER() OVER (ORDER BY [StatusDate])
    FROM #foo
    WHERE TrackingID = 1
),
realsrc AS
(
    SELECT 
        a.TrackingID,
        leftrow         = a.ab,
        rightrow        = b.ab,
        leftstatus      = a.[Status],
        leftstatusdate  = a.StatusDate,
        rightstatus     = b.[Status],
        rightstatusdate = b.StatusDate 
    FROM src AS a
    LEFT OUTER JOIN src AS b
    ON a.ab = b.ab - 1
)
SELECT 
    Id = ROW_NUMBER() OVER (ORDER BY [leftstatusdate]),
    TrackingID,
    [Status] = leftstatus,
    [StatusDate] = leftstatusdate
FROM
    realsrc
WHERE
    rightrow IS NULL
    OR (leftrow = rightrow - 1 AND leftstatus <> rightstatus)
ORDER BY 
    [StatusDate];
GO
DROP TABLE #foo;

If you need to support multiple TrackingIDs in the same query:

CREATE TABLE #foo
(
    TrackingID INT,
    [Status] VARCHAR(32),
    StatusDate SMALLDATETIME
);

INSERT #foo SELECT 1, 'PickedUp',  '2010-10-01 08:15';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 03:07';
INSERT #foo SELECT 1, 'InTransit', '2010-10-02 10:28';
INSERT #foo SELECT 1, 'Delayed',   '2010-10-03 09:52';
INSERT #foo SELECT 1, 'InTransit', '2010-10-03 20:09';
INSERT #foo SELECT 1, 'AtDest',    '2010-10-04 13:42';
INSERT #foo SELECT 1, 'Deliv',     '2010-10-04 17:05';
INSERT #foo SELECT 2, 'InTransit', '2010-10-02 10:28';
INSERT #foo SELECT 2, 'Delayed',   '2010-10-03 09:52';
INSERT #foo SELECT 2, 'InTransit', '2010-10-03 20:09';
INSERT #foo SELECT 2, 'AtDest',    '2010-10-04 13:42';

WITH src AS
(
    SELECT 
        TrackingID,
        [Status],
        StatusDate, 
        ab = ROW_NUMBER() OVER (ORDER BY [StatusDate])
    FROM #foo
),
realsrc AS
(
    SELECT 
        a.TrackingID,
        leftrow         = a.ab,
        rightrow        = b.ab,
        leftstatus      = a.[Status],
        leftstatusdate  = a.StatusDate,
        rightstatus     = b.[Status],
        rightstatusdate = b.StatusDate 
    FROM src AS a
    LEFT OUTER JOIN src AS b
    ON a.ab = b.ab - 1
    AND a.TrackingID = b.TrackingID
)
SELECT 
    Id = ROW_NUMBER() OVER (ORDER BY TrackingID, [leftstatusdate]),
    TrackingID,
    [Status] = leftstatus,
    [StatusDate] = leftstatusdate
FROM
    realsrc
WHERE
    rightrow IS NULL
    OR (leftrow = rightrow - 1 AND leftstatus <> rightstatus)
ORDER BY 
    TrackingID, 
    [StatusDate];
GO
DROP TABLE #foo;

If this is SQL 2005 then you can use ROW_NUMBER with a sub query or CTE: If the dataset is really huge though and performance is an issue then one of the above that got pasted while I was trying to get the code block to work could well be more efficient.

/**
*  This is just to create a sample table to use in the test query
**/

DECLARE @test TABLE(ID INT, TrackingID INT, Status VARCHAR(20), StatusDate DATETIME)
INSERT    @test
SELECT    1,1,'PickedUp', '01 jan 2010 08:00' UNION
SELECT    2,1,'InTransit', '01 jan 2010 08:01' UNION
SELECT    3,1,'InTransit', '01 jan 2010 08:02' UNION
SELECT    4,1,'Delayed', '01 jan 2010 08:03' UNION
SELECT    5,1,'InTransit', '01 jan 2010 08:04' UNION
SELECT    6,1,'AtDest', '01 jan 2010 08:05' UNION
SELECT    7,1,'Deliv', '01 jan 2010 08:06'


/**
*  This would be the select code to exclude the duplicate entries. 
*  Sorting desc in row_number would get latest instead of first
**/
;WITH n AS
(
    SELECT    ID,
            TrackingID,
            Status,
            StatusDate,
            --For each Status for a tracking ID number by ID (could use date but 2 may be the same)
            ROW_NUMBER() OVER(PARTITION BY TrackingID, Status ORDER BY ID) AS [StatusNumber]
    FROM    @test
)
SELECT    ID,
        TrackingID,
        Status,
        StatusDate
FROM    n
WHERE    StatusNumber = 1
ORDER    BY ID

I think this example will do what you're looking for:

CREATE TABLE dbo.srcStatus (
 Id INT IDENTITY(1,1),
 TrackingId INT NOT NULL,
 [Status] VARCHAR(10) NOT NULL,
 StatusDate DATETIME NOT NULL
);

CREATE TABLE dbo.tgtStatus (
 Id INT IDENTITY(1,1),
 TrackingId INT NOT NULL,
 [Status] VARCHAR(10) NOT NULL,
 StatusDate DATETIME NOT NULL
);

INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'PickedUp','10/1/2010 8:15 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'InTransit','10/2/2010 3:07 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'InTransit','10/2/2010 10:28 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'PickedUp','10/1/2010 8:15 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'InTransit','10/2/2010 3:07 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'Delayed','10/2/2010 10:28 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'Delayed','10/3/2010 9:52 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'InTransit','10/3/2010 8:09 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'AtDest','10/4/2010 1:42 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 1,'Deliv','10/4/2010 5:05 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'InTransit','10/3/2010 9:52 AM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'InTransit','10/3/2010 8:09 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'AtDest','10/4/2010 1:42 PM');
INSERT INTO dbo.srcStatus ( TrackingId, [Status], StatusDate ) VALUES  ( 2,'Deliv','10/4/2010 5:05 PM');

WITH    cteSrcTrackingIds
          AS ( SELECT DISTINCT
                        TrackingId
               FROM     dbo.srcStatus
             ),
        cteAllTrackingIds
          AS ( SELECT   TrackingId ,
                        [Status] ,
                        StatusDate
               FROM     dbo.srcStatus
               UNION
               SELECT   tgtStatus.TrackingId ,
                        tgtStatuS.[Status] ,
                        tgtStatus.StatusDate
               FROM     cteSrcTrackingIds
                        INNER JOIN dbo.tgtStatus ON cteSrcTrackingIds.TrackingId = tgtStatus.TrackingId
             ),
        cteAllTrackingIdsWithRownums
          AS ( SELECT   TrackingId ,
                        [Status] ,
                        StatusDate ,
                        ROW_NUMBER() OVER ( PARTITION BY TrackingId ORDER BY StatusDate ) AS rownum
               FROM     cteAllTrackingIds
             ),
        cteTrackingIdsWorkingSet
          AS ( SELECT   src.rownum AS [id] ,
                        src2.rownum AS [id2] ,
                        src.TrackingId ,
                        src.[Status] ,
                        src.StatusDate ,
                        ROW_NUMBER() OVER ( PARTITION BY src.TrackingId,
                                            src.rownum ORDER BY src.StatusDate ) AS rownum
               FROM     cteAllTrackingIdsWithRownums AS [src]
                        LEFT OUTER JOIN cteAllTrackingIdsWithRownums AS [src2] ON src.TrackingId = src2.TrackingId
                                                              AND src.rownum < src2.rownum
                                                              AND src.[Status] != src2.[Status]
             ),
        cteTrackingIdsSubset
          AS ( SELECT   id ,
                        TrackingId ,
                        [Status] ,
                        StatusDate ,
                        ROW_NUMBER() OVER ( PARTITION BY TrackingId, id2 ORDER BY id ) AS rownum
               FROM     cteTrackingIdsWorkingSet
               WHERE    rownum = 1
             )
    INSERT  INTO dbo.tgtStatus
            ( TrackingId ,
              [status] ,
              StatusDate
            )
            SELECT  cteTrackingIdsSubset.TrackingId ,
                    cteTrackingIdsSubset.[status] ,
                    cteTrackingIdsSubset.StatusDate
            FROM    cteTrackingIdsSubset
                    LEFT OUTER JOIN dbo.tgtStatus ON cteTrackingIdsSubset.TrackingId = tgtStatus.TrackingId
                                                     AND cteTrackingIdsSubset.[status] = tgtStatus.[status]
                                                     AND cteTrackingIdsSubset.StatusDate = tgtStatus.StatusDate
            WHERE   cteTrackingIdsSubset.rownum = 1
                    AND tgtStatus.id IS NULL
            ORDER BY cteTrackingIdsSubset.TrackingId ,
                    cteTrackingIdsSubset.StatusDate;

Need Your Help

How To Get Started Working with the Google AdWords API?

seo google-adwords

I'm a student working on a project related to SEO; I need to write an optimization tool, so I'm writing an application using C# and Windows Forms. The main part of the application is for keyword

jquery dropdown menu cant figure it out

javascript jquery

I cant figure out what is wrong in my code im trying to make a jquery dropdown menu. I have enclosed my html, css and javascript. please take a look and help me out, thank you.