I need help with a query

**ipisors** · 12-20-2015, 03:05 PM

This is a simple-sounding question that escapes me. I am well aware of ways that this can be done using VBA, loops, Dlookups, or dynamic SQL. However, all of those solutions' (at least all I have seen or tested), are horribly slow, and will not be acceptable for my client. (hours of processing). Thus I am looking for a set-based way to query this. This is one of those days I wish to God my client had sql server, as this would be relatively easy, but I have to do this in MS Access.

One, single table: TBL_STAGING. Only two columns are involved with this question. col1 and col7. col1 is text, like this: YYYYMMDD (except it's text, not date). col7 is the name of a stock, like AAPL. Here's what I need to identify: Any stocks (col7) where there is a value in col1 anywhere in the table, but, that value does not exist for col7.

For example: I want to return AAPL if, somewhere in tbl_staging, col1 has a value of 20101215, but not on AAPL records.

That's it! I would also be content, a little less but still content, if I could just take the 20 greatest-sized records from col1 and evaluate those. (In other words, if there are any col7 (stock) where for that particular stock, does not exist any one of the TOP 20 of tbl_staging.col1, ordered desc.

For those about to suggest anything known to be inefficient, let me just say that:

There are 5 million records in TBL_STAGING
There are 3,000 distinct col7 values
There are 1,750 distinct col1 values
TBL_STAGING has no PK nor indexed column. (I realize I would be better off, during querying activities, with a PK, however, in this scenario the app is constantly truncating the table and inserting all 5 million-ish records over again. So I faced the classic dilemma of I could either have a PK and endure horribly slow inserts but faster querying, OR, I could not have a PK and see faster inserts but slower querying. I have never clearly seen one to be better than the other, and in this case I left off the PK so that the 'insert' process would not be so slow).

...You can see why my loops-within-loops have all turned out to be epic failures, time-wise.

If the only solution involves a correlated subquery, then I will accept that, but I'm wondering if there's a better way. Still, if you think the correlated subquery is the best/only way, please suggest some code--I'm experiencing writer's block here, I don't know why.
Any help appreciated. Thanks.

**June7** · 12-20-2015, 03:17 PM

An autonumber field can be used to generate a unique identifier. It does not have to be defined as PK. And if this value is not used as FK anywhere, it doesn't matter that the sequence changes due to deleting and reinserting records. Run Compact & Repair after deleting all records and the seed value should be reset to 1. Does having autonumber field really slow down insert action?

5 million records - what is the db size, anywhere close to 2GB?

**CJ_London** · 12-20-2015, 03:25 PM

I wish to God my client had sql server, as this would be relatively easy, but I have to do this in MS Access.

in principle looks like a simple query requirement easily done in access

it would help if you gave a simple example of what you require since your method seems hopelessly convoluted - how would you achieve this in sql server

In access you would simply have

Code:

SELECT distinct col7
FROM TblStaging as A Left join (select col7 from tblstaging where b.col1="20101215") as B on a.col7=b.col7
WHERE B.col7 is null

also to ensure you have both columns indexed

**ipisors** · 12-20-2015, 03:29 PM

Originally Posted by June7

An autonumber field can be used to generate a unique identifier. It does not have to be defined as PK. And if this value is not used as FK anywhere, it doesn't matter that the sequence changes due to deleting and reinserting records. Run Compact & Repair after deleting all records and the seed value should be reset to 1. Does having autonumber field really slow down insert action?

5 million records - what is the db size, anywhere close to 2GB?

June7:
I agree about an autonumber field being used as a unique identifier, and I agree that the actual value of a unique identifier theoretically should never be used to embed business meaning in, nor should the sequence be even noted or relied upon.
However, in answer to your question - Yes, it is known and I have experienced to tell you that having any indexed field in a database - Access or an industrial RDBMS - very significantly slows down inserts or updates. It is a direct tradeoff. This application already has to loop through a folder on the client's PC and insert records from 3,000 text files from a stock provider. Inserting them all in TBL_STAGING takes long enough - it would be almost intolerable were there an indexed field in TBL_STAGING.

I compact it on each close, the size is 1/3 gb

**CJ_London** · 12-20-2015, 03:39 PM

if you don't have indexing the db can only do a sequential search which will take forever with this number of record.

If you are concerned about time to update, personally uploading 3m+ records does take longer, but not that much longer. However if it is a concern, removing indexing from the table, import your records and then reinstate indexing - it will only take a few seconds

**ipisors** · 12-20-2015, 03:40 PM

Ajax, thanks for your response. I'll respond to each of your comments.
"in principle looks like a simple query requirement easily done in access" - that's great news for me, I look forward to the solution.

As for hopelessly convoluted, I'm not sure what you mean. I don't even HAVE a method yet, so what's hopelessly convoluted? The requirement is actually really simple - I just can't think of how to accomplish it in set-based processing.

Your code: Looks great, but how are you going to replace "20101215" 1,700 times and get new results? That sounds like the looping I was wanting to avoid.

Your code only works for one date. I need to find that result for all of the distinct values - not just 20101215.

I agree on that method, if only your code recursively put out results for not just 20101215, but the other 1,699 distinct values as well.

As for my comment about SQL server? Well because of the processing power, I actually might just choose a looping-based solution, but 5 million records with some loops is not much for sql server, and the processing time would probably be very acceptable to my client - a few minutes, perhaps. Probably a cursor, something like this (forgive syntax errors, I'm not in front of any IDE and this is air code) -

Code:

declare @c cursor
declare @thedate varchar(8)

set @c = cursor for (select distinct col1 from tbl_staging)
select @c into @thedate

while @fetchstatus=0
    begin
        --Here is where a query such as you have proposed would go, if not exist, insert record into errors table, or whatever
    end

close @c
deallocate @c

**ipisors** · 12-20-2015, 03:41 PM

Ok, I'm willing to try the indexing thing you have mentioned.

however, I still haven't seen a set-based solution for my original question. Your code isn't a solution for all distinct values, only the one you hardcoded.

**CJ_London** · 12-20-2015, 05:37 PM

but how are you going to replace "20101215" 1,700 times and get new results?

As mentioned, your description is not clear and I suggested you provide some sample data to demonstrate what you are trying to achieve

you said

I want to return AAPL if, somewhere in tbl_staging, col1 has a value of 20101215, but not on AAPL records.

please expand on this

**ipisors** · 12-20-2015, 06:16 PM

Originally Posted by Ajax

As mentioned, your description is not clear and I suggested you provide some sample data to demonstrate what you are trying to achieve

you said
please expand on this

I like to think there is no requirement in the world that can't be precisely expressed in words and sentences, so if I have failed to do that, I apologize. Here is another try - this is the precise thing I want to return.

- There are 5 million rows in the table
- Each row indicates a Stock (col7) and a Date in col1 (not really Date, rather text, but I'm calling it date, it's the value that is YYYYMMDD in text). An example row would have 20101015 in col1 and AAPL in col7
- There are about 3000 distinct Stocks
- There are about 1700 distinct Dates
- I want to identify all Stocks which contains NO ROWS (not even one) indicating a Date that is present anywhere in the table.
Another way to say this: For every unique Date (col1) existing in the table, I want to identify any situations where there is a Stock that contains NO ROWS with that Date (col1).

The idea in plain English is, if a Date is common to the general data, then I want to identify Stocks missing that date. Not any rows missing that date, for there will be numerous. Just any scenarios where for a particular stock, there are ZERO rows with that date. sample data.zip
I'm willing to create several queries and then a final query that reads the earlier ones - that's OK with me.

I have posted actual sample data in a text file so you can see what I mean. In this example, in the query result, there should be something that indicates that XNCR is missing 20130919. Because 20130919 is one of the dates that exists in the data, except it does not exist where col1=XNCR. Therefore I need to know that there is a stock XNCR which is missing a date that exists in the data somewhere.

**CJ_London** · 12-20-2015, 08:54 PM

Assuming I have understood correctly, try this

Code:

SELECT C.col1, C.col7
FROM (SELECT col1, col7 FROM (SELECT distinct col1 from table1) as A, 
     (SELECT distinct col7 from table1) as B)  AS C 
         LEFT JOIN Table1 ON (C.col7 = Table1.col7) AND (C.col1 = Table1.col1)
WHERE (((Table1.col1) Is Null))

in your sample data you have 3 values for col7 - XLRN, XNCR, XNPT
for col 1 value 20130919 you have XLRN, XNPT. The query returns XNCR

**orange** · 12-21-2015, 07:28 AM

This may be similar to Ajax's response.
I created a table Cols17 which contained the original data.
The processed cols17 to put the distinct col1 into a table Col1Distinct, and then the distinct col7 into a table DistinctCol7.

Then a cartesian product of Col1Distinct, Col7Distinct ---this gives all possible combinations.

Then removed all the combinations that you had originally

Click image for larger version.

Name: Query.jpg
Views: 17
Size: 15.2 KB
ID: 23122

Hopefully this leaves all combinations that were missing in the original data.

Code:

select col1, col7 from
(SELECT col1Distinct.col1, col7Distinct.col7
FROM col1Distinct, col7Distinct)
where col1&col7 not in
(select col1&col7 from cols17)

**CJ_London** · 12-21-2015, 07:53 AM

yup similar

**ipisors** · 12-21-2015, 11:54 AM

Ajax and Orange:
Your replies and help are greatly appreciated. I'm humbled that I had a mental block to producing this code. I will be able to test this tonight, but it looks like what I needed.

**CJ_London** · 12-21-2015, 12:42 PM

don't forget to index your fields - and try both methods - will be interested to know which is faster

**ipisors** · 12-21-2015, 01:37 PM

Yes, I will. I'm grateful that MS Access does support some DDL, I believe I will drop and re-create the table without indexes, do the insert, then alter it, via DAO/VBA.
Thanks again, will report back w/results.

I need help with a query

Thread Tools

I need help with a query

Posting Permissions