HAVING id IN (SELECT TOP 10 PERCENT ... Question !!

**johnseito** · 06-17-2014, 08:36 PM

Hi I have a SQL query that I don't quite understand why is producing what is producing.

Here is the example of the query statement:

Code:

SELECT tableEXAMPLE.ID, tableEXAMPLE.B, tableEXAMPLE.C, tableEXAMPLE.D, tableEXAMPLE.E
FROM tableEXAMPLE
GROUP BY tableEXAMPLE.ID, tableEXAMPLE.B, tableEXAMPLE.C, tableEXAMPLE.D, tableEXAMPLE.E
HAVING  ((
                    (tableEXAMPLE.ID) IN (SELECT TOP 10 PERCENT ID FROM tableEXAMPLE as q
                    WHERE q.C = tableEXAMPLE.C and t.D = tableEXAMPLE.D ORDER BY id)));

tableEXAMPLE has 1304 records, ID is a unique identifier for all 1304 records. So GROUP BY is
not condensing any of the 1304 records. Field C and D can have duplicated records. After running
the SQL statement I got a total of 195 records as it's result.

My question is why and how the SQL statement on top is different than this one below,
I get a result of 131 records which is 10% of 1304.

How is the top one giving a very different result (195 records) than the bottom, I don't understand even
after I did my own investigation. Could someone help explain this to me. Thanks, appreciate it !!! :-)

Code:

SELECT TOP 10 PERCENT tableEXAMPLE.ID, tableEXAMPLE.B, tableEXAMPLE.C, tableEXAMPLE.D, tableEXAMPLE.E
FROM tableEXAMPLE
GROUP BY tableEXAMPLE.ID, tableEXAMPLE.B, tableEXAMPLE.C, tableEXAMPLE.D, tableEXAMPLE.E

**June7** · 06-18-2014, 12:39 AM

Why use GROUP BY if no aggregation?

Not really understanding the nested SELECT. Baffled how it even works at all. What is the alias t table? Why using table aliases?

**John_G** · 06-18-2014, 10:45 AM

Hi -

Your first example shouldn't work at all - it should give you an error; what is "t" in t.D?

Like June7, I don't quite see the logic there, but the higher-than-expected number of resulting rows is probaly due to the fact that there are duplications in the data.

John

**lfpm062010** · 06-18-2014, 11:09 AM

It like what you say. The field C and D could have duplicate records. So, when you use top 10 on match those criteria. You are selecting up to 10 % duplicate records for one of C and D. My guess those extra 64 records will be the top 10 matching C and D (for each unique ID).

**johnseito** · 06-18-2014, 07:36 PM

Originally Posted by June7

Why use GROUP BY if no aggregation?

Not mine, I am just trying to understand someone's logic in their code.

I guess GROUP BY so they could then using having and nested SELECT statement.

Not really understanding the nested SELECT. Baffled how it even works at all. What is the alias t table? Why using table aliases?

I would say the t alias, one is calling one of the same tables, that is the SELECT TOP 10 PERCENT ID FROM, and one that is from the outer SELECT
FROM.

That is why there is the alias t. using the table twice.

I actually made a minor mistake with the code.

Code:

WHERE q.C = tableEXAMPLE.C and t.D = tableEXAMPLE.D ORDER BY id)));

should be q.D instead

Code:

WHERE q.C = tableEXAMPLE.C and q.D = tableEXAMPLE.D ORDER BY id)));

**johnseito** · 06-18-2014, 07:38 PM

Originally Posted by John_G

Hi -

Your first example shouldn't work at all - it should give you an error; what is "t" in t.D?

Like June7, I don't quite see the logic there, but the higher-than-expected number of resulting rows is probaly due to the fact that there are duplications in the data.

John

Hi John,

Thanks for your response and help in looking at this.

But I actually made a minor mistake with the code.

Code:

WHERE q.C = tableEXAMPLE.C and t.D = tableEXAMPLE.D ORDER BY id)));

should be q.D instead

Code:

WHERE q.C = tableEXAMPLE.C and q.D = tableEXAMPLE.D ORDER BY id)));

Thanks !!!

I don't understand what that code is trying to achieve.

**johnseito** · 06-18-2014, 07:44 PM

Originally Posted by lfpm062010

It like what you say. The field C and D could have duplicate records. So, when you use top 10 on match those criteria. You are selecting up to 10 % duplicate records for one of C and D. My guess those extra 64 records will be the top 10 matching C and D (for each unique ID).

Ok, I put the information on excel and tried to replicate what the SQL statement is doing.

1304 records - 10% 1304 = about 131.

When I replicated it in EXCEL of the same logic behind the SQL statement, I got 138 records.
I don't know what is the logic in the SQL statement to give a result of 195.

**June7** · 06-18-2014, 08:53 PM

Can use nested subqueries without having to make GROUP BY query.

If ID is truly unique in tableEXAMPLE, then I also do not understand the results of the nested query. If you want to provide db for analysis, follow instructions at bottom of my post.

However, if the TOP 10 PERCENT gives what you want, why bother with the nesting?

**johnseito** · 06-22-2014, 07:49 AM

Originally Posted by June7

Can use nested subqueries without having to make GROUP BY query.

Ok, good to know.

If ID is truly unique in tableEXAMPLE, then I also do not understand the results of the nested query. If you want to provide db for analysis, follow instructions at bottom of my post.

exDATABASE.zip
I attached an example. A is unique, B and C have duplicate. I am not sure how it gets 195 items and
without the having ID in, just select the % (Query - 131) has 131 items). Not sure what is the difference.
Hope we could figure it out. thanks !!! :-)

However, if the TOP 10 PERCENT gives what you want, why bother with the nesting?

I don't think is just 10%, is 10% with a condition.

Thanks !!! :-)

**June7** · 06-22-2014, 11:01 AM

The difference in results is due to the WHERE clause in the nested query - which is not in the Query-131 object. As already pointed out, this causes duplications of data. Remove the table alias and WHERE clause from the inner query and the results will be same as Query-131. Why bother with nested query anyway? What are you really trying to accomplish with it?

Note the same result in Query-131 without GROUP BY.

SELECT TOP 10 PERCENT * FROM exampleTABLE;

**johnseito** · 06-22-2014, 07:14 PM

What are you really trying to accomplish with it?

I am not sure what is trying to accomplish too, that is what I am trying to find out, and hence this thread.
I am puzzle why another SELECT statement is used in the HAVING CLAUSE, why it needs that
and that our items are more (you are correct, our result is more because of the where clause)
but why is that even necessary, maybe the SELECT is necessary to get what we want the 195 items and without it, just the 10%
is just wrong.

Do you know in the code
exampleTABLE as t and when WHERE t.B = exampleTABLE.B and t.C = exampleTABLE.C order by id

is the t table (exampleTABLE) within the SELECT statement matching to the first table from the select statement (exampleTABLE) ?
not sure how that logic with the SQL statement is flow.

So we have the first SELECT, it GROUP BY all the fields however because id is unique, the group by doesn't take effect.
so we still have 1304 items.

these items from this table are then matched to the same table (alias) in field B, and C.

My questions is how is the logic flow:
From the alias table exampleTABLE as t, top 10% of it is retrieved, and then it matched it to
the first exampleTABLE of the 1304 records ?

Let's just say that field B is for example the title of a book, and C are the count of that book and id is unique, and field A is unique as
well as it is the identity of each of the books (1304 of them).

**June7** · 06-22-2014, 07:47 PM

I certainly don't know what the developer's intent was. Several of us have attempted to interpret the results and identify how they are generated. Why the query was built, we can't know. Is it used in a report? If it isn't used anywhere, then perhaps the developer was just exploring and analyzing and left the query in place even after the research was completed.

If you don't know the reason the query was built, ignore it or delete. Especially if it returns dataset you have no use for.

**johnseito** · 06-23-2014, 09:53 PM

If you don't know the reason the query was built, ignore it or delete. Especially if it returns dataset you have no use for.

Hi June, I like to follow your suggestion and I think is a great advice. However, the thing is that when I can't figure out something, because I hope to learn from it,
it really bugs me and then I just don't take SQL or the next problem as seriously and I am afraid I can become a slack off.

actually this is not just the sole SQL statement for the project, is just one piece of it. I know what the general goal is,
I am just looking at it step by step to see how and what the developer is doing to achieve the goal. This is the statement that I got stuck in.

one thing that really puzzle me is how it get 195 when I think the most it should have gotten is 138.

The reason I came up with 138 is because after I took out the WHERE clause below:

WHERE t.B =exampleTABLE.B and t.C = exampleTABLE.C Order by id

yes I got a total line item of 131. Then I looked at the table "exampleTABLE" (total of 1304 line items)
and compare it to the 131 result, I only see 7 result that matches to the 131 result of field B and field C
for the 131 (result) and 1304 (original). And we are using the WHERE CLAUSE where t.B = exampleTABLE.B and t.C = exampleTABLE.C.
So if we are doing these matches, I think the result should only be 138 instead.

Thanks and I appreciate you looking at it.
If we can't find the what, how, and why, then I will just move on.
Thanks for your advice. :-)

**June7** · 06-24-2014, 12:05 AM

It's really a very weird query and I am not able to fully understand it. If the nested query is run independently it fails because it can't find exampleTABLE, it only has the alias [t] table. But embedded as nested query it will execute. I have no idea why the developer want to do a self-join, which is what the WHERE clause accomplishes.

The WHERE clause is a filter criteria that wants all records where t.B=exampleTABLE.B AND t.C=exampleTABLE.B. This returns 195 records. If you filter on only one of the criteria the results are:

t.B=exampleTABLE.B returns 195 records

t.C=exampleTABLE.C returns 142 records

The same 195 records of the first filter meet the combined filter.

Look what happens with a self-join query on the B and C fields:

SELECT exampleTABLE.ID, exampleTABLE_1.ID
FROM exampleTABLE AS exampleTABLE_1 INNER JOIN exampleTABLE ON (exampleTABLE_1.C = exampleTABLE.C) AND (exampleTABLE_1.B = exampleTABLE.B);

The result is 35612 records.

The same result with WHERE clause instead of INNER JOIN.

SELECT exampleTABLE.ID
FROM exampleTABLE, exampleTABLE AS exampleTABLE_1
WHERE (((exampleTABLE.B)=[exampleTABLE_1].[B]) AND ((exampleTABLE.C)=[exampleTABLE_1].[C])) ORDER BY exampleTable.ID;

Now add TOP 10 PERCENT and GROUP BY clause and 131 records return.

SELECT TOP 10 PERCENT exampleTABLE.ID
FROM exampleTABLE, exampleTABLE AS exampleTABLE_1
WHERE (((exampleTABLE.B)=[exampleTABLE_1].[B]) AND ((exampleTABLE.C)=[exampleTABLE_1].[C]))
GROUP BY exampleTABLE.ID
ORDER BY exampleTABLE.ID;

So why the self-join? Why the nested? What did developer want to learn from this? No idea. Makes no sense to me.

**hapm** · 06-24-2014, 01:27 AM

The nested query will be evaluated for every record in the normal query. If the record with given B and C is not in the first 10% of all records with the given B and C it will be removed from the query. Only a first guess without executing the query.

HAVING id IN (SELECT TOP 10 PERCENT ... Question !!

Thread Tools

HAVING id IN (SELECT TOP 10 PERCENT ... Question !!

Having Id in SELECT

Similar Threads

percent

Question about a select query.

Select Case Question & Combo Box Question

Simple Combo Box Multiple Select Question

Select query question

Posting Permissions