PostgreSQL: Correct way for subset calculations
I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id
and activity_id
like this
user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2
I want to select all user_id
with activity_id = 1, 2 NOT 3
. In the case above it will be just one result: user_id = 2
. I can do it using SELECT DISTINCT
combined with INTERSECT
and EXCEPT
operators, but it seems to be extremely slow.
From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).
postgresql
New contributor
add a comment |
I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id
and activity_id
like this
user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2
I want to select all user_id
with activity_id = 1, 2 NOT 3
. In the case above it will be just one result: user_id = 2
. I can do it using SELECT DISTINCT
combined with INTERSECT
and EXCEPT
operators, but it seems to be extremely slow.
From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).
postgresql
New contributor
Please post your table(s) definition(s), query, and execution plan.
– Colin 't Hart
11 hours ago
@a_horse_with_no_nameactivity_id=4
changes nothing in such a case. It's 10.000.000 rows.
– Ximik
11 hours ago
When you sayactivity id = 1, 2
do you mean 1 AND 2, or 1 OR 2?
– Evan Carroll
9 hours ago
add a comment |
I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id
and activity_id
like this
user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2
I want to select all user_id
with activity_id = 1, 2 NOT 3
. In the case above it will be just one result: user_id = 2
. I can do it using SELECT DISTINCT
combined with INTERSECT
and EXCEPT
operators, but it seems to be extremely slow.
From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).
postgresql
New contributor
I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id
and activity_id
like this
user_id | activity_id
---------------------
1 | 1
1 | 2
1 | 3
2 | 1
2 | 2
I want to select all user_id
with activity_id = 1, 2 NOT 3
. In the case above it will be just one result: user_id = 2
. I can do it using SELECT DISTINCT
combined with INTERSECT
and EXCEPT
operators, but it seems to be extremely slow.
From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).
postgresql
postgresql
New contributor
New contributor
edited 9 hours ago
Evan Carroll
31.5k865210
31.5k865210
New contributor
asked 13 hours ago
XimikXimik
1063
1063
New contributor
New contributor
Please post your table(s) definition(s), query, and execution plan.
– Colin 't Hart
11 hours ago
@a_horse_with_no_nameactivity_id=4
changes nothing in such a case. It's 10.000.000 rows.
– Ximik
11 hours ago
When you sayactivity id = 1, 2
do you mean 1 AND 2, or 1 OR 2?
– Evan Carroll
9 hours ago
add a comment |
Please post your table(s) definition(s), query, and execution plan.
– Colin 't Hart
11 hours ago
@a_horse_with_no_nameactivity_id=4
changes nothing in such a case. It's 10.000.000 rows.
– Ximik
11 hours ago
When you sayactivity id = 1, 2
do you mean 1 AND 2, or 1 OR 2?
– Evan Carroll
9 hours ago
Please post your table(s) definition(s), query, and execution plan.
– Colin 't Hart
11 hours ago
Please post your table(s) definition(s), query, and execution plan.
– Colin 't Hart
11 hours ago
@a_horse_with_no_name
activity_id=4
changes nothing in such a case. It's 10.000.000 rows.– Ximik
11 hours ago
@a_horse_with_no_name
activity_id=4
changes nothing in such a case. It's 10.000.000 rows.– Ximik
11 hours ago
When you say
activity id = 1, 2
do you mean 1 AND 2, or 1 OR 2?– Evan Carroll
9 hours ago
When you say
activity id = 1, 2
do you mean 1 AND 2, or 1 OR 2?– Evan Carroll
9 hours ago
add a comment |
3 Answers
3
active
oldest
votes
You can easily do this with arrays in Postgres:
select user_id, array_agg(activity_id) as activities
from users
group by user_id
having array_agg(activity_id) @> array[1,2]
and not 3 = any(array_agg(activity_id));
The condition array_agg(activity_id) @> array[1,2]
only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id))
removes all those that contain activity_id = 3
If the table contains more than just those two columns, an index on (user_id, activitiy_id)
will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists
condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id)
to enable Postgres to remove unwanted rows efficiently.
On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.
Online example: https://rextester.com/YLN7221
add a comment |
You can first try to rewrite the query using EXISTS
and a regular (B-tree) index on user_id
and activity_id
.
CREATE INDEX elbat_user_id_activity_id
ON elbat (user_id,
activity_id);
SELECT DISTINCT t1.user_id
FROM elbat t1
WHERE EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '1')
AND EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '2')
AND NOT EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '3');
If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.
add a comment |
You can use bool_or
for this.
SELECT user_id
FROM users
GROUP BY user_id
HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
AND NOT bool_or(activity_id IN (3))
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Ximik is a new contributor. Be nice, and check out our Code of Conduct.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227083%2fpostgresql-correct-way-for-subset-calculations%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
You can easily do this with arrays in Postgres:
select user_id, array_agg(activity_id) as activities
from users
group by user_id
having array_agg(activity_id) @> array[1,2]
and not 3 = any(array_agg(activity_id));
The condition array_agg(activity_id) @> array[1,2]
only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id))
removes all those that contain activity_id = 3
If the table contains more than just those two columns, an index on (user_id, activitiy_id)
will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists
condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id)
to enable Postgres to remove unwanted rows efficiently.
On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.
Online example: https://rextester.com/YLN7221
add a comment |
You can easily do this with arrays in Postgres:
select user_id, array_agg(activity_id) as activities
from users
group by user_id
having array_agg(activity_id) @> array[1,2]
and not 3 = any(array_agg(activity_id));
The condition array_agg(activity_id) @> array[1,2]
only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id))
removes all those that contain activity_id = 3
If the table contains more than just those two columns, an index on (user_id, activitiy_id)
will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists
condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id)
to enable Postgres to remove unwanted rows efficiently.
On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.
Online example: https://rextester.com/YLN7221
add a comment |
You can easily do this with arrays in Postgres:
select user_id, array_agg(activity_id) as activities
from users
group by user_id
having array_agg(activity_id) @> array[1,2]
and not 3 = any(array_agg(activity_id));
The condition array_agg(activity_id) @> array[1,2]
only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id))
removes all those that contain activity_id = 3
If the table contains more than just those two columns, an index on (user_id, activitiy_id)
will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists
condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id)
to enable Postgres to remove unwanted rows efficiently.
On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.
Online example: https://rextester.com/YLN7221
You can easily do this with arrays in Postgres:
select user_id, array_agg(activity_id) as activities
from users
group by user_id
having array_agg(activity_id) @> array[1,2]
and not 3 = any(array_agg(activity_id));
The condition array_agg(activity_id) @> array[1,2]
only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id))
removes all those that contain activity_id = 3
If the table contains more than just those two columns, an index on (user_id, activitiy_id)
will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists
condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id)
to enable Postgres to remove unwanted rows efficiently.
On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.
Online example: https://rextester.com/YLN7221
edited 12 hours ago
answered 12 hours ago
a_horse_with_no_namea_horse_with_no_name
38.9k775112
38.9k775112
add a comment |
add a comment |
You can first try to rewrite the query using EXISTS
and a regular (B-tree) index on user_id
and activity_id
.
CREATE INDEX elbat_user_id_activity_id
ON elbat (user_id,
activity_id);
SELECT DISTINCT t1.user_id
FROM elbat t1
WHERE EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '1')
AND EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '2')
AND NOT EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '3');
If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.
add a comment |
You can first try to rewrite the query using EXISTS
and a regular (B-tree) index on user_id
and activity_id
.
CREATE INDEX elbat_user_id_activity_id
ON elbat (user_id,
activity_id);
SELECT DISTINCT t1.user_id
FROM elbat t1
WHERE EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '1')
AND EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '2')
AND NOT EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '3');
If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.
add a comment |
You can first try to rewrite the query using EXISTS
and a regular (B-tree) index on user_id
and activity_id
.
CREATE INDEX elbat_user_id_activity_id
ON elbat (user_id,
activity_id);
SELECT DISTINCT t1.user_id
FROM elbat t1
WHERE EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '1')
AND EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '2')
AND NOT EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '3');
If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.
You can first try to rewrite the query using EXISTS
and a regular (B-tree) index on user_id
and activity_id
.
CREATE INDEX elbat_user_id_activity_id
ON elbat (user_id,
activity_id);
SELECT DISTINCT t1.user_id
FROM elbat t1
WHERE EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '1')
AND EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '2')
AND NOT EXISTS (SELECT *
FROM elbat t2
WHERE t2.user_id = t1.user_id
AND t2.activity_id = '3');
If you have a user table, you might also want to join to that instead of retrieving the distinct user ID from the other table.
answered 12 hours ago
sticky bitsticky bit
1,733314
1,733314
add a comment |
add a comment |
You can use bool_or
for this.
SELECT user_id
FROM users
GROUP BY user_id
HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
AND NOT bool_or(activity_id IN (3))
add a comment |
You can use bool_or
for this.
SELECT user_id
FROM users
GROUP BY user_id
HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
AND NOT bool_or(activity_id IN (3))
add a comment |
You can use bool_or
for this.
SELECT user_id
FROM users
GROUP BY user_id
HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
AND NOT bool_or(activity_id IN (3))
You can use bool_or
for this.
SELECT user_id
FROM users
GROUP BY user_id
HAVING bool_or(activity_id IN (1,2)) -- assumes '1, 2' mean '1 OR 2'
AND NOT bool_or(activity_id IN (3))
answered 9 hours ago
Evan CarrollEvan Carroll
31.5k865210
31.5k865210
add a comment |
add a comment |
Ximik is a new contributor. Be nice, and check out our Code of Conduct.
Ximik is a new contributor. Be nice, and check out our Code of Conduct.
Ximik is a new contributor. Be nice, and check out our Code of Conduct.
Ximik is a new contributor. Be nice, and check out our Code of Conduct.
Thanks for contributing an answer to Database Administrators Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227083%2fpostgresql-correct-way-for-subset-calculations%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Please post your table(s) definition(s), query, and execution plan.
– Colin 't Hart
11 hours ago
@a_horse_with_no_name
activity_id=4
changes nothing in such a case. It's 10.000.000 rows.– Ximik
11 hours ago
When you say
activity id = 1, 2
do you mean 1 AND 2, or 1 OR 2?– Evan Carroll
9 hours ago