Optimize spike removal using PostgreSQL
I want to remove spikes directly from my data stored in a PostgreSQL-DB with TimescaleDB.
My data is stored as values with an interval of 1 second, I want to get 5 minute averages calculated without spikes.
I determine spikes using standard deviation and excluding all data that is more than a fixed zscore away.
So in a first step I get all data relevant for my analysis (data_filtered), then I calculate average and standard deviation for each 5-minute chunk (avg_and_stddev_per_interval), then I join the initial data (data_filtered) with the calculated avg and stddev, exclude all values not meeting my criteria and finally calculate the final 5-minute average without spikes.
with data_filtered as (
select ts, value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 10:00:00'
and ts < '2018-11-13 10:00:00'
),
avg_and_stddev_per_interval as (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as avg_value,
stddev(value) as stddev_value,
from data_filtered
group by five_min
)
select
time_bucket('5 minutes', ts) as tb,
avg(value) as value,
from data_filtered
left join avg_and_stddev_per_interval
on data_filtered.ts >= avg_and_stddev_per_interval.five_min
and data_filtered.ts < avg_and_stddev_per_interval.five_min + interval '5 minutes'
where abs((value-avg_value)/stddev_value) < 1
group by tb;
It all works well, but it is incredibly slow. Requesting the full data without any grouping (select * from data_filtered) and calculating my criteria locally is much faster. I want to reduce the data volume, however, so this approach is not possible in this case.
Is there any way to accelerate my query?
postgresql optimization timescaledb
add a comment |
I want to remove spikes directly from my data stored in a PostgreSQL-DB with TimescaleDB.
My data is stored as values with an interval of 1 second, I want to get 5 minute averages calculated without spikes.
I determine spikes using standard deviation and excluding all data that is more than a fixed zscore away.
So in a first step I get all data relevant for my analysis (data_filtered), then I calculate average and standard deviation for each 5-minute chunk (avg_and_stddev_per_interval), then I join the initial data (data_filtered) with the calculated avg and stddev, exclude all values not meeting my criteria and finally calculate the final 5-minute average without spikes.
with data_filtered as (
select ts, value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 10:00:00'
and ts < '2018-11-13 10:00:00'
),
avg_and_stddev_per_interval as (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as avg_value,
stddev(value) as stddev_value,
from data_filtered
group by five_min
)
select
time_bucket('5 minutes', ts) as tb,
avg(value) as value,
from data_filtered
left join avg_and_stddev_per_interval
on data_filtered.ts >= avg_and_stddev_per_interval.five_min
and data_filtered.ts < avg_and_stddev_per_interval.five_min + interval '5 minutes'
where abs((value-avg_value)/stddev_value) < 1
group by tb;
It all works well, but it is incredibly slow. Requesting the full data without any grouping (select * from data_filtered) and calculating my criteria locally is much faster. I want to reduce the data volume, however, so this approach is not possible in this case.
Is there any way to accelerate my query?
postgresql optimization timescaledb
2
Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?
– 404
Nov 13 '18 at 12:23
add a comment |
I want to remove spikes directly from my data stored in a PostgreSQL-DB with TimescaleDB.
My data is stored as values with an interval of 1 second, I want to get 5 minute averages calculated without spikes.
I determine spikes using standard deviation and excluding all data that is more than a fixed zscore away.
So in a first step I get all data relevant for my analysis (data_filtered), then I calculate average and standard deviation for each 5-minute chunk (avg_and_stddev_per_interval), then I join the initial data (data_filtered) with the calculated avg and stddev, exclude all values not meeting my criteria and finally calculate the final 5-minute average without spikes.
with data_filtered as (
select ts, value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 10:00:00'
and ts < '2018-11-13 10:00:00'
),
avg_and_stddev_per_interval as (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as avg_value,
stddev(value) as stddev_value,
from data_filtered
group by five_min
)
select
time_bucket('5 minutes', ts) as tb,
avg(value) as value,
from data_filtered
left join avg_and_stddev_per_interval
on data_filtered.ts >= avg_and_stddev_per_interval.five_min
and data_filtered.ts < avg_and_stddev_per_interval.five_min + interval '5 minutes'
where abs((value-avg_value)/stddev_value) < 1
group by tb;
It all works well, but it is incredibly slow. Requesting the full data without any grouping (select * from data_filtered) and calculating my criteria locally is much faster. I want to reduce the data volume, however, so this approach is not possible in this case.
Is there any way to accelerate my query?
postgresql optimization timescaledb
I want to remove spikes directly from my data stored in a PostgreSQL-DB with TimescaleDB.
My data is stored as values with an interval of 1 second, I want to get 5 minute averages calculated without spikes.
I determine spikes using standard deviation and excluding all data that is more than a fixed zscore away.
So in a first step I get all data relevant for my analysis (data_filtered), then I calculate average and standard deviation for each 5-minute chunk (avg_and_stddev_per_interval), then I join the initial data (data_filtered) with the calculated avg and stddev, exclude all values not meeting my criteria and finally calculate the final 5-minute average without spikes.
with data_filtered as (
select ts, value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 10:00:00'
and ts < '2018-11-13 10:00:00'
),
avg_and_stddev_per_interval as (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as avg_value,
stddev(value) as stddev_value,
from data_filtered
group by five_min
)
select
time_bucket('5 minutes', ts) as tb,
avg(value) as value,
from data_filtered
left join avg_and_stddev_per_interval
on data_filtered.ts >= avg_and_stddev_per_interval.five_min
and data_filtered.ts < avg_and_stddev_per_interval.five_min + interval '5 minutes'
where abs((value-avg_value)/stddev_value) < 1
group by tb;
It all works well, but it is incredibly slow. Requesting the full data without any grouping (select * from data_filtered) and calculating my criteria locally is much faster. I want to reduce the data volume, however, so this approach is not possible in this case.
Is there any way to accelerate my query?
postgresql optimization timescaledb
postgresql optimization timescaledb
asked Nov 13 '18 at 11:11
TomTom
265
265
2
Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?
– 404
Nov 13 '18 at 12:23
add a comment |
2
Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?
– 404
Nov 13 '18 at 12:23
2
2
Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?
– 404
Nov 13 '18 at 12:23
Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?
– 404
Nov 13 '18 at 12:23
add a comment |
3 Answers
3
active
oldest
votes
The simplest way is to replace the CTE parts by (temp) views. This will allow the optimiser to shuffle and re-assemble the query parts.
CREATE TEMP VIEW data_filtered as
SELECT ts, value
FROM schema.table
WHERE some_criteria = 42
AND ts >= '2018-11-12 10:00:00'
AND ts < '2018-11-13 10:00:00'
;
CREATE TEMP VIEW avg_and_stddev_per_interval as
SELECT time_bucket('5 minutes', ts) as five_min
, avg(value) as avg_value
, stddev(value) as stddev_value
FROM data_filtered
GROUP BY 1
;
SELECT
time_bucket('5 minutes', ts) as tb
, avg(value) as value
FROM data_filtered df
LEFT JOIN avg_and_stddev_per_interval av
ON df.ts >= av.five_min
AND df.ts < av.five_min + interval '5 minutes'
WHERE abs((value-avg_value)/stddev_value) < 1
GROUP BY 1
;
This is exactly as fast/slow as the solution without any CTEs/temp views.
– Tom
Nov 13 '18 at 15:36
In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clauseWHERE abs((value-avg_value)/stddev_value) < 1could be avoided.
– joop
Nov 13 '18 at 15:48
This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.
– Tom
Nov 13 '18 at 16:12
I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.
– joop
Nov 13 '18 at 17:40
add a comment |
eurotrash' comment lead to a much faster code as follows:
select
time_bucket('5 minutes', ts) as tb, avg(value) as value
from schema.table
left join (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value,
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min
) as fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;
Here I got rid of any CTEs which were there only for readability anyhow.
This is still 8 times slower than just requesting averaged values without spike removal, though.
Explain analyze:
Sort (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)
Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))
Sort Method: quicksort Memory: 65kB
-> HashAggregate (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)
Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))
-> Nested Loop (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)
Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))
Rows Removed by Join Filter: 24770914
-> Append (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)
-> Seq Scan on data (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 949135
-> Materialize (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)
-> Subquery Scan on fm (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)
-> Finalize GroupAggregate (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)
Group Key: (time_bucket('00:05:00'::interval, data_1.ts))
-> Sort (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)
Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))
Sort Method: quicksort Memory: 251kB
-> Gather (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)
Group Key: time_bucket('00:05:00'::interval, data_1.ts)
-> Result (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)
-> Append (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)
-> Parallel Seq Scan on data data_1 (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1 (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 316378
Planning time: 17.704 ms
Execution time: 24093.223 ms
Can you post anEXPLAIN ANALYZEso we can see how it's being executed?
– 404
Nov 13 '18 at 14:44
Added explain analyze.
– Tom
Nov 13 '18 at 15:42
Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.
– 404
Nov 13 '18 at 16:15
The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.
– Tom
Nov 13 '18 at 16:30
OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.
– 404
Nov 13 '18 at 16:33
|
show 2 more comments
It appears the worst of the performance is happening in the JOIN (as per the query in your answer, not in your question). Ideally you wouldn't be joining on a subquery when it returns a lot of results, but I don't see how you can avoid it given your criteria.
So this is my suggestion:
- The subquery results are put into a temp table
- The temp table is indexed
- The join is performed on the temp table
- Encapsulate all this in a function
Now I generally hate doing this as I don't like creating temp tables, but sometimes it really does give you the best performance for something that can't be done any other way. (Not saying it can't be done another way, but I can't think of a better performant way.)
So something like this:
CREATE OR REPLACE FUNCTION schema.my_function()
RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS
$BODY$
BEGIN
CREATE TEMP TABLE fm ON COMMIT DROP AS
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min;
CREATE INDEX ON fm (five_min);
RETURN time_bucket('5 minutes', ts), avg(value)
from schema.table
left join fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;
END
$BODY$
LANGUAGE plpgsql;
Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.
I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.
Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow
– Tom
Nov 13 '18 at 17:20
Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.
– 404
Nov 13 '18 at 17:25
The temp table & index creations don't take long at all, it is really just the bare query that takes longer.
– Tom
Nov 13 '18 at 17:52
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53279732%2foptimize-spike-removal-using-postgresql%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
The simplest way is to replace the CTE parts by (temp) views. This will allow the optimiser to shuffle and re-assemble the query parts.
CREATE TEMP VIEW data_filtered as
SELECT ts, value
FROM schema.table
WHERE some_criteria = 42
AND ts >= '2018-11-12 10:00:00'
AND ts < '2018-11-13 10:00:00'
;
CREATE TEMP VIEW avg_and_stddev_per_interval as
SELECT time_bucket('5 minutes', ts) as five_min
, avg(value) as avg_value
, stddev(value) as stddev_value
FROM data_filtered
GROUP BY 1
;
SELECT
time_bucket('5 minutes', ts) as tb
, avg(value) as value
FROM data_filtered df
LEFT JOIN avg_and_stddev_per_interval av
ON df.ts >= av.five_min
AND df.ts < av.five_min + interval '5 minutes'
WHERE abs((value-avg_value)/stddev_value) < 1
GROUP BY 1
;
This is exactly as fast/slow as the solution without any CTEs/temp views.
– Tom
Nov 13 '18 at 15:36
In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clauseWHERE abs((value-avg_value)/stddev_value) < 1could be avoided.
– joop
Nov 13 '18 at 15:48
This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.
– Tom
Nov 13 '18 at 16:12
I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.
– joop
Nov 13 '18 at 17:40
add a comment |
The simplest way is to replace the CTE parts by (temp) views. This will allow the optimiser to shuffle and re-assemble the query parts.
CREATE TEMP VIEW data_filtered as
SELECT ts, value
FROM schema.table
WHERE some_criteria = 42
AND ts >= '2018-11-12 10:00:00'
AND ts < '2018-11-13 10:00:00'
;
CREATE TEMP VIEW avg_and_stddev_per_interval as
SELECT time_bucket('5 minutes', ts) as five_min
, avg(value) as avg_value
, stddev(value) as stddev_value
FROM data_filtered
GROUP BY 1
;
SELECT
time_bucket('5 minutes', ts) as tb
, avg(value) as value
FROM data_filtered df
LEFT JOIN avg_and_stddev_per_interval av
ON df.ts >= av.five_min
AND df.ts < av.five_min + interval '5 minutes'
WHERE abs((value-avg_value)/stddev_value) < 1
GROUP BY 1
;
This is exactly as fast/slow as the solution without any CTEs/temp views.
– Tom
Nov 13 '18 at 15:36
In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clauseWHERE abs((value-avg_value)/stddev_value) < 1could be avoided.
– joop
Nov 13 '18 at 15:48
This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.
– Tom
Nov 13 '18 at 16:12
I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.
– joop
Nov 13 '18 at 17:40
add a comment |
The simplest way is to replace the CTE parts by (temp) views. This will allow the optimiser to shuffle and re-assemble the query parts.
CREATE TEMP VIEW data_filtered as
SELECT ts, value
FROM schema.table
WHERE some_criteria = 42
AND ts >= '2018-11-12 10:00:00'
AND ts < '2018-11-13 10:00:00'
;
CREATE TEMP VIEW avg_and_stddev_per_interval as
SELECT time_bucket('5 minutes', ts) as five_min
, avg(value) as avg_value
, stddev(value) as stddev_value
FROM data_filtered
GROUP BY 1
;
SELECT
time_bucket('5 minutes', ts) as tb
, avg(value) as value
FROM data_filtered df
LEFT JOIN avg_and_stddev_per_interval av
ON df.ts >= av.five_min
AND df.ts < av.five_min + interval '5 minutes'
WHERE abs((value-avg_value)/stddev_value) < 1
GROUP BY 1
;
The simplest way is to replace the CTE parts by (temp) views. This will allow the optimiser to shuffle and re-assemble the query parts.
CREATE TEMP VIEW data_filtered as
SELECT ts, value
FROM schema.table
WHERE some_criteria = 42
AND ts >= '2018-11-12 10:00:00'
AND ts < '2018-11-13 10:00:00'
;
CREATE TEMP VIEW avg_and_stddev_per_interval as
SELECT time_bucket('5 minutes', ts) as five_min
, avg(value) as avg_value
, stddev(value) as stddev_value
FROM data_filtered
GROUP BY 1
;
SELECT
time_bucket('5 minutes', ts) as tb
, avg(value) as value
FROM data_filtered df
LEFT JOIN avg_and_stddev_per_interval av
ON df.ts >= av.five_min
AND df.ts < av.five_min + interval '5 minutes'
WHERE abs((value-avg_value)/stddev_value) < 1
GROUP BY 1
;
answered Nov 13 '18 at 14:44
joopjoop
3,2581818
3,2581818
This is exactly as fast/slow as the solution without any CTEs/temp views.
– Tom
Nov 13 '18 at 15:36
In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clauseWHERE abs((value-avg_value)/stddev_value) < 1could be avoided.
– joop
Nov 13 '18 at 15:48
This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.
– Tom
Nov 13 '18 at 16:12
I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.
– joop
Nov 13 '18 at 17:40
add a comment |
This is exactly as fast/slow as the solution without any CTEs/temp views.
– Tom
Nov 13 '18 at 15:36
In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clauseWHERE abs((value-avg_value)/stddev_value) < 1could be avoided.
– joop
Nov 13 '18 at 15:48
This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.
– Tom
Nov 13 '18 at 16:12
I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.
– joop
Nov 13 '18 at 17:40
This is exactly as fast/slow as the solution without any CTEs/temp views.
– Tom
Nov 13 '18 at 15:36
This is exactly as fast/slow as the solution without any CTEs/temp views.
– Tom
Nov 13 '18 at 15:36
In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause
WHERE abs((value-avg_value)/stddev_value) < 1 could be avoided.– joop
Nov 13 '18 at 15:48
In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause
WHERE abs((value-avg_value)/stddev_value) < 1 could be avoided.– joop
Nov 13 '18 at 15:48
This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.
– Tom
Nov 13 '18 at 16:12
This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.
– Tom
Nov 13 '18 at 16:12
I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.
– joop
Nov 13 '18 at 17:40
I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.
– joop
Nov 13 '18 at 17:40
add a comment |
eurotrash' comment lead to a much faster code as follows:
select
time_bucket('5 minutes', ts) as tb, avg(value) as value
from schema.table
left join (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value,
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min
) as fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;
Here I got rid of any CTEs which were there only for readability anyhow.
This is still 8 times slower than just requesting averaged values without spike removal, though.
Explain analyze:
Sort (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)
Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))
Sort Method: quicksort Memory: 65kB
-> HashAggregate (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)
Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))
-> Nested Loop (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)
Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))
Rows Removed by Join Filter: 24770914
-> Append (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)
-> Seq Scan on data (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 949135
-> Materialize (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)
-> Subquery Scan on fm (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)
-> Finalize GroupAggregate (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)
Group Key: (time_bucket('00:05:00'::interval, data_1.ts))
-> Sort (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)
Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))
Sort Method: quicksort Memory: 251kB
-> Gather (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)
Group Key: time_bucket('00:05:00'::interval, data_1.ts)
-> Result (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)
-> Append (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)
-> Parallel Seq Scan on data data_1 (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1 (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 316378
Planning time: 17.704 ms
Execution time: 24093.223 ms
Can you post anEXPLAIN ANALYZEso we can see how it's being executed?
– 404
Nov 13 '18 at 14:44
Added explain analyze.
– Tom
Nov 13 '18 at 15:42
Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.
– 404
Nov 13 '18 at 16:15
The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.
– Tom
Nov 13 '18 at 16:30
OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.
– 404
Nov 13 '18 at 16:33
|
show 2 more comments
eurotrash' comment lead to a much faster code as follows:
select
time_bucket('5 minutes', ts) as tb, avg(value) as value
from schema.table
left join (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value,
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min
) as fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;
Here I got rid of any CTEs which were there only for readability anyhow.
This is still 8 times slower than just requesting averaged values without spike removal, though.
Explain analyze:
Sort (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)
Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))
Sort Method: quicksort Memory: 65kB
-> HashAggregate (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)
Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))
-> Nested Loop (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)
Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))
Rows Removed by Join Filter: 24770914
-> Append (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)
-> Seq Scan on data (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 949135
-> Materialize (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)
-> Subquery Scan on fm (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)
-> Finalize GroupAggregate (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)
Group Key: (time_bucket('00:05:00'::interval, data_1.ts))
-> Sort (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)
Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))
Sort Method: quicksort Memory: 251kB
-> Gather (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)
Group Key: time_bucket('00:05:00'::interval, data_1.ts)
-> Result (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)
-> Append (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)
-> Parallel Seq Scan on data data_1 (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1 (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 316378
Planning time: 17.704 ms
Execution time: 24093.223 ms
Can you post anEXPLAIN ANALYZEso we can see how it's being executed?
– 404
Nov 13 '18 at 14:44
Added explain analyze.
– Tom
Nov 13 '18 at 15:42
Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.
– 404
Nov 13 '18 at 16:15
The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.
– Tom
Nov 13 '18 at 16:30
OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.
– 404
Nov 13 '18 at 16:33
|
show 2 more comments
eurotrash' comment lead to a much faster code as follows:
select
time_bucket('5 minutes', ts) as tb, avg(value) as value
from schema.table
left join (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value,
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min
) as fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;
Here I got rid of any CTEs which were there only for readability anyhow.
This is still 8 times slower than just requesting averaged values without spike removal, though.
Explain analyze:
Sort (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)
Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))
Sort Method: quicksort Memory: 65kB
-> HashAggregate (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)
Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))
-> Nested Loop (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)
Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))
Rows Removed by Join Filter: 24770914
-> Append (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)
-> Seq Scan on data (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 949135
-> Materialize (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)
-> Subquery Scan on fm (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)
-> Finalize GroupAggregate (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)
Group Key: (time_bucket('00:05:00'::interval, data_1.ts))
-> Sort (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)
Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))
Sort Method: quicksort Memory: 251kB
-> Gather (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)
Group Key: time_bucket('00:05:00'::interval, data_1.ts)
-> Result (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)
-> Append (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)
-> Parallel Seq Scan on data data_1 (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1 (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 316378
Planning time: 17.704 ms
Execution time: 24093.223 ms
eurotrash' comment lead to a much faster code as follows:
select
time_bucket('5 minutes', ts) as tb, avg(value) as value
from schema.table
left join (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value,
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min
) as fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;
Here I got rid of any CTEs which were there only for readability anyhow.
This is still 8 times slower than just requesting averaged values without spike removal, though.
Explain analyze:
Sort (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)
Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))
Sort Method: quicksort Memory: 65kB
-> HashAggregate (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)
Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))
-> Nested Loop (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)
Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))
Rows Removed by Join Filter: 24770914
-> Append (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)
-> Seq Scan on data (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 949135
-> Materialize (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)
-> Subquery Scan on fm (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)
-> Finalize GroupAggregate (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)
Group Key: (time_bucket('00:05:00'::interval, data_1.ts))
-> Sort (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)
Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))
Sort Method: quicksort Memory: 251kB
-> Gather (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)
Group Key: time_bucket('00:05:00'::interval, data_1.ts)
-> Result (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)
-> Append (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)
-> Parallel Seq Scan on data data_1 (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1 (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 316378
Planning time: 17.704 ms
Execution time: 24093.223 ms
edited Nov 13 '18 at 15:27
answered Nov 13 '18 at 13:29
TomTom
265
265
Can you post anEXPLAIN ANALYZEso we can see how it's being executed?
– 404
Nov 13 '18 at 14:44
Added explain analyze.
– Tom
Nov 13 '18 at 15:42
Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.
– 404
Nov 13 '18 at 16:15
The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.
– Tom
Nov 13 '18 at 16:30
OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.
– 404
Nov 13 '18 at 16:33
|
show 2 more comments
Can you post anEXPLAIN ANALYZEso we can see how it's being executed?
– 404
Nov 13 '18 at 14:44
Added explain analyze.
– Tom
Nov 13 '18 at 15:42
Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.
– 404
Nov 13 '18 at 16:15
The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.
– Tom
Nov 13 '18 at 16:30
OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.
– 404
Nov 13 '18 at 16:33
Can you post an
EXPLAIN ANALYZE so we can see how it's being executed?– 404
Nov 13 '18 at 14:44
Can you post an
EXPLAIN ANALYZE so we can see how it's being executed?– 404
Nov 13 '18 at 14:44
Added explain analyze.
– Tom
Nov 13 '18 at 15:42
Added explain analyze.
– Tom
Nov 13 '18 at 15:42
Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.
– 404
Nov 13 '18 at 16:15
Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.
– 404
Nov 13 '18 at 16:15
The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.
– Tom
Nov 13 '18 at 16:30
The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.
– Tom
Nov 13 '18 at 16:30
OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.
– 404
Nov 13 '18 at 16:33
OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.
– 404
Nov 13 '18 at 16:33
|
show 2 more comments
It appears the worst of the performance is happening in the JOIN (as per the query in your answer, not in your question). Ideally you wouldn't be joining on a subquery when it returns a lot of results, but I don't see how you can avoid it given your criteria.
So this is my suggestion:
- The subquery results are put into a temp table
- The temp table is indexed
- The join is performed on the temp table
- Encapsulate all this in a function
Now I generally hate doing this as I don't like creating temp tables, but sometimes it really does give you the best performance for something that can't be done any other way. (Not saying it can't be done another way, but I can't think of a better performant way.)
So something like this:
CREATE OR REPLACE FUNCTION schema.my_function()
RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS
$BODY$
BEGIN
CREATE TEMP TABLE fm ON COMMIT DROP AS
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min;
CREATE INDEX ON fm (five_min);
RETURN time_bucket('5 minutes', ts), avg(value)
from schema.table
left join fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;
END
$BODY$
LANGUAGE plpgsql;
Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.
I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.
Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow
– Tom
Nov 13 '18 at 17:20
Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.
– 404
Nov 13 '18 at 17:25
The temp table & index creations don't take long at all, it is really just the bare query that takes longer.
– Tom
Nov 13 '18 at 17:52
add a comment |
It appears the worst of the performance is happening in the JOIN (as per the query in your answer, not in your question). Ideally you wouldn't be joining on a subquery when it returns a lot of results, but I don't see how you can avoid it given your criteria.
So this is my suggestion:
- The subquery results are put into a temp table
- The temp table is indexed
- The join is performed on the temp table
- Encapsulate all this in a function
Now I generally hate doing this as I don't like creating temp tables, but sometimes it really does give you the best performance for something that can't be done any other way. (Not saying it can't be done another way, but I can't think of a better performant way.)
So something like this:
CREATE OR REPLACE FUNCTION schema.my_function()
RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS
$BODY$
BEGIN
CREATE TEMP TABLE fm ON COMMIT DROP AS
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min;
CREATE INDEX ON fm (five_min);
RETURN time_bucket('5 minutes', ts), avg(value)
from schema.table
left join fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;
END
$BODY$
LANGUAGE plpgsql;
Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.
I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.
Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow
– Tom
Nov 13 '18 at 17:20
Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.
– 404
Nov 13 '18 at 17:25
The temp table & index creations don't take long at all, it is really just the bare query that takes longer.
– Tom
Nov 13 '18 at 17:52
add a comment |
It appears the worst of the performance is happening in the JOIN (as per the query in your answer, not in your question). Ideally you wouldn't be joining on a subquery when it returns a lot of results, but I don't see how you can avoid it given your criteria.
So this is my suggestion:
- The subquery results are put into a temp table
- The temp table is indexed
- The join is performed on the temp table
- Encapsulate all this in a function
Now I generally hate doing this as I don't like creating temp tables, but sometimes it really does give you the best performance for something that can't be done any other way. (Not saying it can't be done another way, but I can't think of a better performant way.)
So something like this:
CREATE OR REPLACE FUNCTION schema.my_function()
RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS
$BODY$
BEGIN
CREATE TEMP TABLE fm ON COMMIT DROP AS
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min;
CREATE INDEX ON fm (five_min);
RETURN time_bucket('5 minutes', ts), avg(value)
from schema.table
left join fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;
END
$BODY$
LANGUAGE plpgsql;
Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.
I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.
It appears the worst of the performance is happening in the JOIN (as per the query in your answer, not in your question). Ideally you wouldn't be joining on a subquery when it returns a lot of results, but I don't see how you can avoid it given your criteria.
So this is my suggestion:
- The subquery results are put into a temp table
- The temp table is indexed
- The join is performed on the temp table
- Encapsulate all this in a function
Now I generally hate doing this as I don't like creating temp tables, but sometimes it really does give you the best performance for something that can't be done any other way. (Not saying it can't be done another way, but I can't think of a better performant way.)
So something like this:
CREATE OR REPLACE FUNCTION schema.my_function()
RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS
$BODY$
BEGIN
CREATE TEMP TABLE fm ON COMMIT DROP AS
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min;
CREATE INDEX ON fm (five_min);
RETURN time_bucket('5 minutes', ts), avg(value)
from schema.table
left join fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;
END
$BODY$
LANGUAGE plpgsql;
Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.
I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.
answered Nov 13 '18 at 16:48
404404
3,0601626
3,0601626
Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow
– Tom
Nov 13 '18 at 17:20
Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.
– 404
Nov 13 '18 at 17:25
The temp table & index creations don't take long at all, it is really just the bare query that takes longer.
– Tom
Nov 13 '18 at 17:52
add a comment |
Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow
– Tom
Nov 13 '18 at 17:20
Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.
– 404
Nov 13 '18 at 17:25
The temp table & index creations don't take long at all, it is really just the bare query that takes longer.
– Tom
Nov 13 '18 at 17:52
Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow
– Tom
Nov 13 '18 at 17:20
Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow
– Tom
Nov 13 '18 at 17:20
Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.
– 404
Nov 13 '18 at 17:25
Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.
– 404
Nov 13 '18 at 17:25
The temp table & index creations don't take long at all, it is really just the bare query that takes longer.
– Tom
Nov 13 '18 at 17:52
The temp table & index creations don't take long at all, it is really just the bare query that takes longer.
– Tom
Nov 13 '18 at 17:52
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53279732%2foptimize-spike-removal-using-postgresql%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
2
Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?
– 404
Nov 13 '18 at 12:23