Optimize spike removal using PostgreSQL












0















I want to remove spikes directly from my data stored in a PostgreSQL-DB with TimescaleDB.



My data is stored as values with an interval of 1 second, I want to get 5 minute averages calculated without spikes.



I determine spikes using standard deviation and excluding all data that is more than a fixed zscore away.



So in a first step I get all data relevant for my analysis (data_filtered), then I calculate average and standard deviation for each 5-minute chunk (avg_and_stddev_per_interval), then I join the initial data (data_filtered) with the calculated avg and stddev, exclude all values not meeting my criteria and finally calculate the final 5-minute average without spikes.



with data_filtered as (
select ts, value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 10:00:00'
and ts < '2018-11-13 10:00:00'
),
avg_and_stddev_per_interval as (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as avg_value,
stddev(value) as stddev_value,
from data_filtered
group by five_min
)
select
time_bucket('5 minutes', ts) as tb,
avg(value) as value,
from data_filtered
left join avg_and_stddev_per_interval
on data_filtered.ts >= avg_and_stddev_per_interval.five_min
and data_filtered.ts < avg_and_stddev_per_interval.five_min + interval '5 minutes'
where abs((value-avg_value)/stddev_value) < 1
group by tb;


It all works well, but it is incredibly slow. Requesting the full data without any grouping (select * from data_filtered) and calculating my criteria locally is much faster. I want to reduce the data volume, however, so this approach is not possible in this case.



Is there any way to accelerate my query?










share|improve this question


















  • 2





    Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?

    – 404
    Nov 13 '18 at 12:23


















0















I want to remove spikes directly from my data stored in a PostgreSQL-DB with TimescaleDB.



My data is stored as values with an interval of 1 second, I want to get 5 minute averages calculated without spikes.



I determine spikes using standard deviation and excluding all data that is more than a fixed zscore away.



So in a first step I get all data relevant for my analysis (data_filtered), then I calculate average and standard deviation for each 5-minute chunk (avg_and_stddev_per_interval), then I join the initial data (data_filtered) with the calculated avg and stddev, exclude all values not meeting my criteria and finally calculate the final 5-minute average without spikes.



with data_filtered as (
select ts, value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 10:00:00'
and ts < '2018-11-13 10:00:00'
),
avg_and_stddev_per_interval as (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as avg_value,
stddev(value) as stddev_value,
from data_filtered
group by five_min
)
select
time_bucket('5 minutes', ts) as tb,
avg(value) as value,
from data_filtered
left join avg_and_stddev_per_interval
on data_filtered.ts >= avg_and_stddev_per_interval.five_min
and data_filtered.ts < avg_and_stddev_per_interval.five_min + interval '5 minutes'
where abs((value-avg_value)/stddev_value) < 1
group by tb;


It all works well, but it is incredibly slow. Requesting the full data without any grouping (select * from data_filtered) and calculating my criteria locally is much faster. I want to reduce the data volume, however, so this approach is not possible in this case.



Is there any way to accelerate my query?










share|improve this question


















  • 2





    Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?

    – 404
    Nov 13 '18 at 12:23
















0












0








0








I want to remove spikes directly from my data stored in a PostgreSQL-DB with TimescaleDB.



My data is stored as values with an interval of 1 second, I want to get 5 minute averages calculated without spikes.



I determine spikes using standard deviation and excluding all data that is more than a fixed zscore away.



So in a first step I get all data relevant for my analysis (data_filtered), then I calculate average and standard deviation for each 5-minute chunk (avg_and_stddev_per_interval), then I join the initial data (data_filtered) with the calculated avg and stddev, exclude all values not meeting my criteria and finally calculate the final 5-minute average without spikes.



with data_filtered as (
select ts, value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 10:00:00'
and ts < '2018-11-13 10:00:00'
),
avg_and_stddev_per_interval as (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as avg_value,
stddev(value) as stddev_value,
from data_filtered
group by five_min
)
select
time_bucket('5 minutes', ts) as tb,
avg(value) as value,
from data_filtered
left join avg_and_stddev_per_interval
on data_filtered.ts >= avg_and_stddev_per_interval.five_min
and data_filtered.ts < avg_and_stddev_per_interval.five_min + interval '5 minutes'
where abs((value-avg_value)/stddev_value) < 1
group by tb;


It all works well, but it is incredibly slow. Requesting the full data without any grouping (select * from data_filtered) and calculating my criteria locally is much faster. I want to reduce the data volume, however, so this approach is not possible in this case.



Is there any way to accelerate my query?










share|improve this question














I want to remove spikes directly from my data stored in a PostgreSQL-DB with TimescaleDB.



My data is stored as values with an interval of 1 second, I want to get 5 minute averages calculated without spikes.



I determine spikes using standard deviation and excluding all data that is more than a fixed zscore away.



So in a first step I get all data relevant for my analysis (data_filtered), then I calculate average and standard deviation for each 5-minute chunk (avg_and_stddev_per_interval), then I join the initial data (data_filtered) with the calculated avg and stddev, exclude all values not meeting my criteria and finally calculate the final 5-minute average without spikes.



with data_filtered as (
select ts, value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 10:00:00'
and ts < '2018-11-13 10:00:00'
),
avg_and_stddev_per_interval as (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as avg_value,
stddev(value) as stddev_value,
from data_filtered
group by five_min
)
select
time_bucket('5 minutes', ts) as tb,
avg(value) as value,
from data_filtered
left join avg_and_stddev_per_interval
on data_filtered.ts >= avg_and_stddev_per_interval.five_min
and data_filtered.ts < avg_and_stddev_per_interval.five_min + interval '5 minutes'
where abs((value-avg_value)/stddev_value) < 1
group by tb;


It all works well, but it is incredibly slow. Requesting the full data without any grouping (select * from data_filtered) and calculating my criteria locally is much faster. I want to reduce the data volume, however, so this approach is not possible in this case.



Is there any way to accelerate my query?







postgresql optimization timescaledb






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 13 '18 at 11:11









TomTom

265




265








  • 2





    Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?

    – 404
    Nov 13 '18 at 12:23
















  • 2





    Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?

    – 404
    Nov 13 '18 at 12:23










2




2





Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?

– 404
Nov 13 '18 at 12:23







Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?

– 404
Nov 13 '18 at 12:23














3 Answers
3






active

oldest

votes


















0














The simplest way is to replace the CTE parts by (temp) views. This will allow the optimiser to shuffle and re-assemble the query parts.





CREATE TEMP VIEW data_filtered as
SELECT ts, value
FROM schema.table
WHERE some_criteria = 42
AND ts >= '2018-11-12 10:00:00'
AND ts < '2018-11-13 10:00:00'
;

CREATE TEMP VIEW avg_and_stddev_per_interval as
SELECT time_bucket('5 minutes', ts) as five_min
, avg(value) as avg_value
, stddev(value) as stddev_value
FROM data_filtered
GROUP BY 1
;

SELECT
time_bucket('5 minutes', ts) as tb
, avg(value) as value
FROM data_filtered df
LEFT JOIN avg_and_stddev_per_interval av
ON df.ts >= av.five_min
AND df.ts < av.five_min + interval '5 minutes'
WHERE abs((value-avg_value)/stddev_value) < 1
GROUP BY 1
;





share|improve this answer
























  • This is exactly as fast/slow as the solution without any CTEs/temp views.

    – Tom
    Nov 13 '18 at 15:36











  • In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause WHERE abs((value-avg_value)/stddev_value) < 1 could be avoided.

    – joop
    Nov 13 '18 at 15:48













  • This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.

    – Tom
    Nov 13 '18 at 16:12











  • I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.

    – joop
    Nov 13 '18 at 17:40





















0














eurotrash' comment lead to a much faster code as follows:



select 
time_bucket('5 minutes', ts) as tb, avg(value) as value
from schema.table
left join (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value,
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min
) as fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;


Here I got rid of any CTEs which were there only for readability anyhow.



This is still 8 times slower than just requesting averaged values without spike removal, though.



Explain analyze:



Sort  (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)
Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))
Sort Method: quicksort Memory: 65kB
-> HashAggregate (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)
Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))
-> Nested Loop (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)
Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))
Rows Removed by Join Filter: 24770914
-> Append (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)
-> Seq Scan on data (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 949135
-> Materialize (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)
-> Subquery Scan on fm (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)
-> Finalize GroupAggregate (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)
Group Key: (time_bucket('00:05:00'::interval, data_1.ts))
-> Sort (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)
Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))
Sort Method: quicksort Memory: 251kB
-> Gather (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)
Group Key: time_bucket('00:05:00'::interval, data_1.ts)
-> Result (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)
-> Append (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)
-> Parallel Seq Scan on data data_1 (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1 (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 316378
Planning time: 17.704 ms
Execution time: 24093.223 ms





share|improve this answer


























  • Can you post an EXPLAIN ANALYZE so we can see how it's being executed?

    – 404
    Nov 13 '18 at 14:44













  • Added explain analyze.

    – Tom
    Nov 13 '18 at 15:42











  • Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.

    – 404
    Nov 13 '18 at 16:15











  • The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.

    – Tom
    Nov 13 '18 at 16:30













  • OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.

    – 404
    Nov 13 '18 at 16:33



















0














It appears the worst of the performance is happening in the JOIN (as per the query in your answer, not in your question). Ideally you wouldn't be joining on a subquery when it returns a lot of results, but I don't see how you can avoid it given your criteria.



So this is my suggestion:




  1. The subquery results are put into a temp table

  2. The temp table is indexed

  3. The join is performed on the temp table

  4. Encapsulate all this in a function


Now I generally hate doing this as I don't like creating temp tables, but sometimes it really does give you the best performance for something that can't be done any other way. (Not saying it can't be done another way, but I can't think of a better performant way.)



So something like this:



CREATE OR REPLACE FUNCTION schema.my_function()
RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS
$BODY$
BEGIN
CREATE TEMP TABLE fm ON COMMIT DROP AS
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min;

CREATE INDEX ON fm (five_min);

RETURN time_bucket('5 minutes', ts), avg(value)
from schema.table
left join fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;
END
$BODY$
LANGUAGE plpgsql;


Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.



I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.






share|improve this answer
























  • Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow

    – Tom
    Nov 13 '18 at 17:20













  • Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.

    – 404
    Nov 13 '18 at 17:25











  • The temp table & index creations don't take long at all, it is really just the bare query that takes longer.

    – Tom
    Nov 13 '18 at 17:52













Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53279732%2foptimize-spike-removal-using-postgresql%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























3 Answers
3






active

oldest

votes








3 Answers
3






active

oldest

votes









active

oldest

votes






active

oldest

votes









0














The simplest way is to replace the CTE parts by (temp) views. This will allow the optimiser to shuffle and re-assemble the query parts.





CREATE TEMP VIEW data_filtered as
SELECT ts, value
FROM schema.table
WHERE some_criteria = 42
AND ts >= '2018-11-12 10:00:00'
AND ts < '2018-11-13 10:00:00'
;

CREATE TEMP VIEW avg_and_stddev_per_interval as
SELECT time_bucket('5 minutes', ts) as five_min
, avg(value) as avg_value
, stddev(value) as stddev_value
FROM data_filtered
GROUP BY 1
;

SELECT
time_bucket('5 minutes', ts) as tb
, avg(value) as value
FROM data_filtered df
LEFT JOIN avg_and_stddev_per_interval av
ON df.ts >= av.five_min
AND df.ts < av.five_min + interval '5 minutes'
WHERE abs((value-avg_value)/stddev_value) < 1
GROUP BY 1
;





share|improve this answer
























  • This is exactly as fast/slow as the solution without any CTEs/temp views.

    – Tom
    Nov 13 '18 at 15:36











  • In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause WHERE abs((value-avg_value)/stddev_value) < 1 could be avoided.

    – joop
    Nov 13 '18 at 15:48













  • This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.

    – Tom
    Nov 13 '18 at 16:12











  • I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.

    – joop
    Nov 13 '18 at 17:40


















0














The simplest way is to replace the CTE parts by (temp) views. This will allow the optimiser to shuffle and re-assemble the query parts.





CREATE TEMP VIEW data_filtered as
SELECT ts, value
FROM schema.table
WHERE some_criteria = 42
AND ts >= '2018-11-12 10:00:00'
AND ts < '2018-11-13 10:00:00'
;

CREATE TEMP VIEW avg_and_stddev_per_interval as
SELECT time_bucket('5 minutes', ts) as five_min
, avg(value) as avg_value
, stddev(value) as stddev_value
FROM data_filtered
GROUP BY 1
;

SELECT
time_bucket('5 minutes', ts) as tb
, avg(value) as value
FROM data_filtered df
LEFT JOIN avg_and_stddev_per_interval av
ON df.ts >= av.five_min
AND df.ts < av.five_min + interval '5 minutes'
WHERE abs((value-avg_value)/stddev_value) < 1
GROUP BY 1
;





share|improve this answer
























  • This is exactly as fast/slow as the solution without any CTEs/temp views.

    – Tom
    Nov 13 '18 at 15:36











  • In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause WHERE abs((value-avg_value)/stddev_value) < 1 could be avoided.

    – joop
    Nov 13 '18 at 15:48













  • This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.

    – Tom
    Nov 13 '18 at 16:12











  • I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.

    – joop
    Nov 13 '18 at 17:40
















0












0








0







The simplest way is to replace the CTE parts by (temp) views. This will allow the optimiser to shuffle and re-assemble the query parts.





CREATE TEMP VIEW data_filtered as
SELECT ts, value
FROM schema.table
WHERE some_criteria = 42
AND ts >= '2018-11-12 10:00:00'
AND ts < '2018-11-13 10:00:00'
;

CREATE TEMP VIEW avg_and_stddev_per_interval as
SELECT time_bucket('5 minutes', ts) as five_min
, avg(value) as avg_value
, stddev(value) as stddev_value
FROM data_filtered
GROUP BY 1
;

SELECT
time_bucket('5 minutes', ts) as tb
, avg(value) as value
FROM data_filtered df
LEFT JOIN avg_and_stddev_per_interval av
ON df.ts >= av.five_min
AND df.ts < av.five_min + interval '5 minutes'
WHERE abs((value-avg_value)/stddev_value) < 1
GROUP BY 1
;





share|improve this answer













The simplest way is to replace the CTE parts by (temp) views. This will allow the optimiser to shuffle and re-assemble the query parts.





CREATE TEMP VIEW data_filtered as
SELECT ts, value
FROM schema.table
WHERE some_criteria = 42
AND ts >= '2018-11-12 10:00:00'
AND ts < '2018-11-13 10:00:00'
;

CREATE TEMP VIEW avg_and_stddev_per_interval as
SELECT time_bucket('5 minutes', ts) as five_min
, avg(value) as avg_value
, stddev(value) as stddev_value
FROM data_filtered
GROUP BY 1
;

SELECT
time_bucket('5 minutes', ts) as tb
, avg(value) as value
FROM data_filtered df
LEFT JOIN avg_and_stddev_per_interval av
ON df.ts >= av.five_min
AND df.ts < av.five_min + interval '5 minutes'
WHERE abs((value-avg_value)/stddev_value) < 1
GROUP BY 1
;






share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 13 '18 at 14:44









joopjoop

3,2581818




3,2581818













  • This is exactly as fast/slow as the solution without any CTEs/temp views.

    – Tom
    Nov 13 '18 at 15:36











  • In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause WHERE abs((value-avg_value)/stddev_value) < 1 could be avoided.

    – joop
    Nov 13 '18 at 15:48













  • This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.

    – Tom
    Nov 13 '18 at 16:12











  • I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.

    – joop
    Nov 13 '18 at 17:40





















  • This is exactly as fast/slow as the solution without any CTEs/temp views.

    – Tom
    Nov 13 '18 at 15:36











  • In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause WHERE abs((value-avg_value)/stddev_value) < 1 could be avoided.

    – joop
    Nov 13 '18 at 15:48













  • This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.

    – Tom
    Nov 13 '18 at 16:12











  • I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.

    – joop
    Nov 13 '18 at 17:40



















This is exactly as fast/slow as the solution without any CTEs/temp views.

– Tom
Nov 13 '18 at 15:36





This is exactly as fast/slow as the solution without any CTEs/temp views.

– Tom
Nov 13 '18 at 15:36













In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause WHERE abs((value-avg_value)/stddev_value) < 1 could be avoided.

– joop
Nov 13 '18 at 15:48







In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause WHERE abs((value-avg_value)/stddev_value) < 1 could be avoided.

– joop
Nov 13 '18 at 15:48















This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.

– Tom
Nov 13 '18 at 16:12





This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.

– Tom
Nov 13 '18 at 16:12













I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.

– joop
Nov 13 '18 at 17:40







I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.

– joop
Nov 13 '18 at 17:40















0














eurotrash' comment lead to a much faster code as follows:



select 
time_bucket('5 minutes', ts) as tb, avg(value) as value
from schema.table
left join (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value,
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min
) as fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;


Here I got rid of any CTEs which were there only for readability anyhow.



This is still 8 times slower than just requesting averaged values without spike removal, though.



Explain analyze:



Sort  (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)
Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))
Sort Method: quicksort Memory: 65kB
-> HashAggregate (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)
Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))
-> Nested Loop (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)
Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))
Rows Removed by Join Filter: 24770914
-> Append (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)
-> Seq Scan on data (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 949135
-> Materialize (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)
-> Subquery Scan on fm (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)
-> Finalize GroupAggregate (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)
Group Key: (time_bucket('00:05:00'::interval, data_1.ts))
-> Sort (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)
Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))
Sort Method: quicksort Memory: 251kB
-> Gather (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)
Group Key: time_bucket('00:05:00'::interval, data_1.ts)
-> Result (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)
-> Append (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)
-> Parallel Seq Scan on data data_1 (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1 (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 316378
Planning time: 17.704 ms
Execution time: 24093.223 ms





share|improve this answer


























  • Can you post an EXPLAIN ANALYZE so we can see how it's being executed?

    – 404
    Nov 13 '18 at 14:44













  • Added explain analyze.

    – Tom
    Nov 13 '18 at 15:42











  • Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.

    – 404
    Nov 13 '18 at 16:15











  • The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.

    – Tom
    Nov 13 '18 at 16:30













  • OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.

    – 404
    Nov 13 '18 at 16:33
















0














eurotrash' comment lead to a much faster code as follows:



select 
time_bucket('5 minutes', ts) as tb, avg(value) as value
from schema.table
left join (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value,
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min
) as fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;


Here I got rid of any CTEs which were there only for readability anyhow.



This is still 8 times slower than just requesting averaged values without spike removal, though.



Explain analyze:



Sort  (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)
Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))
Sort Method: quicksort Memory: 65kB
-> HashAggregate (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)
Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))
-> Nested Loop (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)
Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))
Rows Removed by Join Filter: 24770914
-> Append (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)
-> Seq Scan on data (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 949135
-> Materialize (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)
-> Subquery Scan on fm (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)
-> Finalize GroupAggregate (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)
Group Key: (time_bucket('00:05:00'::interval, data_1.ts))
-> Sort (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)
Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))
Sort Method: quicksort Memory: 251kB
-> Gather (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)
Group Key: time_bucket('00:05:00'::interval, data_1.ts)
-> Result (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)
-> Append (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)
-> Parallel Seq Scan on data data_1 (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1 (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 316378
Planning time: 17.704 ms
Execution time: 24093.223 ms





share|improve this answer


























  • Can you post an EXPLAIN ANALYZE so we can see how it's being executed?

    – 404
    Nov 13 '18 at 14:44













  • Added explain analyze.

    – Tom
    Nov 13 '18 at 15:42











  • Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.

    – 404
    Nov 13 '18 at 16:15











  • The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.

    – Tom
    Nov 13 '18 at 16:30













  • OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.

    – 404
    Nov 13 '18 at 16:33














0












0








0







eurotrash' comment lead to a much faster code as follows:



select 
time_bucket('5 minutes', ts) as tb, avg(value) as value
from schema.table
left join (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value,
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min
) as fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;


Here I got rid of any CTEs which were there only for readability anyhow.



This is still 8 times slower than just requesting averaged values without spike removal, though.



Explain analyze:



Sort  (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)
Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))
Sort Method: quicksort Memory: 65kB
-> HashAggregate (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)
Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))
-> Nested Loop (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)
Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))
Rows Removed by Join Filter: 24770914
-> Append (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)
-> Seq Scan on data (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 949135
-> Materialize (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)
-> Subquery Scan on fm (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)
-> Finalize GroupAggregate (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)
Group Key: (time_bucket('00:05:00'::interval, data_1.ts))
-> Sort (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)
Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))
Sort Method: quicksort Memory: 251kB
-> Gather (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)
Group Key: time_bucket('00:05:00'::interval, data_1.ts)
-> Result (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)
-> Append (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)
-> Parallel Seq Scan on data data_1 (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1 (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 316378
Planning time: 17.704 ms
Execution time: 24093.223 ms





share|improve this answer















eurotrash' comment lead to a much faster code as follows:



select 
time_bucket('5 minutes', ts) as tb, avg(value) as value
from schema.table
left join (
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value,
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min
) as fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;


Here I got rid of any CTEs which were there only for readability anyhow.



This is still 8 times slower than just requesting averaged values without spike removal, though.



Explain analyze:



Sort  (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)
Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))
Sort Method: quicksort Memory: 65kB
-> HashAggregate (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)
Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))
-> Nested Loop (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)
Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))
Rows Removed by Join Filter: 24770914
-> Append (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)
-> Seq Scan on data (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 949135
-> Materialize (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)
-> Subquery Scan on fm (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)
-> Finalize GroupAggregate (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)
Group Key: (time_bucket('00:05:00'::interval, data_1.ts))
-> Sort (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)
Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))
Sort Method: quicksort Memory: 251kB
-> Gather (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)
Group Key: time_bucket('00:05:00'::interval, data_1.ts)
-> Result (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)
-> Append (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)
-> Parallel Seq Scan on data data_1 (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)
Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))
-> Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1 (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)
Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))
Filter: (node_id = 8)
Rows Removed by Filter: 316378
Planning time: 17.704 ms
Execution time: 24093.223 ms






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 13 '18 at 15:27

























answered Nov 13 '18 at 13:29









TomTom

265




265













  • Can you post an EXPLAIN ANALYZE so we can see how it's being executed?

    – 404
    Nov 13 '18 at 14:44













  • Added explain analyze.

    – Tom
    Nov 13 '18 at 15:42











  • Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.

    – 404
    Nov 13 '18 at 16:15











  • The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.

    – Tom
    Nov 13 '18 at 16:30













  • OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.

    – 404
    Nov 13 '18 at 16:33



















  • Can you post an EXPLAIN ANALYZE so we can see how it's being executed?

    – 404
    Nov 13 '18 at 14:44













  • Added explain analyze.

    – Tom
    Nov 13 '18 at 15:42











  • Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.

    – 404
    Nov 13 '18 at 16:15











  • The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.

    – Tom
    Nov 13 '18 at 16:30













  • OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.

    – 404
    Nov 13 '18 at 16:33

















Can you post an EXPLAIN ANALYZE so we can see how it's being executed?

– 404
Nov 13 '18 at 14:44







Can you post an EXPLAIN ANALYZE so we can see how it's being executed?

– 404
Nov 13 '18 at 14:44















Added explain analyze.

– Tom
Nov 13 '18 at 15:42





Added explain analyze.

– Tom
Nov 13 '18 at 15:42













Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.

– 404
Nov 13 '18 at 16:15





Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.

– 404
Nov 13 '18 at 16:15













The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.

– Tom
Nov 13 '18 at 16:30







The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.

– Tom
Nov 13 '18 at 16:30















OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.

– 404
Nov 13 '18 at 16:33





OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.

– 404
Nov 13 '18 at 16:33











0














It appears the worst of the performance is happening in the JOIN (as per the query in your answer, not in your question). Ideally you wouldn't be joining on a subquery when it returns a lot of results, but I don't see how you can avoid it given your criteria.



So this is my suggestion:




  1. The subquery results are put into a temp table

  2. The temp table is indexed

  3. The join is performed on the temp table

  4. Encapsulate all this in a function


Now I generally hate doing this as I don't like creating temp tables, but sometimes it really does give you the best performance for something that can't be done any other way. (Not saying it can't be done another way, but I can't think of a better performant way.)



So something like this:



CREATE OR REPLACE FUNCTION schema.my_function()
RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS
$BODY$
BEGIN
CREATE TEMP TABLE fm ON COMMIT DROP AS
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min;

CREATE INDEX ON fm (five_min);

RETURN time_bucket('5 minutes', ts), avg(value)
from schema.table
left join fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;
END
$BODY$
LANGUAGE plpgsql;


Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.



I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.






share|improve this answer
























  • Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow

    – Tom
    Nov 13 '18 at 17:20













  • Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.

    – 404
    Nov 13 '18 at 17:25











  • The temp table & index creations don't take long at all, it is really just the bare query that takes longer.

    – Tom
    Nov 13 '18 at 17:52


















0














It appears the worst of the performance is happening in the JOIN (as per the query in your answer, not in your question). Ideally you wouldn't be joining on a subquery when it returns a lot of results, but I don't see how you can avoid it given your criteria.



So this is my suggestion:




  1. The subquery results are put into a temp table

  2. The temp table is indexed

  3. The join is performed on the temp table

  4. Encapsulate all this in a function


Now I generally hate doing this as I don't like creating temp tables, but sometimes it really does give you the best performance for something that can't be done any other way. (Not saying it can't be done another way, but I can't think of a better performant way.)



So something like this:



CREATE OR REPLACE FUNCTION schema.my_function()
RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS
$BODY$
BEGIN
CREATE TEMP TABLE fm ON COMMIT DROP AS
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min;

CREATE INDEX ON fm (five_min);

RETURN time_bucket('5 minutes', ts), avg(value)
from schema.table
left join fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;
END
$BODY$
LANGUAGE plpgsql;


Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.



I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.






share|improve this answer
























  • Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow

    – Tom
    Nov 13 '18 at 17:20













  • Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.

    – 404
    Nov 13 '18 at 17:25











  • The temp table & index creations don't take long at all, it is really just the bare query that takes longer.

    – Tom
    Nov 13 '18 at 17:52
















0












0








0







It appears the worst of the performance is happening in the JOIN (as per the query in your answer, not in your question). Ideally you wouldn't be joining on a subquery when it returns a lot of results, but I don't see how you can avoid it given your criteria.



So this is my suggestion:




  1. The subquery results are put into a temp table

  2. The temp table is indexed

  3. The join is performed on the temp table

  4. Encapsulate all this in a function


Now I generally hate doing this as I don't like creating temp tables, but sometimes it really does give you the best performance for something that can't be done any other way. (Not saying it can't be done another way, but I can't think of a better performant way.)



So something like this:



CREATE OR REPLACE FUNCTION schema.my_function()
RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS
$BODY$
BEGIN
CREATE TEMP TABLE fm ON COMMIT DROP AS
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min;

CREATE INDEX ON fm (five_min);

RETURN time_bucket('5 minutes', ts), avg(value)
from schema.table
left join fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;
END
$BODY$
LANGUAGE plpgsql;


Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.



I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.






share|improve this answer













It appears the worst of the performance is happening in the JOIN (as per the query in your answer, not in your question). Ideally you wouldn't be joining on a subquery when it returns a lot of results, but I don't see how you can avoid it given your criteria.



So this is my suggestion:




  1. The subquery results are put into a temp table

  2. The temp table is indexed

  3. The join is performed on the temp table

  4. Encapsulate all this in a function


Now I generally hate doing this as I don't like creating temp tables, but sometimes it really does give you the best performance for something that can't be done any other way. (Not saying it can't be done another way, but I can't think of a better performant way.)



So something like this:



CREATE OR REPLACE FUNCTION schema.my_function()
RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS
$BODY$
BEGIN
CREATE TEMP TABLE fm ON COMMIT DROP AS
select time_bucket('5 minutes', ts) as five_min,
avg(value) as value,
stddev(value) as stddev_value
from schema.table
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
group by five_min;

CREATE INDEX ON fm (five_min);

RETURN time_bucket('5 minutes', ts), avg(value)
from schema.table
left join fm
on ts >= fm.five_min
and ts < fm.five_min + interval '5 minutes'
where some_criteria = 42
and ts >= '2018-11-12 00:00:00'
and ts < '2018-11-13 00:00:00'
and abs((value-avg_value)/stddev_value) < 1
group by tb;
END
$BODY$
LANGUAGE plpgsql;


Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.



I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.







share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 13 '18 at 16:48









404404

3,0601626




3,0601626













  • Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow

    – Tom
    Nov 13 '18 at 17:20













  • Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.

    – 404
    Nov 13 '18 at 17:25











  • The temp table & index creations don't take long at all, it is really just the bare query that takes longer.

    – Tom
    Nov 13 '18 at 17:52





















  • Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow

    – Tom
    Nov 13 '18 at 17:20













  • Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.

    – 404
    Nov 13 '18 at 17:25











  • The temp table & index creations don't take long at all, it is really just the bare query that takes longer.

    – Tom
    Nov 13 '18 at 17:52



















Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow

– Tom
Nov 13 '18 at 17:20







Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow

– Tom
Nov 13 '18 at 17:20















Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.

– 404
Nov 13 '18 at 17:25





Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.

– 404
Nov 13 '18 at 17:25













The temp table & index creations don't take long at all, it is really just the bare query that takes longer.

– Tom
Nov 13 '18 at 17:52







The temp table & index creations don't take long at all, it is really just the bare query that takes longer.

– Tom
Nov 13 '18 at 17:52




















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53279732%2foptimize-spike-removal-using-postgresql%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

The Sandy Post

Danny Elfman

Pages that link to "Head v. Amoskeag Manufacturing Co."