Optimize spike removal using PostgreSQL

I want to remove spikes directly from my data stored in a PostgreSQL-DB with TimescaleDB.

My data is stored as values with an interval of 1 second, I want to get 5 minute averages calculated without spikes.

I determine spikes using standard deviation and excluding all data that is more than a fixed zscore away.

So in a first step I get all data relevant for my analysis (data_filtered), then I calculate average and standard deviation for each 5-minute chunk (avg_and_stddev_per_interval), then I join the initial data (data_filtered) with the calculated avg and stddev, exclude all values not meeting my criteria and finally calculate the final 5-minute average without spikes.

with data_filtered as (

    select ts, value

    from schema.table 

    where some_criteria = 42 

    and ts >= '2018-11-12 10:00:00'

    and ts < '2018-11-13 10:00:00'

), 

avg_and_stddev_per_interval as (

    select time_bucket('5 minutes', ts) as five_min,

    avg(value) as avg_value,

    stddev(value) as stddev_value,

    from data_filtered

    group by five_min   

)

select 

    time_bucket('5 minutes', ts) as tb,

    avg(value) as value,

from data_filtered

left join avg_and_stddev_per_interval 

    on data_filtered.ts >= avg_and_stddev_per_interval.five_min 

    and data_filtered.ts < avg_and_stddev_per_interval.five_min + interval '5 minutes'

    where abs((value-avg_value)/stddev_value) < 1 

    group by tb;

It all works well, but it is incredibly slow. Requesting the full data without any grouping (select * from data_filtered) and calculating my criteria locally is much faster. I want to reduce the data volume, however, so this approach is not possible in this case.

Is there any way to accelerate my query?

asked Nov 13 '18 at 11:11

Tom

265

2

Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?

– 404
Nov 13 '18 at 12:23

add a comment |

I want to remove spikes directly from my data stored in a PostgreSQL-DB with TimescaleDB.

My data is stored as values with an interval of 1 second, I want to get 5 minute averages calculated without spikes.

I determine spikes using standard deviation and excluding all data that is more than a fixed zscore away.

with data_filtered as (

    select ts, value

    from schema.table 

    where some_criteria = 42 

    and ts >= '2018-11-12 10:00:00'

    and ts < '2018-11-13 10:00:00'

), 

avg_and_stddev_per_interval as (

    select time_bucket('5 minutes', ts) as five_min,

    avg(value) as avg_value,

    stddev(value) as stddev_value,

    from data_filtered

    group by five_min   

)

select 

    time_bucket('5 minutes', ts) as tb,

    avg(value) as value,

from data_filtered

left join avg_and_stddev_per_interval 

    on data_filtered.ts >= avg_and_stddev_per_interval.five_min 

    and data_filtered.ts < avg_and_stddev_per_interval.five_min + interval '5 minutes'

    where abs((value-avg_value)/stddev_value) < 1 

    group by tb;

Is there any way to accelerate my query?

asked Nov 13 '18 at 11:11

Tom

265

2

Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?

– 404
Nov 13 '18 at 12:23

add a comment |

I want to remove spikes directly from my data stored in a PostgreSQL-DB with TimescaleDB.

My data is stored as values with an interval of 1 second, I want to get 5 minute averages calculated without spikes.

I determine spikes using standard deviation and excluding all data that is more than a fixed zscore away.

with data_filtered as (

    select ts, value

    from schema.table 

    where some_criteria = 42 

    and ts >= '2018-11-12 10:00:00'

    and ts < '2018-11-13 10:00:00'

), 

avg_and_stddev_per_interval as (

    select time_bucket('5 minutes', ts) as five_min,

    avg(value) as avg_value,

    stddev(value) as stddev_value,

    from data_filtered

    group by five_min   

)

select 

    time_bucket('5 minutes', ts) as tb,

    avg(value) as value,

from data_filtered

left join avg_and_stddev_per_interval 

    on data_filtered.ts >= avg_and_stddev_per_interval.five_min 

    and data_filtered.ts < avg_and_stddev_per_interval.five_min + interval '5 minutes'

    where abs((value-avg_value)/stddev_value) < 1 

    group by tb;

Is there any way to accelerate my query?

asked Nov 13 '18 at 11:11

Tom

265

I want to remove spikes directly from my data stored in a PostgreSQL-DB with TimescaleDB.

My data is stored as values with an interval of 1 second, I want to get 5 minute averages calculated without spikes.

I determine spikes using standard deviation and excluding all data that is more than a fixed zscore away.

with data_filtered as (

    select ts, value

    from schema.table 

    where some_criteria = 42 

    and ts >= '2018-11-12 10:00:00'

    and ts < '2018-11-13 10:00:00'

), 

avg_and_stddev_per_interval as (

    select time_bucket('5 minutes', ts) as five_min,

    avg(value) as avg_value,

    stddev(value) as stddev_value,

    from data_filtered

    group by five_min   

)

select 

    time_bucket('5 minutes', ts) as tb,

    avg(value) as value,

from data_filtered

left join avg_and_stddev_per_interval 

    on data_filtered.ts >= avg_and_stddev_per_interval.five_min 

    and data_filtered.ts < avg_and_stddev_per_interval.five_min + interval '5 minutes'

    where abs((value-avg_value)/stddev_value) < 1 

    group by tb;

Is there any way to accelerate my query?

postgresql optimization timescaledb

asked Nov 13 '18 at 11:11

Tom

265

asked Nov 13 '18 at 11:11

Tom

265

asked Nov 13 '18 at 11:11

Tom

265

asked Nov 13 '18 at 11:11

Tom

265

asked Nov 13 '18 at 11:11

Tom

265

2

Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?

– 404
Nov 13 '18 at 12:23

add a comment |

2

Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?

– 404
Nov 13 '18 at 12:23

Basically you need to get rid of the CTEs. Once you select data into a CTE, from that point there are no more indexes when selecting on those results. Can you not perform the CTE calculations as part of the "main" query on the actual table? E.g. do you really need "data_filtered", can't you just put those conditions in the WHERE clause of the main query?

– 404
Nov 13 '18 at 12:23

add a comment |

3 Answers
3

active

oldest

votes

The simplest way is to replace the CTE parts by (temp) views. This will allow the optimiser to shuffle and re-assemble the query parts.

CREATE TEMP VIEW data_filtered as

    SELECT ts, value

    FROM schema.table

    WHERE some_criteria = 42

    AND ts >= '2018-11-12 10:00:00'

    AND ts < '2018-11-13 10:00:00'

        ;



CREATE TEMP VIEW avg_and_stddev_per_interval as

    SELECT time_bucket('5 minutes', ts) as five_min

    , avg(value) as avg_value

    , stddev(value) as stddev_value

    FROM data_filtered

    GROUP BY 1

        ;



SELECT

    time_bucket('5 minutes', ts) as tb

    , avg(value) as value

FROM data_filtered df

LEFT JOIN avg_and_stddev_per_interval  av

    ON df.ts >= av.five_min

    AND df.ts < av.five_min + interval '5 minutes'

    WHERE abs((value-avg_value)/stddev_value) < 1

    GROUP BY 1

        ;

answered Nov 13 '18 at 14:44

joop

3,2581818

This is exactly as fast/slow as the solution without any CTEs/temp views.

– Tom
Nov 13 '18 at 15:36

In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause WHERE abs((value-avg_value)/stddev_value) < 1 could be avoided.

– joop
Nov 13 '18 at 15:48

This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.

– Tom
Nov 13 '18 at 16:12

I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.

– joop
Nov 13 '18 at 17:40

add a comment |

eurotrash' comment lead to a much faster code as follows:

select 

    time_bucket('5 minutes', ts) as tb, avg(value) as value

from schema.table   

left join (

    select time_bucket('5 minutes', ts) as five_min,

        avg(value) as value,

        stddev(value) as stddev_value,

        from schema.table

        where some_criteria = 42

        and ts >= '2018-11-12 00:00:00'

        and ts < '2018-11-13 00:00:00'

        group by five_min

    ) as fm

    on ts >= fm.five_min 

    and ts < fm.five_min + interval '5 minutes'         

where some_criteria = 42

    and ts >= '2018-11-12 00:00:00'

    and ts < '2018-11-13 00:00:00'

    and abs((value-avg_value)/stddev_value) < 1     

group by tb;

Here I got rid of any CTEs which were there only for readability anyhow.

This is still 8 times slower than just requesting averaged values without spike removal, though.

Explain analyze:

Sort  (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)

  Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))

  Sort Method: quicksort  Memory: 65kB

  ->  HashAggregate  (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)

        Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))

        ->  Nested Loop  (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)

              Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))

              Rows Removed by Join Filter: 24770914

              ->  Append  (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)

                    ->  Seq Scan on data  (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)

                          Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))

                    ->  Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk  (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)

                          Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))

                          Filter: (node_id = 8)

                          Rows Removed by Filter: 949135

              ->  Materialize  (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)

                    ->  Subquery Scan on fm  (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)

                          ->  Finalize GroupAggregate  (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)

                                Group Key: (time_bucket('00:05:00'::interval, data_1.ts))

                                ->  Sort  (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)

                                      Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))

                                      Sort Method: quicksort  Memory: 251kB

                                      ->  Gather  (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)

                                            Workers Planned: 2

                                            Workers Launched: 2

                                            ->  Partial HashAggregate  (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)

                                                  Group Key: time_bucket('00:05:00'::interval, data_1.ts)

                                                  ->  Result  (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)

                                                        ->  Append  (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)

                                                              ->  Parallel Seq Scan on data data_1  (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)

                                                                    Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))

                                                              ->  Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1  (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)

                                                                    Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))

                                                                    Filter: (node_id = 8)

                                                                    Rows Removed by Filter: 316378

Planning time: 17.704 ms

Execution time: 24093.223 ms

edited Nov 13 '18 at 15:27

answered Nov 13 '18 at 13:29

Tom

265

Can you post an EXPLAIN ANALYZE so we can see how it's being executed?

– 404
Nov 13 '18 at 14:44

Added explain analyze.

– Tom
Nov 13 '18 at 15:42

Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.

– 404
Nov 13 '18 at 16:15

The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.

– Tom
Nov 13 '18 at 16:30

OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.

– 404
Nov 13 '18 at 16:33

|
show 2 more comments

It appears the worst of the performance is happening in the JOIN (as per the query in your answer, not in your question). Ideally you wouldn't be joining on a subquery when it returns a lot of results, but I don't see how you can avoid it given your criteria.

So this is my suggestion:

The subquery results are put into a temp table

The temp table is indexed

The join is performed on the temp table

Encapsulate all this in a function

Now I generally hate doing this as I don't like creating temp tables, but sometimes it really does give you the best performance for something that can't be done any other way. (Not saying it can't be done another way, but I can't think of a better performant way.)

So something like this:

CREATE OR REPLACE FUNCTION schema.my_function()

    RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS

$BODY$

BEGIN

    CREATE TEMP TABLE fm ON COMMIT DROP AS

        select time_bucket('5 minutes', ts) as five_min,

            avg(value) as value,

            stddev(value) as stddev_value

        from schema.table

        where some_criteria = 42

        and ts >= '2018-11-12 00:00:00'

        and ts < '2018-11-13 00:00:00'

        group by five_min;



    CREATE INDEX ON fm (five_min);



    RETURN time_bucket('5 minutes', ts), avg(value)

    from schema.table

    left join fm

        on ts >= fm.five_min 

        and ts < fm.five_min + interval '5 minutes'

    where some_criteria = 42

    and ts >= '2018-11-12 00:00:00'

    and ts < '2018-11-13 00:00:00'

    and abs((value-avg_value)/stddev_value) < 1     

    group by tb;

END

$BODY$

    LANGUAGE plpgsql;

Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.

I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.

answered Nov 13 '18 at 16:48

404

3,0601626

Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow

– Tom
Nov 13 '18 at 17:20

Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.

– 404
Nov 13 '18 at 17:25

The temp table & index creations don't take long at all, it is really just the bare query that takes longer.

– Tom
Nov 13 '18 at 17:52

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53279732%2foptimize-spike-removal-using-postgresql%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

The simplest way is to replace the CTE parts by (temp) views. This will allow the optimiser to shuffle and re-assemble the query parts.

CREATE TEMP VIEW data_filtered as

    SELECT ts, value

    FROM schema.table

    WHERE some_criteria = 42

    AND ts >= '2018-11-12 10:00:00'

    AND ts < '2018-11-13 10:00:00'

        ;



CREATE TEMP VIEW avg_and_stddev_per_interval as

    SELECT time_bucket('5 minutes', ts) as five_min

    , avg(value) as avg_value

    , stddev(value) as stddev_value

    FROM data_filtered

    GROUP BY 1

        ;



SELECT

    time_bucket('5 minutes', ts) as tb

    , avg(value) as value

FROM data_filtered df

LEFT JOIN avg_and_stddev_per_interval  av

    ON df.ts >= av.five_min

    AND df.ts < av.five_min + interval '5 minutes'

    WHERE abs((value-avg_value)/stddev_value) < 1

    GROUP BY 1

        ;

answered Nov 13 '18 at 14:44

joop

3,2581818

This is exactly as fast/slow as the solution without any CTEs/temp views.

– Tom
Nov 13 '18 at 15:36

In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause WHERE abs((value-avg_value)/stddev_value) < 1 could be avoided.

– joop
Nov 13 '18 at 15:48

This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.

– Tom
Nov 13 '18 at 16:12

I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.

– joop
Nov 13 '18 at 17:40

add a comment |

The simplest way is to replace the CTE parts by (temp) views. This will allow the optimiser to shuffle and re-assemble the query parts.

CREATE TEMP VIEW data_filtered as

    SELECT ts, value

    FROM schema.table

    WHERE some_criteria = 42

    AND ts >= '2018-11-12 10:00:00'

    AND ts < '2018-11-13 10:00:00'

        ;



CREATE TEMP VIEW avg_and_stddev_per_interval as

    SELECT time_bucket('5 minutes', ts) as five_min

    , avg(value) as avg_value

    , stddev(value) as stddev_value

    FROM data_filtered

    GROUP BY 1

        ;



SELECT

    time_bucket('5 minutes', ts) as tb

    , avg(value) as value

FROM data_filtered df

LEFT JOIN avg_and_stddev_per_interval  av

    ON df.ts >= av.five_min

    AND df.ts < av.five_min + interval '5 minutes'

    WHERE abs((value-avg_value)/stddev_value) < 1

    GROUP BY 1

        ;

answered Nov 13 '18 at 14:44

joop

3,2581818

This is exactly as fast/slow as the solution without any CTEs/temp views.

– Tom
Nov 13 '18 at 15:36

In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause WHERE abs((value-avg_value)/stddev_value) < 1 could be avoided.

– joop
Nov 13 '18 at 15:48

This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.

– Tom
Nov 13 '18 at 16:12

I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.

– joop
Nov 13 '18 at 17:40

add a comment |

The simplest way is to replace the CTE parts by (temp) views. This will allow the optimiser to shuffle and re-assemble the query parts.

CREATE TEMP VIEW data_filtered as

    SELECT ts, value

    FROM schema.table

    WHERE some_criteria = 42

    AND ts >= '2018-11-12 10:00:00'

    AND ts < '2018-11-13 10:00:00'

        ;



CREATE TEMP VIEW avg_and_stddev_per_interval as

    SELECT time_bucket('5 minutes', ts) as five_min

    , avg(value) as avg_value

    , stddev(value) as stddev_value

    FROM data_filtered

    GROUP BY 1

        ;



SELECT

    time_bucket('5 minutes', ts) as tb

    , avg(value) as value

FROM data_filtered df

LEFT JOIN avg_and_stddev_per_interval  av

    ON df.ts >= av.five_min

    AND df.ts < av.five_min + interval '5 minutes'

    WHERE abs((value-avg_value)/stddev_value) < 1

    GROUP BY 1

        ;

answered Nov 13 '18 at 14:44

joop

3,2581818

The simplest way is to replace the CTE parts by (temp) views. This will allow the optimiser to shuffle and re-assemble the query parts.

CREATE TEMP VIEW data_filtered as

    SELECT ts, value

    FROM schema.table

    WHERE some_criteria = 42

    AND ts >= '2018-11-12 10:00:00'

    AND ts < '2018-11-13 10:00:00'

        ;



CREATE TEMP VIEW avg_and_stddev_per_interval as

    SELECT time_bucket('5 minutes', ts) as five_min

    , avg(value) as avg_value

    , stddev(value) as stddev_value

    FROM data_filtered

    GROUP BY 1

        ;



SELECT

    time_bucket('5 minutes', ts) as tb

    , avg(value) as value

FROM data_filtered df

LEFT JOIN avg_and_stddev_per_interval  av

    ON df.ts >= av.five_min

    AND df.ts < av.five_min + interval '5 minutes'

    WHERE abs((value-avg_value)/stddev_value) < 1

    GROUP BY 1

        ;

answered Nov 13 '18 at 14:44

joop

3,2581818

answered Nov 13 '18 at 14:44

joop

3,2581818

answered Nov 13 '18 at 14:44

joop

3,2581818

answered Nov 13 '18 at 14:44

joop

3,2581818

This is exactly as fast/slow as the solution without any CTEs/temp views.

– Tom
Nov 13 '18 at 15:36

In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause WHERE abs((value-avg_value)/stddev_value) < 1 could be avoided.

– joop
Nov 13 '18 at 15:48

This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.

– Tom
Nov 13 '18 at 16:12

I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.

– joop
Nov 13 '18 at 17:40

add a comment |

This is exactly as fast/slow as the solution without any CTEs/temp views.

– Tom
Nov 13 '18 at 15:36

In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause WHERE abs((value-avg_value)/stddev_value) < 1 could be avoided.

– joop
Nov 13 '18 at 15:48

This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.

– Tom
Nov 13 '18 at 16:12

I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.

– joop
Nov 13 '18 at 17:40

This is exactly as fast/slow as the solution without any CTEs/temp views.

– Tom
Nov 13 '18 at 15:36

In that case you are either already completely I/O bound, or you lack indexes and/or statistics. BTW: you could win some if the (terrible) where-clause WHERE abs((value-avg_value)/stddev_value) < 1 could be avoided.

– joop
Nov 13 '18 at 15:48

This (terrible) where clause is the one responsible for removing spiky values from my avg-aggregation. Hence the joining etc.

– Tom
Nov 13 '18 at 16:12

I understand that. It is terrible because no indexes can be used, it infers on attributes only. I think a better way would be to create an indexed calendar table with the bins, and join to that, (maybe using window functions). Your bucket function appears to force the plan into a undesirable shape.

– joop
Nov 13 '18 at 17:40

add a comment |

eurotrash' comment lead to a much faster code as follows:

select 

    time_bucket('5 minutes', ts) as tb, avg(value) as value

from schema.table   

left join (

    select time_bucket('5 minutes', ts) as five_min,

        avg(value) as value,

        stddev(value) as stddev_value,

        from schema.table

        where some_criteria = 42

        and ts >= '2018-11-12 00:00:00'

        and ts < '2018-11-13 00:00:00'

        group by five_min

    ) as fm

    on ts >= fm.five_min 

    and ts < fm.five_min + interval '5 minutes'         

where some_criteria = 42

    and ts >= '2018-11-12 00:00:00'

    and ts < '2018-11-13 00:00:00'

    and abs((value-avg_value)/stddev_value) < 1     

group by tb;

Here I got rid of any CTEs which were there only for readability anyhow.

This is still 8 times slower than just requesting averaged values without spike removal, though.

Explain analyze:

Sort  (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)

  Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))

  Sort Method: quicksort  Memory: 65kB

  ->  HashAggregate  (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)

        Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))

        ->  Nested Loop  (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)

              Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))

              Rows Removed by Join Filter: 24770914

              ->  Append  (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)

                    ->  Seq Scan on data  (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)

                          Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))

                    ->  Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk  (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)

                          Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))

                          Filter: (node_id = 8)

                          Rows Removed by Filter: 949135

              ->  Materialize  (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)

                    ->  Subquery Scan on fm  (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)

                          ->  Finalize GroupAggregate  (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)

                                Group Key: (time_bucket('00:05:00'::interval, data_1.ts))

                                ->  Sort  (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)

                                      Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))

                                      Sort Method: quicksort  Memory: 251kB

                                      ->  Gather  (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)

                                            Workers Planned: 2

                                            Workers Launched: 2

                                            ->  Partial HashAggregate  (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)

                                                  Group Key: time_bucket('00:05:00'::interval, data_1.ts)

                                                  ->  Result  (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)

                                                        ->  Append  (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)

                                                              ->  Parallel Seq Scan on data data_1  (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)

                                                                    Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))

                                                              ->  Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1  (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)

                                                                    Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))

                                                                    Filter: (node_id = 8)

                                                                    Rows Removed by Filter: 316378

Planning time: 17.704 ms

Execution time: 24093.223 ms

edited Nov 13 '18 at 15:27

answered Nov 13 '18 at 13:29

Tom

265

Can you post an EXPLAIN ANALYZE so we can see how it's being executed?

– 404
Nov 13 '18 at 14:44

Added explain analyze.

– Tom
Nov 13 '18 at 15:42

Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.

– 404
Nov 13 '18 at 16:15

The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.

– Tom
Nov 13 '18 at 16:30

OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.

– 404
Nov 13 '18 at 16:33

|
show 2 more comments

eurotrash' comment lead to a much faster code as follows:

select 

    time_bucket('5 minutes', ts) as tb, avg(value) as value

from schema.table   

left join (

    select time_bucket('5 minutes', ts) as five_min,

        avg(value) as value,

        stddev(value) as stddev_value,

        from schema.table

        where some_criteria = 42

        and ts >= '2018-11-12 00:00:00'

        and ts < '2018-11-13 00:00:00'

        group by five_min

    ) as fm

    on ts >= fm.five_min 

    and ts < fm.five_min + interval '5 minutes'         

where some_criteria = 42

    and ts >= '2018-11-12 00:00:00'

    and ts < '2018-11-13 00:00:00'

    and abs((value-avg_value)/stddev_value) < 1     

group by tb;

Here I got rid of any CTEs which were there only for readability anyhow.

This is still 8 times slower than just requesting averaged values without spike removal, though.

Explain analyze:

Sort  (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)

  Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))

  Sort Method: quicksort  Memory: 65kB

  ->  HashAggregate  (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)

        Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))

        ->  Nested Loop  (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)

              Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))

              Rows Removed by Join Filter: 24770914

              ->  Append  (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)

                    ->  Seq Scan on data  (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)

                          Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))

                    ->  Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk  (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)

                          Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))

                          Filter: (node_id = 8)

                          Rows Removed by Filter: 949135

              ->  Materialize  (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)

                    ->  Subquery Scan on fm  (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)

                          ->  Finalize GroupAggregate  (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)

                                Group Key: (time_bucket('00:05:00'::interval, data_1.ts))

                                ->  Sort  (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)

                                      Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))

                                      Sort Method: quicksort  Memory: 251kB

                                      ->  Gather  (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)

                                            Workers Planned: 2

                                            Workers Launched: 2

                                            ->  Partial HashAggregate  (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)

                                                  Group Key: time_bucket('00:05:00'::interval, data_1.ts)

                                                  ->  Result  (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)

                                                        ->  Append  (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)

                                                              ->  Parallel Seq Scan on data data_1  (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)

                                                                    Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))

                                                              ->  Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1  (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)

                                                                    Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))

                                                                    Filter: (node_id = 8)

                                                                    Rows Removed by Filter: 316378

Planning time: 17.704 ms

Execution time: 24093.223 ms

edited Nov 13 '18 at 15:27

answered Nov 13 '18 at 13:29

Tom

265

Can you post an EXPLAIN ANALYZE so we can see how it's being executed?

– 404
Nov 13 '18 at 14:44

Added explain analyze.

– Tom
Nov 13 '18 at 15:42

Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.

– 404
Nov 13 '18 at 16:15

The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.

– Tom
Nov 13 '18 at 16:30

OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.

– 404
Nov 13 '18 at 16:33

|
show 2 more comments

eurotrash' comment lead to a much faster code as follows:

select 

    time_bucket('5 minutes', ts) as tb, avg(value) as value

from schema.table   

left join (

    select time_bucket('5 minutes', ts) as five_min,

        avg(value) as value,

        stddev(value) as stddev_value,

        from schema.table

        where some_criteria = 42

        and ts >= '2018-11-12 00:00:00'

        and ts < '2018-11-13 00:00:00'

        group by five_min

    ) as fm

    on ts >= fm.five_min 

    and ts < fm.five_min + interval '5 minutes'         

where some_criteria = 42

    and ts >= '2018-11-12 00:00:00'

    and ts < '2018-11-13 00:00:00'

    and abs((value-avg_value)/stddev_value) < 1     

group by tb;

Here I got rid of any CTEs which were there only for readability anyhow.

This is still 8 times slower than just requesting averaged values without spike removal, though.

Explain analyze:

Sort  (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)

  Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))

  Sort Method: quicksort  Memory: 65kB

  ->  HashAggregate  (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)

        Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))

        ->  Nested Loop  (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)

              Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))

              Rows Removed by Join Filter: 24770914

              ->  Append  (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)

                    ->  Seq Scan on data  (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)

                          Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))

                    ->  Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk  (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)

                          Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))

                          Filter: (node_id = 8)

                          Rows Removed by Filter: 949135

              ->  Materialize  (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)

                    ->  Subquery Scan on fm  (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)

                          ->  Finalize GroupAggregate  (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)

                                Group Key: (time_bucket('00:05:00'::interval, data_1.ts))

                                ->  Sort  (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)

                                      Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))

                                      Sort Method: quicksort  Memory: 251kB

                                      ->  Gather  (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)

                                            Workers Planned: 2

                                            Workers Launched: 2

                                            ->  Partial HashAggregate  (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)

                                                  Group Key: time_bucket('00:05:00'::interval, data_1.ts)

                                                  ->  Result  (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)

                                                        ->  Append  (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)

                                                              ->  Parallel Seq Scan on data data_1  (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)

                                                                    Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))

                                                              ->  Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1  (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)

                                                                    Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))

                                                                    Filter: (node_id = 8)

                                                                    Rows Removed by Filter: 316378

Planning time: 17.704 ms

Execution time: 24093.223 ms

edited Nov 13 '18 at 15:27

answered Nov 13 '18 at 13:29

Tom

265

eurotrash' comment lead to a much faster code as follows:

select 

    time_bucket('5 minutes', ts) as tb, avg(value) as value

from schema.table   

left join (

    select time_bucket('5 minutes', ts) as five_min,

        avg(value) as value,

        stddev(value) as stddev_value,

        from schema.table

        where some_criteria = 42

        and ts >= '2018-11-12 00:00:00'

        and ts < '2018-11-13 00:00:00'

        group by five_min

    ) as fm

    on ts >= fm.five_min 

    and ts < fm.five_min + interval '5 minutes'         

where some_criteria = 42

    and ts >= '2018-11-12 00:00:00'

    and ts < '2018-11-13 00:00:00'

    and abs((value-avg_value)/stddev_value) < 1     

group by tb;

Here I got rid of any CTEs which were there only for readability anyhow.

This is still 8 times slower than just requesting averaged values without spike removal, though.

Explain analyze:

Sort  (cost=844212.16..844212.66 rows=200 width=80) (actual time=24090.495..24090.572 rows=288 loops=1)

  Sort Key: (date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts)))

  Sort Method: quicksort  Memory: 65kB

  ->  HashAggregate  (cost=844200.01..844204.51 rows=200 width=80) (actual time=24089.175..24089.822 rows=288 loops=1)

        Group Key: date_part('epoch'::text, time_bucket('00:05:00'::interval, data.ts))

        ->  Nested Loop  (cost=48033.56..838525.89 rows=226965 width=32) (actual time=792.374..23747.480 rows=79166 loops=1)

              Join Filter: ((data.ts >= fm.five_min) AND (data.ts < (fm.five_min + '00:05:00'::interval)) AND (abs(((data.angle_x - fm.avg_angle_x) / fm.stddev_angle_x)) < '2'::double precision) AND (abs(((data.angle_y - fm.avg_angle_y) / fm.stddev_angle_y)) < '2'::double precision))

              Rows Removed by Join Filter: 24770914

              ->  Append  (cost=0.00..53976.50 rows=91921 width=32) (actual time=0.276..1264.179 rows=86285 loops=1)

                    ->  Seq Scan on data  (cost=0.00..0.00 rows=1 width=32) (actual time=0.027..0.027 rows=0 loops=1)

                          Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))

                    ->  Index Scan using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk  (cost=0.43..53976.50 rows=91920 width=32) (actual time=0.243..1228.940 rows=86285 loops=1)

                          Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))

                          Filter: (node_id = 8)

                          Rows Removed by Filter: 949135

              ->  Materialize  (cost=48033.56..48047.06 rows=200 width=40) (actual time=0.010..0.083 rows=288 loops=86285)

                    ->  Subquery Scan on fm  (cost=48033.56..48046.06 rows=200 width=40) (actual time=787.756..791.299 rows=288 loops=1)

                          ->  Finalize GroupAggregate  (cost=48033.56..48044.06 rows=200 width=40) (actual time=787.750..791.071 rows=288 loops=1)

                                Group Key: (time_bucket('00:05:00'::interval, data_1.ts))

                                ->  Sort  (cost=48033.56..48034.56 rows=400 width=136) (actual time=787.680..788.049 rows=853 loops=1)

                                      Sort Key: (time_bucket('00:05:00'::interval, data_1.ts))

                                      Sort Method: quicksort  Memory: 251kB

                                      ->  Gather  (cost=47973.77..48016.27 rows=400 width=136) (actual time=783.341..785.774 rows=853 loops=1)

                                            Workers Planned: 2

                                            Workers Launched: 2

                                            ->  Partial HashAggregate  (cost=46973.77..46976.27 rows=200 width=136) (actual time=758.173..759.378 rows=284 loops=3)

                                                  Group Key: time_bucket('00:05:00'::interval, data_1.ts)

                                                  ->  Result  (cost=0.00..46495.01 rows=38301 width=24) (actual time=0.136..676.873 rows=28762 loops=3)

                                                        ->  Append  (cost=0.00..46016.25 rows=38301 width=24) (actual time=0.131..644.540 rows=28762 loops=3)

                                                              ->  Parallel Seq Scan on data data_1  (cost=0.00..0.00 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=3)

                                                                    Filter: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone) AND (node_id = 8))

                                                              ->  Parallel Index Scan Backward using _hyper_2_22_chunk_data_ts_idx on _hyper_2_22_chunk _hyper_2_22_chunk_1  (cost=0.43..46016.25 rows=38300 width=24) (actual time=0.126..630.920 rows=28762 loops=3)

                                                                    Index Cond: ((ts >= '2018-10-18 11:05:00+02'::timestamp with time zone) AND (ts < '2018-10-19 11:05:00+02'::timestamp with time zone))

                                                                    Filter: (node_id = 8)

                                                                    Rows Removed by Filter: 316378

Planning time: 17.704 ms

Execution time: 24093.223 ms

edited Nov 13 '18 at 15:27

answered Nov 13 '18 at 13:29

Tom

265

edited Nov 13 '18 at 15:27

answered Nov 13 '18 at 13:29

Tom

265

answered Nov 13 '18 at 13:29

Tom

265

answered Nov 13 '18 at 13:29

Tom

265

Can you post an EXPLAIN ANALYZE so we can see how it's being executed?

– 404
Nov 13 '18 at 14:44

Added explain analyze.

– Tom
Nov 13 '18 at 15:42

Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.

– 404
Nov 13 '18 at 16:15

The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.

– Tom
Nov 13 '18 at 16:30

OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.

– 404
Nov 13 '18 at 16:33

|
show 2 more comments

Can you post an EXPLAIN ANALYZE so we can see how it's being executed?

– 404
Nov 13 '18 at 14:44

Added explain analyze.

– Tom
Nov 13 '18 at 15:42

Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.

– 404
Nov 13 '18 at 16:15

The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.

– Tom
Nov 13 '18 at 16:30

OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.

– 404
Nov 13 '18 at 16:33

Can you post an EXPLAIN ANALYZE so we can see how it's being executed?

– 404
Nov 13 '18 at 14:44

Added explain analyze.

– Tom
Nov 13 '18 at 15:42

Can you run the subquery (which is left-joined to) on its own, and the rest of the query on its own (no left join), and just see how fast they are on their own? It seems to me that the join is causing the slowness, so it would be nice to know if each component on its own is relatively fast.

– 404
Nov 13 '18 at 16:15

The subquery finishes in ~3.5s, the rest of the query also ~3.5s. Both combined take ~20s. So it seems that joining is the problem.

– Tom
Nov 13 '18 at 16:30

OK I have a suggestion then which I'll post as an answer as it takes too much text for a comment.

– 404
Nov 13 '18 at 16:33

|
show 2 more comments

So this is my suggestion:

The subquery results are put into a temp table

The temp table is indexed

The join is performed on the temp table

Encapsulate all this in a function

So something like this:

CREATE OR REPLACE FUNCTION schema.my_function()

    RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS

$BODY$

BEGIN

    CREATE TEMP TABLE fm ON COMMIT DROP AS

        select time_bucket('5 minutes', ts) as five_min,

            avg(value) as value,

            stddev(value) as stddev_value

        from schema.table

        where some_criteria = 42

        and ts >= '2018-11-12 00:00:00'

        and ts < '2018-11-13 00:00:00'

        group by five_min;



    CREATE INDEX ON fm (five_min);



    RETURN time_bucket('5 minutes', ts), avg(value)

    from schema.table

    left join fm

        on ts >= fm.five_min 

        and ts < fm.five_min + interval '5 minutes'

    where some_criteria = 42

    and ts >= '2018-11-12 00:00:00'

    and ts < '2018-11-13 00:00:00'

    and abs((value-avg_value)/stddev_value) < 1     

    group by tb;

END

$BODY$

    LANGUAGE plpgsql;

Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.

I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.

answered Nov 13 '18 at 16:48

404

3,0601626

Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow

– Tom
Nov 13 '18 at 17:20

Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.

– 404
Nov 13 '18 at 17:25

The temp table & index creations don't take long at all, it is really just the bare query that takes longer.

– Tom
Nov 13 '18 at 17:52

add a comment |

So this is my suggestion:

The subquery results are put into a temp table

The temp table is indexed

The join is performed on the temp table

Encapsulate all this in a function

So something like this:

CREATE OR REPLACE FUNCTION schema.my_function()

    RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS

$BODY$

BEGIN

    CREATE TEMP TABLE fm ON COMMIT DROP AS

        select time_bucket('5 minutes', ts) as five_min,

            avg(value) as value,

            stddev(value) as stddev_value

        from schema.table

        where some_criteria = 42

        and ts >= '2018-11-12 00:00:00'

        and ts < '2018-11-13 00:00:00'

        group by five_min;



    CREATE INDEX ON fm (five_min);



    RETURN time_bucket('5 minutes', ts), avg(value)

    from schema.table

    left join fm

        on ts >= fm.five_min 

        and ts < fm.five_min + interval '5 minutes'

    where some_criteria = 42

    and ts >= '2018-11-12 00:00:00'

    and ts < '2018-11-13 00:00:00'

    and abs((value-avg_value)/stddev_value) < 1     

    group by tb;

END

$BODY$

    LANGUAGE plpgsql;

Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.

I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.

answered Nov 13 '18 at 16:48

404

3,0601626

Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow

– Tom
Nov 13 '18 at 17:20

Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.

– 404
Nov 13 '18 at 17:25

The temp table & index creations don't take long at all, it is really just the bare query that takes longer.

– Tom
Nov 13 '18 at 17:52

add a comment |

So this is my suggestion:

The subquery results are put into a temp table

The temp table is indexed

The join is performed on the temp table

Encapsulate all this in a function

So something like this:

CREATE OR REPLACE FUNCTION schema.my_function()

    RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS

$BODY$

BEGIN

    CREATE TEMP TABLE fm ON COMMIT DROP AS

        select time_bucket('5 minutes', ts) as five_min,

            avg(value) as value,

            stddev(value) as stddev_value

        from schema.table

        where some_criteria = 42

        and ts >= '2018-11-12 00:00:00'

        and ts < '2018-11-13 00:00:00'

        group by five_min;



    CREATE INDEX ON fm (five_min);



    RETURN time_bucket('5 minutes', ts), avg(value)

    from schema.table

    left join fm

        on ts >= fm.five_min 

        and ts < fm.five_min + interval '5 minutes'

    where some_criteria = 42

    and ts >= '2018-11-12 00:00:00'

    and ts < '2018-11-13 00:00:00'

    and abs((value-avg_value)/stddev_value) < 1     

    group by tb;

END

$BODY$

    LANGUAGE plpgsql;

Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.

I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.

answered Nov 13 '18 at 16:48

404

3,0601626

So this is my suggestion:

The subquery results are put into a temp table

The temp table is indexed

The join is performed on the temp table

Encapsulate all this in a function

So something like this:

CREATE OR REPLACE FUNCTION schema.my_function()

    RETURNS TABLE (tb SOMETYPE, avg NUMERIC) AS

$BODY$

BEGIN

    CREATE TEMP TABLE fm ON COMMIT DROP AS

        select time_bucket('5 minutes', ts) as five_min,

            avg(value) as value,

            stddev(value) as stddev_value

        from schema.table

        where some_criteria = 42

        and ts >= '2018-11-12 00:00:00'

        and ts < '2018-11-13 00:00:00'

        group by five_min;



    CREATE INDEX ON fm (five_min);



    RETURN time_bucket('5 minutes', ts), avg(value)

    from schema.table

    left join fm

        on ts >= fm.five_min 

        and ts < fm.five_min + interval '5 minutes'

    where some_criteria = 42

    and ts >= '2018-11-12 00:00:00'

    and ts < '2018-11-13 00:00:00'

    and abs((value-avg_value)/stddev_value) < 1     

    group by tb;

END

$BODY$

    LANGUAGE plpgsql;

Obviously the index I created is simply based on the example in the query you posted, though I see the actual query contains other things, so you'd want to index any fields being joined on.

I called tb's type SOMETYPE because I don't know what type time_bucket returns. And of course you could pass any parts of the query that should be variable as parameters.

answered Nov 13 '18 at 16:48

404

3,0601626

answered Nov 13 '18 at 16:48

404

3,0601626

answered Nov 13 '18 at 16:48

404

3,0601626

answered Nov 13 '18 at 16:48

404

3,0601626

Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow

– Tom
Nov 13 '18 at 17:20

Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.

– 404
Nov 13 '18 at 17:25

The temp table & index creations don't take long at all, it is really just the bare query that takes longer.

– Tom
Nov 13 '18 at 17:52

add a comment |

Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow

– Tom
Nov 13 '18 at 17:20

Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.

– 404
Nov 13 '18 at 17:25

The temp table & index creations don't take long at all, it is really just the bare query that takes longer.

– Tom
Nov 13 '18 at 17:52

Is there any advantage wrapping this in a function? Just running the main query (with the left join) on temp tables with a created index is even slower than without temp tables (24s vs. 18s) anyhow

– Tom
Nov 13 '18 at 17:20

Really? Fair enough, I'm surprised though as creating a temp table should not really take much longer than running the query, creating the index should be fast, and the join should be much quicker. But there you go. No, there's no need to wrap it in a function, just for the sake of convenience.

– 404
Nov 13 '18 at 17:25

The temp table & index creations don't take long at all, it is really just the bare query that takes longer.

– Tom
Nov 13 '18 at 17:52

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Ndtyjky