Randomly fetching a specified number of rows: is there any better way?























MySQL: get random rows



Scenario



I need to randomly fetch a specified number of rows from the database, but this turns out to be surprisingly troublesome.



Suppose there is a data table



    CREATE TABLE topic (
      id      INT PRIMARY KEY NOT NULL COMMENT 'number',
      content VARCHAR(20) NOT NULL COMMENT 'content'
    )
      COMMENT 'topic table';
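
To experiment with the queries below you first need some test data. A minimal sketch for generating it, assuming MySQL 8.0 for the recursive CTE; the 100,000-row count matches the 10w figure mentioned later, and the content values are made up:

    -- hypothetical test data: 100,000 sequential rows (requires MySQL 8.0)
    SET SESSION cte_max_recursion_depth = 100001;
    INSERT INTO topic (id, content)
    WITH RECURSIVE seq (n) AS (
      SELECT 1
      UNION ALL
      SELECT n + 1 FROM seq WHERE n < 100000
    )
    SELECT n, CONCAT('content ', n)
    FROM seq;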




The topic table here has two key properties:
- the primary key is comparable (an int)
- the primary key values have an overall trend (auto-increment/decrement)




Solution 1: Directly use order by rand()



You can get random rows directly with order by rand(), and it can return every row in the table (in random order).




  1. Sort by the result of rand()
        > This step is equivalent to appending a column of rand() values to every row and then sorting on that column

  2. Limit the number of rows returned


    SELECT *
    FROM topic
    ORDER BY RAND()
    LIMIT 50000;



But the disadvantage is obvious: speed. The rand() values are not indexed, so the sort is very slow.



Randomly fetching 50,000 rows out of 100,000 regularly takes around 6 s 378 ms, which is far too long.
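
You can see where the time goes with EXPLAIN (a sketch; on a typical setup the Extra column reports the temporary table and filesort that make this slow):

    -- EXPLAIN should show "Using temporary; Using filesort" in the Extra
    -- column, i.e. the unindexed sort described above
    EXPLAIN
    SELECT *
    FROM topic
    ORDER BY RAND()
    LIMIT 50000;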



In fact, order by rand() may look strange, but it is equivalent to:



    SELECT *
    FROM (
          SELECT
            topic.*,
            RAND() AS order_column
          FROM topic
        ) AS temp
    ORDER BY order_column
    LIMIT 50000;



Solution 2: Use where with a random pivot value



Since sorting the unindexed output of order by rand() is too time-consuming, we can try to get around the sort entirely.



The idea is as follows:




  1. Take a random value between the minimum and maximum id

  2. Keep rows whose id is greater than (or less than) this random value

  3. Limit the number of rows returned


    SELECT *
    FROM topic
    WHERE id >= ((SELECT MAX(id) FROM topic)
                 - (SELECT MIN(id) FROM topic))
                * RAND()
                + (SELECT MIN(id) FROM topic)
    LIMIT 50000;



This method is extremely fast (150 ms), but it is affected by the density of the data: if the ids are unevenly distributed, the number of rows the query can return is limited.



So here are the defects of this method:





  • The number of rows returned depends on the distribution density

      For example, suppose the ids are distributed like this:

      1, 100002, 100003, 100004 ... 199999, 200000

      Then the query above returns only a small number of rows on average (around 25,000 or so). However, if you flip the comparison from >= to <=, the average count rises sharply (to around 75,000).



          




    The same query with the comparison flipped:

    SELECT *
    FROM topic
    # note: the comparison operator here has been flipped
    WHERE id <= ((SELECT MAX(id) FROM topic)
                 - (SELECT MIN(id) FROM topic))
                * RAND()
                + (SELECT MIN(id) FROM topic)
    LIMIT 50000;




  • Each row does not have exactly the same probability of being selected
      Although the rows returned are random, they are not all equally likely. For example, with <=, the row with the smallest id shows up in almost every result: any random pivot is at least min(id), and the table is scanned from the first row onward. Even with >=, the first row returned generally sits only just above the pivot.
      With >=:
      - the nearer a row is to the front of the table, the lower its probability of being selected
      - but even at very low probability some row has to come first, so the first row returned generally has a comparatively small id
      - when the ids are densely packed at one end, the number of rows obtained can be very small



The closer the density is to uniform, the closer the expected fraction of the table you can fetch this way gets to 1/2; the less uniform it is, the further it deviates (in either direction).
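
One way to see the bias empirically (a sketch; it counts, for a single random pivot, how many rows would be available before the LIMIT kicks in — run it several times and watch the spread):

    -- the derived table is materialized once, so the pivot is fixed per run
    SELECT COUNT(*) AS rows_available
    FROM topic t
      JOIN (SELECT ((SELECT MAX(id) FROM topic)
                    - (SELECT MIN(id) FROM topic))
                   * RAND()
                   + (SELECT MIN(id) FROM topic) AS pivot) AS r
    WHERE t.id >= r.pivot;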



Solution 3: Use a temporary table



Solution 2 focused on avoiding the unindexed rand() sort altogether; here is a different idea: make the rand() column indexed before sorting on it. Create a temporary table containing only the primary key id and an indexed column randomId to sort by, then join back to the original table to fetch the shuffled rows once the sort is done.



    DROP TEMPORARY TABLE IF EXISTS temp_topic;
    CREATE TEMPORARY TABLE temp_topic (
      id BIGINT PRIMARY KEY NOT NULL,
      randomId DOUBLE NOT NULL,
      INDEX (randomId)
    )
      AS
        SELECT
          id,
          RAND() AS randomId
        FROM topic;

    SELECT t.*
    FROM topic t
      JOIN (
            SELECT id
            FROM temp_topic
            ORDER BY randomId
            LIMIT 50000
          ) AS temp
        ON t.id = temp.id;



This method is not especially fast (878 ms, compared with the second solution), and the time still grows with the amount of data (because the ids have to be copied). But, like the first solution, it gives a truly random sample.



Summary




Here is a good English article analyzing random row selection: http://jan.kneschke.de/projects/mysql/order-by-rand/ (some of its queries did not work for me, for reasons I don't know...).
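
For reference, a sketch of the core single-row trick from that article, adapted to the topic table here (the alias names are mine; note it shares the density bias of solution 2):

    -- pick one random row by joining against a single random pivot;
    -- fast, but biased when the ids have gaps (same caveat as solution 2)
    SELECT t.*
    FROM topic AS t
      JOIN (SELECT (RAND() * (SELECT MAX(id) FROM topic)) AS pivot) AS r
    WHERE t.id >= r.pivot
    ORDER BY t.id ASC
    LIMIT 1;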




| Differences                              | order by rand() | where             | temporary   |
| ---------------------------------------- | --------------- | ----------------- | ----------- |
| Can return all rows at random            | Yes             | Almost impossible | Yes         |
| Speed                                    | Slow            | Very fast         | Fairly fast |
| Needs a comparable primary key type      | No              | Yes               | No          |
| Affected by data distribution density    | No              | Yes               | No          |
| Speed affected by amount of table data   | Very much       | Very little       | A little    |










mysql

edited Nov 13 at 19:42 · asked Nov 10 at 2:32 by rxliuli
























  • Your result will only be as you are expecting if your ids are gapless and sequential (or equidistant) (if your ids are e.g. 1, 100000..100998, you will get count = 999 for >99,9% of your runs). Check for that (e.g. if "min + 1000 = max - 1"). Also, a warning: your two approaches are completely different. The 2nd query ("before") will give you a random subset of your users, while the 1st code will just specify where you start the list, e.g. it will in 100% of cases include the max-id user and basically never the min-id user. I am not sure if that new behaviour is what you intended.
    – Solarflare
    Nov 10 at 9:48












  • @Solarflare Yes, I already know that the cause of the problem is the uneven distribution of the data. Also, both methods are meant to fetch random data, but both are flawed; is there a better way?
    – rxliuli
    Nov 10 at 14:49










  • If you already know the reason for the uneven distribution, I am not entirely sure what your question is about, it sounded like you wanted to know why you don't get values < 500. If not, you need to clarify that. And again: both codes do completely different things. Asking for a better way (to do what!?) is like asking "Oranges and apples are flawed. Is there a better fruit?" The answer might be different if you want to make an apple pie, orange juice or banana bread. So you would need to describe what you want to do (exactly) in order for us to suggest something (different).
    – Solarflare
    Nov 10 at 16:14












  • Does it work any faster if you do select * from topic where id in (select id from topic order by rand() limit 1000)? BTW you're selecting 50% of rows... is that correct?
    – Salman A
    Nov 12 at 14:01












  • First of all, select * from topic where id in (select id from topic order by rand() limit 1000) can't run (MySQL does not support LIMIT inside an IN subquery), and anything using order by rand() is painfully slow anyway! Also, it's not that I specifically want 50% of the data; I just still can't find a particularly good solution...
    – rxliuli
    Nov 13 at 19:45
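
A sketch of the usual workaround for the LIMIT-inside-IN restriction mentioned in the comment above: wrap the limited subquery in one more derived table, which MySQL accepts (though it remains as slow as any order by rand() variant):

    -- MySQL rejects LIMIT directly inside IN (...), but an extra
    -- derived table around it is accepted
    SELECT *
    FROM topic
    WHERE id IN (SELECT id
                 FROM (SELECT id
                       FROM topic
                       ORDER BY RAND()
                       LIMIT 1000) AS picked);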














