Find tables on which Postgresql doing Sequential Scan

First let’s understand why we should find out the tables for Sequential Scan. Basically what happens is when we design the schema and create indexes, we want that almost in every case the system should use indexes for getting the data, but later somebody writes a query which dosen’t use indexes and do sequential scan and slows down your system at scale.

To solve this part we should first find out on what tables Postgresql is doing sequential scan. For this postgresql provides stats.

Lets do a query on the stats to find tables for sequential scan:

SELECT schemaname, relname,seq_scan, seq_tup_read,seq_tup_read / seq_scan as avg_seq_tup_read FROM pg_stat_all_tables WHERE seq_scan > 0 ORDER BY 5 DESC LIMIT 5;

schemaname | relname  | seq_scan | seq_tup_read | avg_seq_tup_read 
------------+----------+----------+--------------+------------------
 public     | tickets  |        3 |     24000000 |          8000000
 public     | orders5  |       11 |     50000000 |          4545454
 public     | student  |       18 |      5600010 |           311111
 pg_catalog | pg_class |    89061 |     27452103 |              308
 pg_catalog | pg_index |       35 |         1174 |               33
(5 rows)

The above query tells on which tables sequential scan is done , seq_scan provides the count of sequential scan , seq_tup_read tells the total no. of rows read by the system in all the sequential scan and avg_seq_tup_read tells on average how much rows is read by the query.

Now lets see how to figure out whether this is fine or not:

  • First list all the tables on which sequential scan should not be running – in our case lets take tickets tables
  • Now get the total no. of records in tickets table. – in our case it is – 8000000
  • Now check whether system is scanning the entire table or large portion of the table every time
    • as we can see that avg_seq_tup_read is also 8000000 which means system is scanning entire table every time
  • If the system scanning entire table and you know this table would be huge in future that means some queries on this table are actually not using the index – which is not good
  • check from postgresql logs – the queries on this table which take time (there may be a case you need to enable all query logs for postgresql for some time — generally do not enable this for long , might cause downtime in production , a ok setting would be log any query which is greater than 1 sec)
  • Then check using explain analyze whether this query is doing sequential scan – (if in doubt do not run in production system or run in very low load hours)
  • After you discovered the query try to create indexes – we have a blogs on understanding indexes and why system not using your index , you can check out these if you are unable to figure out.

Hypothetical Indexes in PostgreSQL – Need and Usage

In this Blog We will first cover what are hypothetical indexes and what is need of these type of indexes and then secondly  we shall see the practical usage of these indexes . 

Need of Hypothetical Indexes : 

As the name suggests   these are not real indexes,  these are hypothetical indexes i.e.. They are virtual indexes which PostgreSQL  query planner does not consider when running queries . 

 Now the question arises where these Hypothetical Indexes  are useful  ? ? ? 

First Let’s discuss one scenario  , we have a large table which is currently in the production environment  and we need to make some index on live db and we are not sure whether that index will be useful or not , we even don’t know if  by making that index our production environment may be down !!!

So , Solution of above problem will be following : 

  • Lets ignore the risk and make  the index on live table which can result the following scenario 
    • first of all it will take lots of time depending on data present in table 
    • Live queries may get affected badly if we are not sure if the index we are making will increase or decrease the cost .
    • We also do not know the size of the index it may be too large which can again impact the production database server . 
  • Another solution is to replicate the production database to the local dev environment and then apply all the hits and try there and then apply at the production environment .  it seems a very safe and effective approach in almost all cases but this will take too much of time in setting up the things and testing with actual scenario .  
  • Third Solution is Hypothetical Indexes  as this functionality will create imaginary indexes not real indexes .  But  there are some things to note about these indexes :
    •  it creates an index in our connection’s private memory. None of the catalog tables nor the original tables are actually touched. 
    • The only way to see if we can benefit from that index is by running an EXPLAIN <QUERY>.
    •  If you wish to run an EXPLAIN ANALYZE that runs that SQL and provides the run time stats, it would not be able to use that hypothetical index. This is because a hypothetical index does not exist in reality.
    • Currently these indexes will work on BTREE ONLY . However you can try if it works on other type of indexes .

Usage of Hypothetical indexes : 

Installation 

I am using PostgreSQL10 on CentOS7 . 

  • Download hypopg by following command : 
Wget  https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7-x86_64/hypopg_10-1.1.4-1.rhel7.x86_64.rpm
  • Then install it on CentOS7 :
yum install hypopg_10-1.1.4-1.rhel7.x86_64.rpm
  • Now create extension using following query 
testdb=# CREATE EXTENSION hypopg ;
CREATE EXTENSION
  • On Creating extension following functions will be created . 
testdb=#  select proname from pg_proc where proname ilike '%hyp%';
       proname        
----------------------
 hypopg_reset_index
 hypopg_reset
 hypopg_create_index
 hypopg_drop_index
 hypopg
 hypopg_list_indexes
 hypopg_relation_size
 hypopg_get_indexdef

Usage 

Now Let’s make a table with 10 Crores rows as following : 


testdb=# create table orders as select s as orderno, md5(random()::text) as orderitem , now() as order_created from generate_Series(1,100000000) s;
SELECT 100000000

Now check COST of a query by running explain : 

testdb=# explain select * from orders where orderno > 80000  order by order_created desc  limit 100 ;
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Limit  (cost=3600088.98..3600089.23 rows=100 width=44)
   ->  Sort  (cost=3600088.98..3688095.27 rows=35202513 width=44)
         Sort Key: order_created DESC
         ->  Seq Scan on orders  (cost=0.00..2254674.25 rows=35202513 width=44)
               Filter: (orderno > 80000)
(5 rows)

Now create the Hypothetical Index : 

testdb=# SELECT * FROM hypopg_create_index('create index order_created_idx on orders(order_created)');
 indexrelid |             indexname             
------------+-----------------------------------
      24797 | <24797>btree_orders_order_created
(1 row)

Now again do Explain to check if above index may be useful or not : 

testdb=# explain select * from orders where orderno > 80000  order by order_created desc  limit 100 ;
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.07..4.29 rows=100 width=45)
   ->  Index Scan Backward using "<24797>btree_orders_order_created" on orders  (cost=0.07..4215496.19 rows=99917459 width=45)
         Filter: (orderno > 80000)
(3 rows)

from both Explain command output we can clearly see the diffrence in cost and can also see that planner is using newly created hypothetical index .

We can Drop the index as follows : 

testdb=# select * from hypopg_drop_index(24797);
 hypopg_drop_index 
-------------------
 t
(1 row)

We can also check the estimated size of index created virtually as follows : 

testdb=# SELECT * FROM hypopg_create_index('create index order_created_idx on orders(order_created)');
 indexrelid |             indexname             
------------+-----------------------------------
      24798 | <24798>btree_orders_order_created
(1 row)

testdb=# select * from pg_size_pretty(hypopg_relation_size(24798));
 pg_size_pretty 
----------------
 2990 MB
(1 row)

Now lets create actual index and see what will be the actual size

testdb=# create index order_created_idx on orders(order_created);
CREATE INDEX
testdb=# \di+ order_created_idx
                               List of relations
 Schema |       Name        | Type  |  Owner   | Table  |  Size   | Description 
--------+-------------------+-------+----------+--------+---------+-------------
 public | order_created_idx | index | postgres | orders | 2142 MB | 
(1 row)

As seen estimated and actual size is comparable .

I  Hope it clears the usage if hypothetical index in PostgreSQL .  In one of our blogs we learned about why index is not working  and also how to check on which tables index needed .  

In our future blogs we will talk about how you will get to know about exact index you need to make in database   . 

Stay Tuned to hello worlds . . . 

References : https://www.percona.com/blog/2019/06/21/hypothetical-indexes-in-postgresql/