Hypothetical Indexes in PostgreSQL – Need and Usage

In this blog, we will first cover what hypothetical indexes are and why we need them, and then we will look at their practical usage.

Need for Hypothetical Indexes:

As the name suggests, these are not real indexes; they are virtual indexes that the PostgreSQL query planner can take into account when estimating a plan (via EXPLAIN), but that are never used when actually running queries.

Now the question arises: where are these hypothetical indexes useful?

First, let's discuss a scenario. We have a large table in a production environment, and we need to create an index on the live database, but we are not sure whether that index will be useful. We don't even know whether creating that index could bring our production environment down.

The above problem has the following possible solutions:

  • Ignore the risk and create the index on the live table, which can lead to the following:
    • First of all, it will take a lot of time, depending on the data present in the table.
    • Live queries may be affected badly if we are not sure whether the index will increase or decrease the query cost.
    • We also do not know how large the index will be; it may be so large that it again impacts the production database server.
  • Another solution is to replicate the production database to a local dev environment, apply all the changes and test them there, and only then apply them to the production environment. This seems a safe and effective approach in almost all cases, but it takes a lot of time to set everything up and test against the actual scenario.
  • The third solution is hypothetical indexes, since this functionality creates imaginary indexes, not real ones. There are some things to note about these indexes:
    • A hypothetical index is created in our connection’s private memory; neither the catalog tables nor the original tables are actually touched.
    • The only way to see whether we can benefit from such an index is by running an EXPLAIN <QUERY>.
    • If you run EXPLAIN ANALYZE, which executes the SQL and provides run-time stats, it will not be able to use the hypothetical index, because a hypothetical index does not exist in reality.
    • Currently these indexes work on BTREE only, though you can try whether they work with other index types.

Usage of Hypothetical Indexes:

Installation 

I am using PostgreSQL 10 on CentOS 7.

  • Download hypopg with the following command:
wget https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7-x86_64/hypopg_10-1.1.4-1.rhel7.x86_64.rpm
  • Then install it on CentOS 7:
yum install hypopg_10-1.1.4-1.rhel7.x86_64.rpm
  • Now create the extension using the following query:
testdb=# CREATE EXTENSION hypopg ;
CREATE EXTENSION
  • On creating the extension, the following functions are created:
testdb=#  select proname from pg_proc where proname ilike '%hyp%';
       proname        
----------------------
 hypopg_reset_index
 hypopg_reset
 hypopg_create_index
 hypopg_drop_index
 hypopg
 hypopg_list_indexes
 hypopg_relation_size
 hypopg_get_indexdef
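
Besides hypopg_create_index and hypopg_drop_index, which we use below, hypopg_list_indexes() can be used to see which hypothetical indexes currently exist in the session (a quick illustration; the exact output columns may differ slightly between hypopg versions):

testdb=# select * from hypopg_list_indexes();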

Usage 

Now let's create a table with 100 million (10 crore) rows as follows:


testdb=# create table orders as select s as orderno, md5(random()::text) as orderitem , now() as order_created from generate_Series(1,100000000) s;
SELECT 100000000
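
Because the planner's cost estimates depend on table statistics, it may help to ANALYZE the freshly loaded table before comparing plans (an optional extra step, not strictly required for this demo):

testdb=# ANALYZE orders;
ANALYZE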

Now check the cost of a query by running EXPLAIN:

testdb=# explain select * from orders where orderno > 80000  order by order_created desc  limit 100 ;
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Limit  (cost=3600088.98..3600089.23 rows=100 width=44)
   ->  Sort  (cost=3600088.98..3688095.27 rows=35202513 width=44)
         Sort Key: order_created DESC
         ->  Seq Scan on orders  (cost=0.00..2254674.25 rows=35202513 width=44)
               Filter: (orderno > 80000)
(5 rows)

Now create the hypothetical index:

testdb=# SELECT * FROM hypopg_create_index('create index order_created_idx on orders(order_created)');
 indexrelid |             indexname             
------------+-----------------------------------
      24797 | <24797>btree_orders_order_created
(1 row)

Now run EXPLAIN again to check whether the above index would be useful:

testdb=# explain select * from orders where orderno > 80000  order by order_created desc  limit 100 ;
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.07..4.29 rows=100 width=45)
   ->  Index Scan Backward using "<24797>btree_orders_order_created" on orders  (cost=0.07..4215496.19 rows=99917459 width=45)
         Filter: (orderno > 80000)
(3 rows)

From the output of both EXPLAIN commands we can clearly see the difference in cost, and we can also see that the planner is using the newly created hypothetical index.
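
As noted earlier, EXPLAIN ANALYZE actually executes the query and therefore cannot use the hypothetical index; if we run it on the same query (shown here for illustration), the plan falls back to the sequential scan and sort seen in the first EXPLAIN, this time with actual run times:

testdb=# explain analyze select * from orders where orderno > 80000  order by order_created desc  limit 100 ;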

We can drop the index as follows:

testdb=# select * from hypopg_drop_index(24797);
 hypopg_drop_index 
-------------------
 t
(1 row)
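
If several hypothetical indexes have been created in a session, hypopg_reset() (one of the functions listed above) removes all of them in a single call:

testdb=# select * from hypopg_reset();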

We can also check the estimated size of the virtually created index as follows:

testdb=# SELECT * FROM hypopg_create_index('create index order_created_idx on orders(order_created)');
 indexrelid |             indexname             
------------+-----------------------------------
      24798 | <24798>btree_orders_order_created
(1 row)

testdb=# select * from pg_size_pretty(hypopg_relation_size(24798));
 pg_size_pretty 
----------------
 2990 MB
(1 row)
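
We can also ask hypopg for the definition of the hypothetical index via hypopg_get_indexdef() (also listed above), which is useful to double-check what was actually simulated:

testdb=# select * from hypopg_get_indexdef(24798);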

Now let's create the actual index and see what its real size is:

testdb=# create index order_created_idx on orders(order_created);
CREATE INDEX
testdb=# \di+ order_created_idx
                               List of relations
 Schema |       Name        | Type  |  Owner   | Table  |  Size   | Description 
--------+-------------------+-------+----------+--------+---------+-------------
 public | order_created_idx | index | postgres | orders | 2142 MB | 
(1 row)
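
For a direct numeric comparison with the hypopg estimate, the same size functions can be run against the real index (the figure should agree with the \di+ output above):

testdb=# select pg_size_pretty(pg_relation_size('order_created_idx'));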

As seen, the estimated and actual sizes are comparable.

I hope this clarifies the usage of hypothetical indexes in PostgreSQL. In one of our earlier blogs we covered why an index may not be used, and also how to check which tables need an index.

In future blogs we will talk about how to work out the exact index you need to create in your database.

Stay tuned to hello worlds . . .

References: https://www.percona.com/blog/2019/06/21/hypothetical-indexes-in-postgresql/

A Database Session Leak Can Slow Down Your Database

Session Leak is very common in developers’ lives. In this article, I will explain some consequences of session leaks, which I faced when an issue came to me and how I came to the root cause of the issue.

Issue and Analysis

Issue: Load is increasing on the server, and Postgres queries are consuming CPU and taking time.

I had to resolve this issue and find the root cause of it.

Analysis: After all my debugging, I came to the following analysis:

There is a table that has 200 rows, but the number of live tuples shown for it is much higher (around 60K). We are using PostgreSQL 9.3.

The following are the queries that I ran.

select count(*) from subscriber_offset_manager;
 count
-------
   200
(1 row)

SELECT schemaname,relname,n_live_tup,n_dead_tup FROM pg_stat_user_tables where relname='subscriber_offset_manager' ORDER BY n_dead_tup ;
 schemaname |          relname          | n_live_tup | n_dead_tup
------------+---------------------------+------------+------------
 public     | subscriber_offset_manager |      61453 |          5
(1 row)

But, as seen from pg_stat_activity and pg_locks, we were not able to find any open connection holding a lock on this table.

SELECT query, state,locktype,mode FROM pg_locks JOIN pg_stat_activity USING (pid) WHERE relation::regclass = 'subscriber_offset_manager'::regclass ; 
query | state | locktype | mode 
-------+-------+----------+------
(0 rows)

I also tried a full vacuum on this table, with the following results:

  • Each time, no rows were removed.
  • Many times, all the live tuples became dead tuples.

Here is the output of running the vacuum command:

vacuum FULL VERBOSE ANALYZE subscriber_offset_manager;
INFO: vacuuming "public.subscriber_offset_manager"
INFO: "subscriber_offset_manager": found 0 removable, 67920 nonremovable row versions in 714 pages
DETAIL: 67720 dead row versions cannot be removed yet. CPU 0.01s/0.06u sec elapsed 0.13 sec.
INFO: analyzing "public.subscriber_offset_manager"
INFO: "subscriber_offset_manager": scanned 710 of 710 pages, containing 200 live rows and 67720 dead rows; 200 rows in sample, 200 estimated total rows
VACUUM
After that, I checked the live and dead tuples for the table, as follows:


SELECT schemaname,relname,n_live_tup,n_dead_tup FROM pg_stat_user_tables where relname='subscriber_offset_manager' ORDER BY n_dead_tup ;
 schemaname |          relname          | n_live_tup | n_dead_tup
------------+---------------------------+------------+------------
 public     | subscriber_offset_manager |        200 |      67749
(1 row)

After 10 seconds:

SELECT schemaname,relname,n_live_tup,n_dead_tup FROM pg_stat_user_tables where relname='subscriber_offset_manager' ORDER BY n_dead_tup ;
 schemaname |          relname          | n_live_tup | n_dead_tup
------------+---------------------------+------------+------------
 public     | subscriber_offset_manager |      68325 |        132
(1 row)

All the dead tuples had turned back into live tuples instead of being cleaned up.

One more interesting observation: when I stopped my Java app and then ran a full vacuum, it worked fine (the number of rows and the number of live tuples became equal). So something was wrong with the way we were continuously selecting and updating from the Java app.
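
As an aside (a hedged sketch; the columns used here all exist in PostgreSQL 9.3), such misbehaving application sessions can often be spotted in pg_stat_activity as connections sitting "idle in transaction":

SELECT pid, state, xact_start, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;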

After all the research and analysis, with help from Stack Overflow and many of the links below, I found the following root cause.

Root Cause:

When there is a long-running transaction or a database session leak, dead tuples created after the start of that transaction cannot be cleaned up by vacuum, in any table of that database. This is because the vacuum process only removes dead rows belonging to transactions with a transaction ID lower than that of the oldest transaction still running, and transaction IDs are generated globally across the database.

When I checked, I found a transaction that had been open for far too long, and when I killed it, the vacuum worked fine.
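
For illustration (a sketch, assuming sufficient privileges; the pid is hypothetical), the oldest open transactions can be found and then terminated like this:

SELECT pid, xact_start, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start LIMIT 5;

-- terminate the offending backend by its pid (example pid)
SELECT pg_terminate_backend(12345);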

Please read the links below for detailed information on the effects of not ending database transactions and on some Postgres internals. These links helped me a lot in solving the issue.

1. Question asked by me on Stack Overflow: https://stackoverflow.com/questions/51844616/high-number-of-live-dead-tuples-in-postgresql-vacuum-not-working/51849491?noredirect=1#comment90728543_51849491

2. When PostgreSQL vacuum does not work: https://www.cybertec-postgresql.com/en/reasons-why-vacuum-wont-remove-dead-rows/

3. Consequences of not ending a database transaction: https://stackoverflow.com/questions/33815267/what-are-the-consequences-of-not-ending-a-database-transaction/33815610#33815610

4. Some Postgres internals and how vacuum works internally in Postgres: https://fritshoogland.wordpress.com/category/postgresql/