Understand Deadlock in Postgres and how to fix it

In this we will understand what a deadlock is in Postgres and how we can possibly solve this at the application level.

Deadlock in Postgres

if you have seen a log like this in Postgres :

< 2021-09-10 11:25:59.843 IST >ERROR:  deadlock detected
< 2021-09-10 11:25:59.843 IST >DETAIL:  Process 6819 waits for ShareLock on transaction 8139; blocked by process 6711.
        Process 6711 waits for ShareLock on transaction 8141; blocked by process 6819.
        Process 6819: update test  set abc = 12 where abc = 1;
        Process 6711: update test set abc = 12 where abc =3;

you have seen a deadlock. Now let’s understand what a deadlock is. Deadlock logically signifies person A is waiting for person B and in turn person B is waiting on Person A , now both of them cannot move forward, so a deadlock.

Now lets try to decode the deadlock in postgres log above:

  • In this i have created two transaction
  • transaction A first update table <test> where column <abc> value is 1 taking lock on all rows with value abc = 1.
  • then transaction B update table <test> where column <abc> value is 3 taking lock on all rows with value abc = 3.
  • now transaction A wants to update table <test> where column <abc> value is 3 but cannot do it as transaction B is having a lock on rows with <abc> column value 3.
  • now transaction B wants to update table <test> where column <abc> value is 1 but cannot do it as transaction A is having a lock on rows with <abc> column value 1.
  • now these two transactions are part of a deadlock and no one can move forward.
  • as soon a postgres detect this it will try to break the deadlock and rollback one of the transactions.

There are various kinds of lock which are there is postgres, following picture shows you which locks are there and their blocking type:

We will not go in much detail of the locks type in here.

How to resolve deadlock

  • Order the transaction is such a way so that locks are always taken in the same order (very common basic solution for this)
    • Eg : your application lets say need to work on 100 objects at a time in a single transaction and multiple such threads can work at the same time
    • In this case sort those 100 objects on some common identifier present in that same.
    • Now do the transactions , now these threads will not create the deadlock as shown.
  • Normally do not work on lots of objects in single transaction, this is not just bad for deadlock but also it might be a bigger transaction which can slow down the DB.
  • If you cannot order or finalize you can create a pipeline in the application level only – normally one should not do this unless no other solution is feasible.

Now if in your case you are still not able to resolve the deadlock in your system , do let us know on the comments section we will try to help you out.

Automatic indexes recommendations in PostgreSQL

In our last blog we learned about  Need and Usage of Hypothetical indexes in Postgresql  . Now we can check easily in live environment also if some particular index will be helpful or not , but  how we get to know which index to test  .  It requires in-depth knowledge of indexing and experience in Postgresql , But in PostgreSql we can get automatic recommendation of indexes for specific queries by using three extensions hypog,pg_stat_statements,pg_qualstats

Now Let’s move directly to the practical part how we can use  this feature in postgres  : 

i am doing experiments on Postgres10 installed on Centos7 .


yum install pg_qualstats10.x86_64
  • Change the following in postgresql.conf and restart postgresql 
shared_preload_libraries = 'pg_stat_statements, pg_qualstats' 
  • Create following extensions : 
testdb=# CREATE EXTENSION hypopg ;
testdb=# CREATE EXTENSION pg_stat_statements ;
testdb=# CREATE EXTENSION pg_qualstats;
  • Set sample rate of pgqual stats to 1 in postgresql.conf  . This rate define how frequently monitor  and analyze the queries . Value ‘1’ represents that keep track of all queries
pg_qualstats.sample_rate = '1'
  • Create the function which will be used to detect usable indexes : 
CREATE OR REPLACE FUNCTION find_usable_indexes()
    l_queries     record;
    l_querytext     text;
    l_idx_def       text;
    l_bef_exp       text;
    l_after_exp     text;
    hypo_idx      record;
    l_attr        record;
    /* l_err       int; */
    CREATE TABLE IF NOT EXISTS public.idx_recommendations (queryid bigint, 
    query text, current_plan jsonb, recmnded_index text, hypo_plan jsonb);
    FOR l_queries IN
    SELECT t.relid, t.relname, t.queryid, t.attnames, t.attnums, 
    pg_qualstats_example_query(t.queryid) as query
         SELECT qs.relid::regclass AS relname, qs.relid AS relid, qs.queryid, 
         string_agg(DISTINCT attnames.attnames,',') AS attnames, qs.attnums
         FROM pg_qualstats_all qs
         JOIN pg_qualstats q ON q.queryid = qs.queryid
         JOIN pg_stat_statements ps ON q.queryid = ps.queryid
         JOIN pg_amop amop ON amop.amopopr = qs.opno
         JOIN pg_am ON amop.amopmethod = pg_am.oid,
               SELECT pg_attribute.attname AS attnames
               FROM pg_attribute
               JOIN unnest(qs.attnums) a(a) ON a.a = pg_attribute.attnum 
               AND pg_attribute.attrelid = qs.relid
               ORDER BY pg_attribute.attnum) attnames,     
         LATERAL unnest(qs.attnums) attnum(attnum)
               WHERE NOT 
                       SELECT 1
                       FROM pg_index i
                       WHERE i.indrelid = qs.relid AND 
                       (arraycontains((i.indkey::integer[])[0:array_length(qs.attnums, 1) - 1], 
                        qs.attnums::integer[]) OR arraycontains(qs.attnums::integer[], 
                        (i.indkey::integer[])[0:array_length(i.indkey, 1) + 1]) AND i.indisunique)))
                       GROUP BY qs.relid, qs.queryid, qs.qualnodeid, qs.attnums) t
                       GROUP BY t.relid, t.relname, t.queryid, t.attnames, t.attnums                   
        /* RAISE NOTICE '% : is queryid',l_queries.queryid; */
        execute 'explain (FORMAT JSON) '||l_queries.query INTO l_bef_exp;
        execute 'select hypopg_reset()';
        execute 'SELECT indexrelid,indexname FROM hypopg_create_index(''CREATE INDEX on '||l_queries.relname||'('||l_queries.attnames||')'')' INTO hypo_idx;      
        execute 'explain (FORMAT JSON) '||l_queries.query INTO l_after_exp;
        execute 'select hypopg_get_indexdef('||hypo_idx.indexrelid||')' INTO l_idx_def;
        INSERT INTO public.idx_recommendations (queryid,query,current_plan,recmnded_index,hypo_plan) 
        VALUES (l_queries.queryid,l_querytext,l_bef_exp::jsonb,l_idx_def,l_after_exp::jsonb);        
    END LOOP;    
        execute 'select hypopg_reset()';
$$ LANGUAGE plpgsql;


  • Now Let’s make a table with 10 Crores rows as following : 
testdb=# create table orders as select s as orderno, md5(random()::text) as orderitem , now() as order_created from generate_Series(1,100000000) s;
SELECT 100000000
  • Now lets run a query on which we want to check if need to make any index  
testdb=# select * from orders where orderno = 80000  ;
-[ RECORD 1 ]-+---------------------------------
orderno       | 80000
orderitem     | 03b41c2f32d99e9a597010608946c4c6
order_created | 2021-05-22 17:52:21.435936+05:30
  • Now run following queries to find out  which indexes are recommend by this extension and what are the improvement percentage by applying these indexes hypothetically 

testdb=#           select find_usable_indexes();
(1 row)

testdb=# select b.query, a.recmnded_index,round((((a.current_plan->0->'Plan'->>'Total Cost')::numeric-(hypo_plan->0->'Plan'->>'Total Cost')::numeric)*100/(a.current_plan->0->'Plan'->>'Total Cost')::numeric),2) as percent_improvd FROM idx_recommendations a JOIN pg_stat_statements b ON a.queryid = b.queryid WHERE round((((current_plan->0->'Plan'->>'Total Cost')::numeric-(hypo_plan->0->'Plan'->>'Total Cost')::numeric)*100/(current_plan->0->'Plan'->>'Total Cost')::numeric),2) > 0 order by 3 desc ;
                        query                        |                          recmnded_index                          | percent_improvd 
 select * from orders where orderno = $1             | CREATE INDEX ON public.orders USING btree (orderno)              |          100.00

Above  analysis was internally done by creating the indexes hypothetically not by making real indexes .

Please Note here that do not fully rely on automatic index recommendation . Yes we have no doubt it is very very useful feature  , but please also check logically why these recommended indexes are useful and should you really create it or not. 

You can read PostgreSQL Index Tutorial Series for basic in depth understanding of indexes in postgresql . 

Now , You can try the explained feature with more complex queries and comment on this article explaining your result  with your queries . So let’s experiment and comment . 

Refrences : https://www.percona.com/blog/2019/07/22/automatic-index-recommendations-in-postgresql-using-pg_qualstats-and-hypopg/

Hypothetical Indexes in PostgreSQL – Need and Usage

In this Blog We will first cover what are hypothetical indexes and what is need of these type of indexes and then secondly  we shall see the practical usage of these indexes . 

Need of Hypothetical Indexes : 

As the name suggests   these are not real indexes,  these are hypothetical indexes i.e.. They are virtual indexes which PostgreSQL  query planner does not consider when running queries . 

 Now the question arises where these Hypothetical Indexes  are useful  ? ? ? 

First Let’s discuss one scenario  , we have a large table which is currently in the production environment  and we need to make some index on live db and we are not sure whether that index will be useful or not , we even don’t know if  by making that index our production environment may be down !!!

So , Solution of above problem will be following : 

  • Lets ignore the risk and make  the index on live table which can result the following scenario 
    • first of all it will take lots of time depending on data present in table 
    • Live queries may get affected badly if we are not sure if the index we are making will increase or decrease the cost .
    • We also do not know the size of the index it may be too large which can again impact the production database server . 
  • Another solution is to replicate the production database to the local dev environment and then apply all the hits and try there and then apply at the production environment .  it seems a very safe and effective approach in almost all cases but this will take too much of time in setting up the things and testing with actual scenario .  
  • Third Solution is Hypothetical Indexes  as this functionality will create imaginary indexes not real indexes .  But  there are some things to note about these indexes :
    •  it creates an index in our connection’s private memory. None of the catalog tables nor the original tables are actually touched. 
    • The only way to see if we can benefit from that index is by running an EXPLAIN <QUERY>.
    •  If you wish to run an EXPLAIN ANALYZE that runs that SQL and provides the run time stats, it would not be able to use that hypothetical index. This is because a hypothetical index does not exist in reality.
    • Currently these indexes will work on BTREE ONLY . However you can try if it works on other type of indexes .

Usage of Hypothetical indexes : 


I am using PostgreSQL10 on CentOS7 . 

  • Download hypopg by following command : 
Wget  https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7-x86_64/hypopg_10-1.1.4-1.rhel7.x86_64.rpm
  • Then install it on CentOS7 :
yum install hypopg_10-1.1.4-1.rhel7.x86_64.rpm
  • Now create extension using following query 
testdb=# CREATE EXTENSION hypopg ;
  • On Creating extension following functions will be created . 
testdb=#  select proname from pg_proc where proname ilike '%hyp%';


Now Let’s make a table with 10 Crores rows as following : 

testdb=# create table orders as select s as orderno, md5(random()::text) as orderitem , now() as order_created from generate_Series(1,100000000) s;
SELECT 100000000

Now check COST of a query by running explain : 

testdb=# explain select * from orders where orderno > 80000  order by order_created desc  limit 100 ;
                                   QUERY PLAN                                   
 Limit  (cost=3600088.98..3600089.23 rows=100 width=44)
   ->  Sort  (cost=3600088.98..3688095.27 rows=35202513 width=44)
         Sort Key: order_created DESC
         ->  Seq Scan on orders  (cost=0.00..2254674.25 rows=35202513 width=44)
               Filter: (orderno > 80000)
(5 rows)

Now create the Hypothetical Index : 

testdb=# SELECT * FROM hypopg_create_index('create index order_created_idx on orders(order_created)');
 indexrelid |             indexname             
      24797 | <24797>btree_orders_order_created
(1 row)

Now again do Explain to check if above index may be useful or not : 

testdb=# explain select * from orders where orderno > 80000  order by order_created desc  limit 100 ;
                                                          QUERY PLAN                                                           
 Limit  (cost=0.07..4.29 rows=100 width=45)
   ->  Index Scan Backward using "<24797>btree_orders_order_created" on orders  (cost=0.07..4215496.19 rows=99917459 width=45)
         Filter: (orderno > 80000)
(3 rows)

from both Explain command output we can clearly see the diffrence in cost and can also see that planner is using newly created hypothetical index .

We can Drop the index as follows : 

testdb=# select * from hypopg_drop_index(24797);
(1 row)

We can also check the estimated size of index created virtually as follows : 

testdb=# SELECT * FROM hypopg_create_index('create index order_created_idx on orders(order_created)');
 indexrelid |             indexname             
      24798 | <24798>btree_orders_order_created
(1 row)

testdb=# select * from pg_size_pretty(hypopg_relation_size(24798));
 2990 MB
(1 row)

Now lets create actual index and see what will be the actual size

testdb=# create index order_created_idx on orders(order_created);
testdb=# \di+ order_created_idx
                               List of relations
 Schema |       Name        | Type  |  Owner   | Table  |  Size   | Description 
 public | order_created_idx | index | postgres | orders | 2142 MB | 
(1 row)

As seen estimated and actual size is comparable .

I  Hope it clears the usage if hypothetical index in PostgreSQL .  In one of our blogs we learned about why index is not working  and also how to check on which tables index needed .  

In our future blogs we will talk about how you will get to know about exact index you need to make in database   . 

Stay Tuned to hello worlds . . . 

References : https://www.percona.com/blog/2019/06/21/hypothetical-indexes-in-postgresql/

Understanding Postgresql Indexes For Beginners- Part -5 (Partial Indexes)

In last blog we learned about Gin indexes. By now we have covered the basic indexes we use in Postgresql.

Now lets look at one of the ways by which we can optimize our indexes for business specific use case.

Partial Indexing

Partial Indexing means that we can create index on some subset of the actual rows of the table.

This type of indexing is very very useful in cases where queries are required on the subset very often.

Lets take a example:

We have a ticketing table(subject text, status varchar) which contains subject of tickets and status whether the ticket is OPEN or CLOSED. Now lets say in our case we primarily search on open tickets only and in general CLOSED Tickets would be increasing constantly.

Lets assume on our ticket table size – 100 million rows , out of which 5 million are OPEN and 95 million are closed.

Now if we create a index on entire table (100 million rows) our index would also be large and every time if we want to search on OPEN ticket only , still we are loading that huge index.

Now with Partial index we can create Gin (text search) only on OPEN ticket rows our index would only work on OPEN Tickets (only 5 million rows), in such case our index would be very small and efficient and fast to load in memory.

How to use

lets create table for tickets only:

create table tickets(subject  tsvector , status varchar);

INSERT INTO tickets ( subject , status) SELECT md5(random()::text)::tsvector,CASE WHEN g < 500000 THEN 'OPEN' ELSE 'CLOSED' END FROM generate_series(1,8000000) as g;

Now lets create a complete index and partial index:

create index t_s_idx on tickets using gin(subject);

Partial Index :

create index t_s_open_idx on tickets using gin(subject) where status='OPEN';

create index t_s_open_idx on tickets using gin(subject) where <CONDITION EXPRESSION>

Size diff. between complete and partial

\di+ t_s_open_idx 
                            List of relations
 Schema |     Name     | Type  |  Owner   |  Table  | Size  | Description 
 public | t_s_open_idx | index | postgres | tickets | 40 MB | 
(1 row)

saarthi=# \di+ t_s_idx 
                          List of relations
 Schema |  Name   | Type  |  Owner   |  Table  |  Size  | Description 
 public | t_s_idx | index | postgres | tickets | 662 MB | 
(1 row)

It is clearly visible that index size is 662 MB vs 40 MB and in general as OPEN tickets will not be constantly increasing this index would be of same size and mostly can remain in memory.

When Partial index would work

Partial Index would work only when the query also contain the search condition which is present while creating the index.

Like in our ticketing case index has condition status =’OPEN’ and if the query also has the condition status=’OPEN’ only then it works.

Most index like Btree, Gin , BRIN can be used as a partial index. Using partial index is one of the most powerful tool to speed up your system.

Understanding Postgresql Indexes For Beginners- Part -4 (Gin)

In the last blog we learned about BRIN index parameters and when to choose it. In this we will learn basic of Gin index.

Gin index is the that can speed up your full text search, lets understand how.

Gin Index

Gin Index – (Generalized Inverted Index) – An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple words, it is a hashmap like data structure that directs you from a word to a document or a web page. If you want more detail reading go to.

Now in case of Postgresql when we create any Gin index on a text column or the type of column is tsvector (A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word )

Gin does is take all the distinct lexemes and break them into trigrams and map to the rows it points to and it will point to multiple rows if multiple rows contains that lexemes.

Now in this way when we search the any three character in the column and gin index will return the result in fast manner.

Let’s see this in action:

First create a table:

create table student(name  text , address tsvector);

Now let’s add some data:

INSERT INTO student ( name , address) SELECT md5(random()::text), md5(random()::text)::tsvector FROM generate_series(1,800000) as g;

Now lets check for search in name (text) and address (ts_vector fields)

 explain analyze select * from student where name ilike '%puneet%';
                                                        QUERY PLAN                                                         
 Gather  (cost=1000.00..98897.03 rows=1887 width=64) (actual time=316.414..319.296 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on student  (cost=0.00..97708.33 rows=786 width=64) (actual time=307.882..307.883 rows=0 loops=3)
         Filter: (name ~~* '%puneet%'::text)
         Rows Removed by Filter: 266667
 Planning time: 0.089 ms
 Execution time: 319.353 ms

explain analyze select * from student where address @@ to_tsquery('gurgaon'); 
                                                         QUERY PLAN                                                          
 Gather  (cost=1000.00..154940.76 rows=3956 width=74) (actual time=515.955..518.074 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on student  (cost=0.00..153545.16 rows=1648 width=74) (actual time=507.235..507.235 rows=0 loops=3)
         Filter: (address @@ to_tsquery('gurgaon'::text))
         Rows Removed by Filter: 266667
 Planning time: 0.125 ms
 Execution time: 518.102 ms
(8 rows)

Now you see its taking 391 ms and 518 ms

Let’s create indexes :

create index a_g_idx on student using gin(address);


create index name_g_idx on student using gin(name gin_trgm_ops);

Now let’s see the time:

explain analyze select * from student where address @@ to_tsquery('gurgaon'); 
                                                     QUERY PLAN                                                     
 Bitmap Heap Scan on student  (cost=55.25..13838.72 rows=4000 width=74) (actual time=0.067..0.068 rows=0 loops=1)
   Recheck Cond: (address @@ to_tsquery('gurgaon'::text))
   ->  Bitmap Index Scan on a_g_idx  (cost=0.00..54.25 rows=4000 width=0) (actual time=0.063..0.064 rows=0 loops=1)
         Index Cond: (address @@ to_tsquery('gurgaon'::text))
 Planning time: 0.472 ms
 Execution time: 0.140 ms

explain analyze select * from student where name ilike '%puneet%';
                                                     QUERY PLAN                                                      
 Bitmap Heap Scan on student  (cost=60.62..373.33 rows=80 width=74) (actual time=0.032..0.033 rows=0 loops=1)
   Recheck Cond: (name ~~* '%puneet%'::text)
   ->  Bitmap Index Scan on name_g_idx  (cost=0.00..60.60 rows=80 width=0) (actual time=0.029..0.030 rows=0 loops=1)
         Index Cond: (name ~~* '%puneet%'::text)
 Planning time: 6.547 ms
 Execution time: 0.083 ms

you see a huge difference in timings : from 319 ms to 0.14 and 518 to 0.083

Now we have seen the power of Gin Index in Postgresql and what text search in can provide.

Surely there are various parameters to configure and various other use cases of text search that can be done with gin.

Here our purpose was to understand its basics, in future we will work on details also.

In next blog we will work on Partial Indexs. Stay tuned and subscribe our blog for more such blogs.

Understanding Postgresql Indexes for Beginners – Part 4 (BRIN Indexes Optimization)

In last blog we learned about BRIN indexes , in this we will go in little detail on optimizing BRIN indexes and parameters for its optimization.

As we learned that BRIN index basically store the summary information (min and max ) for a particular page range(preferrably adjacent) such that it becomes easy to find which pages to cross check for a particular filter.

In this we will understand the BRIN index update mechanism and optimize the page range parameter.

When BRIN INDEX is updated

  • Index Creation
    • When the index is created it will create summary for all the pages of the table.
  • Table Update
    • For all page which are already summarized if any entry added / updated / removed , the index is updated in the transaction only
    • For new pages which are not summarized
      • Manual Update
        • By default BRIN does not auto update for this one needs to run the function –
          • select brin_summarize_new_values('<index name>'::regclass);
      • Auto Update
        • On autovacuum system will add the summary info for those blocks as well, we need to create index with auto vacuum on like this
          • create index c_t_brin_idx on calllog using brin(call_time) with (pages_per_range = 32,autosummarize = on);

How to optimize BRIN Indexes

Primary parameter to optimize BRIN indexes is the pages_per_range , this balance between no. of pages which needs to be scanned (because of lossy nature) vs no. of index entry created and index maintenance needs.

This parameter needs to be optimized based on individual needs , we will try to provide some guiding rules here, but before that we should do some math here and calculate the number of pages and rows per page:

Calculate no. of pages :

select relpages from pg_class where relname = 'student';;

Now the number of pages is 5406

Now lets calculate total records :

select count(*) from student;

Now we have total rows as 1 million , lets count no. of record per page:

select 1000000/5406 as row_per_page;

Now we see that row per page is 184 in our case, this will be diff. in you case so before proceeding check in your table.

and default value is 128 in page per range — so minimum values our index need to filter for every query is : 184*128 = 23552

Now lets try to see the rules, also rules needs to be balanced for best performance as in databases everything depends on your needs:

  • Table should be Large – something > 5 millions
    • if less data then benefit of small size of index would be less and Btree would also be small.
  • Data should primarily be INSERT Only
    • If we are constantly updating the data and also the data is updated in such a way that single value occur in multiple page range (eg: page 1-10 raneg is 100 -550 and page 11-20 range is 250-800), in this case the system needs to query both range pages which will slow down the filter as pages increase
  • Data should be inserted in IN ORDER
    • If we are inserting data in such a way that the column in which we are creating index that single value occur in multiple page range (eg: page 1-10 raneg is 100 -550 and page 11-20 range is 250-800), in this case the system needs to query both range pages which will slow down the filter as pages increase
  • Search record are generally in large numbers > 1000
    • if we are selecting single value Btree will generally out perform the BRIN , but if we are selecting the no. of records in some 1000’s , BRIN will start to perform better at large count.

Al in all one needs to be very clear in the above points before deciding to use BRIN.

In next blog we will move to Gin Indexes.

Subscribe our blog for more indepth tutorials.

Understanding Postgresql Indexes for Beginners – Part 3 (BRIN Index)

In last blogs we understood basics of Btree and Multi Column Btree. In this blog we will see Postgresql BRIN Indexes.

BRIN Index

Block Range Index – It is a revolutionary idea first proposed in 2015 by Alvaro.

The most fundamental difference in this index is rather than storing the actual values in the index and point to rows. It actually stores the range information about the pages in which the rows are stored.

For eg: Lets say we have a student table with 1000 roll no. and these rows are stored in 100 pages and pages 1-10 are adjacent then 10-20 and then 20-30 and so on. If we use BTree index it would create a tree of 1000 roll no values and no. of nodes say for eg in this tree would be 1000. But what BRIN would do is it will store the max value of roll no and min values of roll no for page ranges (lets say page range is of 10 pages) so BRIN index would store 1 values(min and max) for page range 1-20 then 1 values for 10-20 and so on effectively it would store only 10 values rather than 1000.

You see this huge diff. between the values stored in BRIN vs BTree. This marks for huge performance improvement for storing of indexes but i think you would have realized that this is a lossy index

Lossy What?

What we meant by lossy here, lets go by the example again as index is only storing roll no min and roll no max value in pages between 1-10 (which contains 100 roll no.) , now if you check for lest say roll no. – 10 and mix and max value for page 1-10 is 1 – 100 . Now as it is just min and max we are still not sure whether the values 10 exist in the pages or not. For this system needs to go to every row in pages 1-10 and check whether roll no exist or not. This is what we meant as lossy , means index is not confirmng whether the value exist or not.

When to use BRIN

If your data is such that which mostly insert only like logs or history kind of stuff and your business requirement is to query recent logs or some date range logs then it is great to use BRIN as index as this would drastically reduce size of index, index maintenance , you would generally be searching for a range.

Lets take a real life example, create a table with 100000 logs data:

CREATE TABLE calllog (call_time timestamp not null, call_result text , no_of_participant integer); 

INSERT INTO calllog ( call_time, no_of_participant, call_result) SELECT g, CURRENT_TIMESTAMP + ( g || 'minute' ) :: interval, random() * 10, md5(g::text) FROM generate_series(1,8000000) as g;

Now lets search for some cal logs between now() and now() – 5 hour:

explain analyze  select * from calllog  where call_time  between now()  and  now() - interval '5 hour';
                                                 QUERY PLAN                                                  
 Seq Scan on calllog  (cost=0.00..254764.68 rows=1 width=45) (actual time=1469.561..1469.561 rows=0 loops=1)
   Filter: ((call_time >= now()) AND (call_time <= (now() - '05:00:00'::interval)))
   Rows Removed by Filter: 8000000
 Planning time: 0.280 ms
 Execution time: 1469.596 ms

Lets see what happens is we use BTree here :

create index ct_idx on calllog(call_time);

explain analyze  select * from calllog  where call_time  between now()  and  now() - interval '5 hour';
                                                   QUERY PLAN                                                    
 Index Scan using ct_idx on calllog  (cost=0.44..8.46 rows=1 width=45) (actual time=0.040..0.041 rows=0 loops=1)
   Index Cond: ((call_time >= now()) AND (call_time <= (now() - '05:00:00'::interval)))
 Planning time: 2.214 ms
 Execution time: 0.086 ms

You see a huge diff. from 1469 ms to 0.086 ms.

Now lets see what brin would do:

create index c_t_brin_idx  on calllog using brin(call_time) with (pages_per_range = 32,autosummarize = on);

drop  index ct_idx ;

 explain analyze  select * from calllog  where call_time  between now()  and  now() - interval '5 hour';
                                                       QUERY PLAN                                                        
 Bitmap Heap Scan on calllog  (cost=45.02..11385.57 rows=1 width=45) (actual time=2.259..2.260 rows=0 loops=1)
   Recheck Cond: ((call_time >= now()) AND (call_time <= (now() - '05:00:00'::interval)))
   ->  Bitmap Index Scan on c_t_brin_idx  (cost=0.00..45.02 rows=3423 width=0) (actual time=2.256..2.256 rows=0 loops=1)
         Index Cond: ((call_time >= now()) AND (call_time <= (now() - '05:00:00'::interval)))
 Planning time: 0.202 ms
 Execution time: 2.315 ms

You see 2.315 ms still very less than 1469 ms but greater than 0.086 ms.

Why BRIN over BTree when Btree is speedy??

  1. Size of BRIN vs BTree
\di+ c_t_brin_idx 
                             List of relations
 Schema |     Name     | Type  |  Owner   |  Table  |  Size  | Description 
 public | c_t_brin_idx | index | postgres | calllog | 120 kB | 

 \di+ ct_idx 
                          List of relations
 Schema |  Name  | Type  |  Owner   |  Table  |  Size  | Description 
 public | ct_idx | index | postgres | calllog | 171 MB | 

You see there is huge difference in size and when there are lot of tables and indexes in your system such that you cannot have this 172MB in memory, those cases BRIN becomes very powerful

2. Index Maintenance – BTree indexes are costly to maintain as they would changes in every DML operation and has to be done in same transaction. While BRIN is put this offloading to vaccuming.

By Now i guess you understood lots of power of BRIN indexes and also the benefits over BTree.

You can ask your questions in the comment section we will try best to answer in your specific cases:

Next we will see parameters to optimize BRIN and then move to Gin Indexes.

Please subscribe for such indepth blogs.

Tune Linux Kernel Parameters For PostgreSQL Optimization and better System Performance


In my previous Article i explained  Tuning PostgreSQL Database Memory Configuration Parameters to Optimize Performance and as i said  Database performance does not only depend on Postgresql configurations but also on system parameters .Poorly configured OS kernel parameters can cause degradation in database server performance. Therefore, it is imperative that these parameters are configured according to the database server and its workload. In this article  i will be talking about centos/redhat  linux system specially . 


I will start the article with small story where on one of our client huge amount of writes were there and customer have provided us 200 GB of RAM for that dedicated database server , So there were no problem of resources. 

Now what was happening that after sometime system loads get increased so much and on debugging we found no special query  around the time when load increases . Somewhere over internet we found if we clear the system cache  regularly then issue will be resolved . 

We then schedule a cron to clear system cache after some regular interval and issue got resolved . 

Now  the question is why issue was not coming after this ? ? ?

And the Answer is  that due to large cache size as we have so much of ram available   lots of data is collected in RAM (in GB’s) and when this whole data flushes out on to the disk   ,  system load becomes high at that time 

So from that we came to know that we also need to tune some system parameters also to optimize system and database(postgresql) performance . 

In above case we tuned vm.dirty_background_ratio and vm.dirty_ratio , these two system(os) parameters to resolve the issue . 

Kernel parameters Tuning

Now what values we set for these above two parameters described in story and what are all other  important Linux kernel parameters that can affect database server performance which we can tune are described as follows : 

vm.dirty_background_ratio / vm.dirty_background_bytes

The vm.dirty_background_ratio is the percentage of memory filled with dirty pages that need to be flushed to disk. Flushing is done in the background. The value of this parameter ranges from 0 to 100; however, a value lower than 5 may not be effective and some kernels do not internally support it. The default value is 10 on most Linux systems. You can gain performance for write-intensive operations with a lower ratio, which means that Linux flushes dirty pages in the background.

You need to set a value of vm.dirty_background_bytes depending on your disk speed.

There are no “good” values for these two parameters since both depend on the hardware. However, setting vm.dirty_background_ratio to 5 and vm.dirty_background_bytes to 25% of your disk speed improves performance by up to ~25% in most cases.

vm.dirty_ratio / dirty_bytes

This is the same as vm.dirty_background_ratio / dirty_background_bytes except that the flushing is done in the foreground, blocking the application. So vm.dirty_ratio should be higher than vm.dirty_background_ratio. This will ensure that background processes kick in before the foreground processes to avoid blocking the application, as much as possible. You can tune the difference between the two ratios depending on your disk IO


vm.swappiness is another kernel parameter that can affect the performance of the database. This parameter is used to control the swappiness (swapping pages to and from swap memory into RAM) behavior on a Linux system. The value ranges from 0 to 100. It controls how much memory will be swapped or paged out. Zero means disable swap and 100 means aggressive swapping.

You may get good performance by setting lower values.

Setting a value of 0 in newer kernels may cause the OOM Killer (out of memory killer process in Linux) to kill the process. Therefore, you can be on the safe side and set the value to 1 if you want to minimize swapping. The default value on a Linux system is 60. A higher value causes the MMU (memory management unit) to utilize more swap space than RAM, whereas a lower value preserves more data/code in memory.

A smaller value is a good bet to improve performance in PostgreSQL.

vm.overcommit_memory / vm.overcommit_ratio

Applications acquire memory and free that memory when it is no longer needed. But in some cases, an application acquires too much memory and does not release it.  This can invoke the OOM killer. Here are the possible values for vm.overcommit_memory parameter with a description for each:

  1. Heuristic overcommit, Do it intelligently (default); based kernel heuristics
  2. Allow overcommit anyway
  3. Don’t over commit beyond the overcommit ratio.

Reference: https://www.kernel.org/doc/Documentation/vm/overcommit-accounting

vm.overcommit_ratio is the percentage of RAM that is available for overcommitment. A value of 50% on a system with 2 GB of RAM may commit up to 3 GB of RAM.

A value of 2 for vm.overcommit_memory yields better performance for PostgreSQL. This value maximizes RAM utilization by the server process without any significant risk of getting killed by the OOM killer process. An application will be able to overcommit, but only within the overcommit ratio, thus reducing the risk of having OOM killer kill the process. Hence a value to 2 gives better performance than the default 0 value. However, reliability can be improved by ensuring that memory beyond an allowable range is not overcommitted. It avoids the risk of the process being killed by OOM-killer.

On systems without swap, one may experience a problem when vm.overcommit_memory is 2.


Generally speaking almost all applications which uses more memory depends on this , For example  , In Redis setting this value 1 is best . 

Turn On Huge Pages

Linux, by default, uses 4K memory pages, BSD has Super Pages, whereas Windows has Large Pages. A page is a chunk of RAM that is allocated to a process. A process may own more than one page depending on its memory requirements. The more memory a process needs the more pages that are allocated to it. The OS maintains a table of page allocation to processes. The smaller the page size, the bigger the table, the more time required to look up a page in that page table. Therefore, huge pages make it possible to use a large amount of memory with reduced overheads; fewer page lookups, fewer page faults, faster read/write operations through larger buffers. This results in improved performance.

PostgreSQL has support for bigger pages on Linux only. By default, Linux uses 4K of memory pages, so in cases where there are too many memory operations, there is a need to set bigger pages. Performance gains have been observed by using huge pages with sizes 2 MB and up to 1 GB. The size of Huge Page can be set boot time. You can easily check the huge page settings and utilization on your Linux box using cat /proc/meminfo | grep -i huge command.

Get HugePage Info – On Linux (only)

Note: This is only for Linux, for other OS this operation is ignored
$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

In this example, although huge page size is set at 2,048 (2 MB), the total number of huge pages has a value of 0. which signifies that huge pages are disabled.

Script to quantify Huge Pages

This is a simple script which returns the number of Huge Pages required. Execute the script on your Linux box while your PostgreSQL is running. Ensure that $PGDATA environment variable is set to PostgreSQL’s data directory.

Get Number of Required HugePages

pid=head -1 $PGDATA/postmaster.pid
echo “Pid:            $pid”
peak=grep ^VmPeak /proc/$pid/status | awk '{ print $2 }'
echo “VmPeak:            $peak kB”
hps=grep ^Hugepagesize /proc/meminfo | awk '{ print $2 }'
echo “Hugepagesize:   $hps kB”
echo Set Huge Pages:     $hp

The output of the script looks like this:

Script Output

Pid:            12737
VmPeak:         180932 kB
Hugepagesize:   2048 kB
Set Huge Pages: 88

The recommended huge pages are 88, therefore you should set the value to 88.

Set HugePages Command :

sysctl -w vm.nr_hugepages= 88

Check the huge pages now, you will see no huge page is in use (HugePages_Free = HugePages_Total).

Again Get HugePage Info – On Linux (only)

$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:      88
HugePages_Free:       88
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

Now set the parameter huge_pages “on” in $PGDATA/postgresql.conf and restart the server.

And Again Get HugePage Info – On Linux (only)

$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:      88
HugePages_Free:       81
HugePages_Rsvd:       64
HugePages_Surp:        0
Hugepagesize:       2048 kB

Now you can see that a very few of the huge pages are used. Let’s now try to add some data into the database.

Some DB Operations to Utilise HugePages

postgres=# CREATE TABLE foo(a INTEGER);
postgres=# INSERT INTO foo VALUES(generate_Series(1,10000000));
INSERT 0 10000000

Let’s see if we are now using more huge pages than before.

Once More Get HugePage Info – On Linux (only)

$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:      88
HugePages_Free:       18
HugePages_Rsvd:        1
HugePages_Surp:        0
Hugepagesize:       2048 kB

Now you can see that most of the huge pages are in use.

Note: The sample value for HugePages used here is very low, which is not a normal value for a big production machine. Please assess the required number of pages for your system and set those accordingly depending on your system’s workload and resources.

Now, Tuning Postgresql parameters and kernel parameters is not enough for good Postgresql performance there are many other things like

  • How you are making Query
  • Proper Indexing — For this you can follow indexing series on our blog
  • Proper partitioning and sharding accroding to business usecase
  • and many more .

Stay tuned to get more blogs on optimizing postgresql performance

Refrences : https://www.percona.com/blog/2018/08/29/tune-linux-kernel-parameters-for-postgresql-optimization/

Understanding Postgresql Indexes for Beginners – Part 2 (BTree MultiColumn )

This post is in continuation of Understanding Postgresql Indexes Series – Link for previous blog.

In last blog we understood about B-tree indexes and its use cases. Now we will look into Multi Column Postgresql Indexes. Lets Start :

Multi Column Index Btree :

In this we can include multiple columns into a single index eg: you have a student table with columns class_id , roll_no . With Multicolumn index we will be able to create a index (class_id,roll_no) rather than (class_id) or (roll_no).

The benefits of creating a multi column index is when we want to filter based on multiple column it works much faster. Understand it like for every value of class_id there is another btree of roll_no, now when you query like for class_id = ‘CLASS 8’ and roll_no > 10 then it will first go to ‘CLASS 8’ in this index and then from there there is a sub Btree index on roll_no which helps us to go to all roll_no > 8 very fast.

One of the important thing here is that it works best when we use ‘equal to’ constraint of the left most columns. eg: in our case our query was:

where class_id='CLASS 8' and roll_no > 10

Now in this case in our index :

create index cl_ro_idx on student(class_id,roll_no) 

In the index the left most column is class_id and we have put a equal to condition which is correct . Now lets say we have a index:

create index cl_ro_mar on student(class_id,roll_no,marks)

And we create a query which is like :

where class_id='CLASS_8' and roll_no > 10 and marks < 100

In this case lets see what happens , first the index is traversed for looking class_id = ‘CLASS 8’ same as last , not now we said roll_no > 10 for this all nodes for which roll_no was greater than 10 was traversed and also for every roll_no node there will be a tree of mark so all those trees are also traversed.

In generally it is preferred we traverse when left most variables are equal to types and last one is > or <

How to use?

Lets Create a Table:

create table student as select s as rollno, MOD(s,2)::text as class_id , s*random() as marks   from generate_Series(1,1000000) s;

Now lets create a index on class_id, roll_no,marks:

create index c_r_m_idx on student(class_id,rollno,marks);

Case 1 – Where all left indexes are equal to

explain analyze select * from student where class_id ='0' and rollno='89' and marks < 1;
                                                       QUERY PLAN                                                        
 Index Only Scan using c_r_m_idx on student  (cost=0.42..8.45 rows=1 width=14) (actual time=0.081..0.082 rows=0 loops=1)
   Index Cond: ((class_id = '0'::text) AND (rollno = 89) AND (marks < '1'::double precision))
   Heap Fetches: 0
 Planning time: 0.502 ms
 Execution time: 0.123 ms

Case 1 – Where we do not put equal to on left indexes

saarthi=# explain analyze select * from student where class_id ='0' and rollno>89 and marks < 1;
                                                       QUERY PLAN                                                       
 Gather  (cost=1000.00..13702.57 rows=49 width=14) (actual time=3.115..56.885 rows=5 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on student  (cost=0.00..12697.67 rows=20 width=14) (actual time=23.998..45.041 rows=2 loops=3)
         Filter: ((rollno > 89) AND (marks < '1'::double precision) AND (class_id = '0'::text))
         Rows Removed by Filter: 333332
 Planning time: 0.201 ms
 Execution time: 56.927 ms

Here you would see that in case where we are not putting equal to conditions we are getting Sequential Scans.

So be very careful on what type of queries you do.

Next Blog we will work on BRIN indexes and when to use Brin vs Btree.

Stay Tuned. Please Subscribe via email.

Tuning PostgreSQL Database Memory Configuration Parameters to Optimize Performance


Out of the box, the default PostgreSQL configuration is not tuned for any particular workload. Default values are set to ensure that PostgreSQL runs everywhere, with the least resources it can consume and so that it doesn’t cause any vulnerabilities. It has default settings for all of the database parameters. It is primarily the responsibility of the database administrator or developer to tune PostgreSQL according to their system’s workload. In this blog, we will establish basic guidelines for setting PostgreSQL database parameters to improve database performance according to workload.

Bear in mind that while optimizing PostgreSQL server configuration improves performance, a database developer must also be diligent when writing queries for the application. If queries perform full table scans where an index could be used or perform heavy joins or expensive aggregate operations, then the system can still perform poorly even if the database parameters are tuned. It is important to pay attention to performance when writing database queries.

Nevertheless, database parameters are very important too, so let’s take a look at the eight that have the greatest potential to improve performance , But before that it is important to understand the memory architecture  basic of postgres  : 

Memory in PostgreSQL can be classified into two categories:

  1. Local Memory area: It is allocated by each backend process for its own use.
  2. Shared memory area: It is used by all processes of a PostgreSQL server.

So Now i will explain tunable parameters and what should be the value of these tunable parameters   : . 


PostgreSQL uses its own buffer and also uses kernel buffered IO. That means data is stored in memory twice, first in PostgreSQL buffer and then kernel buffer. Unlike other databases, PostgreSQL does not provide direct IO. This is called double buffering. The PostgreSQL buffer is called shared_buffer which is the most effective tunable parameter for most operating systems. This parameter sets how much dedicated memory will be used by PostgreSQL for cache.

The default value of shared_buffer is set very low and you will not get much benefit from that. It’s low because certain machines and operating systems do not support higher values. But in most modern machines, you need to increase this value for optimal performance.

The recommended value is 25% of your total machine RAM. You should try some lower and higher values because in some cases we achieve good performance with a setting over 25%. The configuration really depends on your machine and the working data set. If your working set of data can easily fit into your RAM, then you might want to increase the shared_buffer value to contain your entire database, so that the whole working set of data can reside in cache. That said, you obviously do not want to reserve all RAM for PostgreSQL.

In production environments, it is observed that a large value for shared_buffer gives really good performance, though you should always benchmark to find the right balance.

Alternatively, while a larger shared_buffers value can increase performance in ‘read heavy’ use cases, having a large shared_buffer value can be detrimental for ‘write heavy’ use cases, as the entire contents of shared_buffers must be processed during writes.

Please note that the database server needs to be restarted after this change.

testdb=# SHOW shared_buffers;
(1 row)


Write-Ahead Logging (WAL) is a standard method for ensuring integrity of data. Much like in the shared_buffers setting, PostgreSQL writes WAL records into buffers and then these buffers are flushed to disk.

The default size of the buffer is set by the  wal_buffers setting- initially at 16MB. If the system being tuned has a large number of concurrent connections, then a higher value for  wal_buffers can provide better performance.


effective_cache_size has the reputation of being a confusing PostgreSQL settings, and as such, many times the setting is left to the default value.

The effective_cache_size value provides a ‘rough estimate’ of the number of how much memory is available for disk caching by the operating system and within the database itself, after taking into account what’s used by the OS itself and other applications.

This value is used only by the PostgreSQL query planner to figure out whether plans it’s considering would be expected to fit in RAM or not. As such, it’s a bit of a fuzzy number to define for general use cases.

A conservative value for  effective_cache_size  would be ½(50%) of the total memory available on the system. Most commonly, the value is set to 75% of the total system memory on a dedicated DB server, but can vary depending on the specific discrete needs on a particular server workload.

If the value for effective_cache_size  is too low, then the query planner may decide not to use some indexes, even if they would help greatly increase query speed.

So Conclusively , General recommendation for effective_cache_size is as follows.

  • Set the value to the amount of file system cache available. On UNIX/Linux like systems, add the free+cached numbers from free or top commands to get an estimate
  • If you don’t know, set the value to the 50%  or 75%  of total system memory specially if dedicated DB server


This configuration is used for complex sorting. If you have to do complex sorting then increase the value of work_mem for good results. In-memory sorts are much faster than sorts spilling to disk. Setting a very high value can cause a memory bottleneck for your deployment environment because this parameter is per user sort operation. Therefore, if you have many users trying to execute sort operations, then the system will allocate work_mem * total sort operations  for all users. Setting this parameter globally can cause very high memory usage. So it is highly recommended to modify this at the session level.

12345678910testdb=# SET work_mem TO “2MB”;testdb=# EXPLAIN SELECT * FROM bar ORDER BY bar.b;                                    QUERY PLAN                                     ———————————————————————————–Gather Merge  (cost=509181.84..1706542.14 rows=10000116 width=24)   Workers Planned: 4   ->  Sort  (cost=508181.79..514431.86 rows=2500029 width=24)         Sort Key: b         ->  Parallel Seq Scan on bar  (cost=0.00..88695.29 rows=2500029 width=24)(5 rows)

The initial query’s sort node has an estimated cost of 514431.86. Cost is an arbitrary unit of computation. For the above query, we have a work_mem of only 2MB. For testing purposes, let’s increase this to 256MB and see if there is any impact on cost.

123456789testdb=# SET work_mem TO “256MB”;testdb=# EXPLAIN SELECT * FROM bar ORDER BY bar.b;                                    QUERY PLAN                                     ———————————————————————————–Gather Merge  (cost=355367.34..1552727.64 rows=10000116 width=24)   Workers Planned: 4   ->  Sort  (cost=354367.29..360617.36 rows=2500029 width=24)         Sort Key: b         ->  Parallel Seq Scan on bar  (cost=0.00..88695.29 rows=2500029 width=24)

The query cost is reduced to 360617.36 from 514431.86 — a 30% reduction.

So Conclusively , Setting the value Higher alway results beer sorting and hashing , But setting it in local queries is always recommended . Set value high in queries where you expect high sorting otherwise low global value is good . 


maintenance_work_mem is a memory setting used for maintenance tasks. The default value is 64MB. Setting a large value helps in tasks like VACUUM, RESTORE, CREATE INDEX, ADD FOREIGN KEY and ALTER TABLE.

It is necessary to remember that when autovacuum runs, up to autovacuum_max_workers times this memory may be allocated, so be careful not to set the default value too high.

The default value of maintenance_work_mem = 64MB.

General recommendation to set maintenance_work_mem is as follows.

  • Set the value 10% of system memory, up to 1GB
  • Maybe you can set it even higher if you are having VACUUM problems

We can also temporarily increase this memory while creating indexes or at the time of dump restores or while performing full vacuums . 


This is used to enforce that commit will wait for WAL to be written on disk before returning a success status to the client. This is a trade-off between performance and reliability. If your application is designed such that performance is more important than the reliability, then turn off synchronous_commit. This means that there will be a time gap between the success status and a guaranteed write to disk. In the case of a server crash, data might be lost even though the client received a success message on commit. In this case, a transaction commits very quickly because it will not wait for a WAL file to be flushed, but reliability is compromised.


This parameter sets the maximum number of temporary buffers used by each database session. The session local buffers are used only for access to temporary tables. The setting of this parameter can be changed within individual sessions but only before the first use of temporary tables within the session.

PostgreSQL database utilizes this memory area for holding the temporary tables of each session, these will be cleared when the connection is closed.

The default value of temp_buffer = 8MB.


There are more parameters that can be tuned to gain better performance but those have less impact than the ones highlighted here. In the end, we must always keep in mind that not all parameters are relevant for all applications types. Some applications perform better by tuning a parameter and some don’t. Tuning PostgreSQL Database Parameters must be done for the specific needs of an application and the OS it runs on.  

Also ,  Performance tuning does not only depends on postgres configuration parameters ,  there are many system parameters also on which postgresql performance depends So, in my next blog i will explain some system parameters which can affect postgresql performance .