Understanding Postgresql Indexes for Beginners – Part 4 (BRIN Indexes Optimization)

In the last blog we learned about BRIN indexes. In this one we will go into a little more detail on optimizing BRIN indexes and the parameters that control them.

As we learned, a BRIN index stores summary information (the min and max values) for each range of physically adjacent pages, so that it becomes easy to find which pages need to be checked for a particular filter.

In this post we will look at the BRIN index update mechanism and at tuning the pages_per_range parameter.

When a BRIN index is updated

  • Index Creation
    • When the index is created, it builds a summary for every existing page range of the table.
  • Table Update
    • For page ranges that are already summarized, an insert or update whose value falls outside the stored min/max widens the summary as part of the same operation. Deletes do not shrink the summary; it only becomes tight again when the range is re-summarized (for example with REINDEX).
    • For new pages that are not yet summarized
      • Manual Update
        • BRIN does not summarize new page ranges at insert time. They are picked up the next time VACUUM processes the table, or you can summarize them on demand with the function:
          • select brin_summarize_new_values('<index name>'::regclass);
      • Auto Update
        • To have autovacuum summarize a page range as soon as it fills up and the next range starts, create the index with autosummarize turned on, like this:
          • create index c_t_brin_idx on calllog using brin(call_time) with (pages_per_range = 32, autosummarize = on);
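
The update options above can be sketched as a short session. This is a minimal sketch that assumes PostgreSQL 10 or later and the c_t_brin_idx index from this series; the block number 0 is chosen only for illustration.

```sql
-- Summarize every page range that has no summary yet.
-- The function returns the number of page ranges it summarized.
select brin_summarize_new_values('c_t_brin_idx'::regclass);

-- Summarize only the page range that contains heap block 0
-- (block number picked for illustration), if it is not summarized yet.
select brin_summarize_range('c_t_brin_idx'::regclass, 0);

-- Turn automatic summarization on for an index created without it.
alter index c_t_brin_idx set (autosummarize = on);
```

Running the first statement right after a bulk load is a simple way to avoid waiting for the next vacuum.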

How to optimize BRIN Indexes

The primary parameter for tuning a BRIN index is pages_per_range. It balances the number of pages that need to be scanned (because of the lossy nature of the index) against the number of index entries created and the index maintenance cost.

This parameter has to be tuned for your own workload. We will try to give some guiding rules here, but before that we should do some math and calculate the number of pages and the number of rows per page:

Calculate the number of pages:

select relpages from pg_class where relname = 'student';

The number of pages is 5406. (relpages is an estimate that is refreshed by VACUUM and ANALYZE, so run ANALYZE first if the table changed recently.)

Now let's count the total records:

select count(*) from student;

We have 1 million rows in total, so let's compute the number of records per page:

select 1000000/5406 as row_per_page;

The result is 184 rows per page in our case (integer division). It will be different for your table, so check it before proceeding.

The default value of pages_per_range is 128, so whenever a page range matches a query the system has to recheck every row in that range, which is at least 184 * 128 = 23552 rows.
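
The same arithmetic can be done in one query for several candidate settings. This is a rough sketch: it relies on the planner estimates in pg_class (so run ANALYZE first), the candidate values 16 to 128 are arbitrary, and the column aliases are made up for illustration.

```sql
-- Estimate rows per page, rows per page range and number of BRIN entries
-- for a few candidate pages_per_range values.
-- Note: relname is not unique across schemas; add a schema filter if needed.
select c.relpages,
       c.reltuples::bigint                                  as est_rows,
       floor(c.reltuples / nullif(c.relpages, 0))::int      as rows_per_page,
       v.p                                                  as pages_per_range,
       floor(c.reltuples / nullif(c.relpages, 0) * v.p)::bigint as rows_per_range,
       ceil(c.relpages::numeric / v.p)                      as index_entries
from pg_class c
cross join (values (16), (32), (64), (128)) as v(p)
where c.relname = 'student';
```

A smaller pages_per_range means fewer rows to recheck per matching range but more index entries to store and maintain.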

Now let's look at the rules. They need to be balanced against each other for best performance, because in databases everything depends on your needs:

  • Table should be large, something > 5 million rows
    • With less data the size advantage of BRIN shrinks, because the BTree would be small as well.
  • Data should primarily be INSERT only
    • If rows are constantly updated, new row versions can land in other pages, so the same value ends up inside several page ranges (eg: pages 1-10 cover 100-550 and pages 11-20 cover 250-800). A filter on a value such as 300 then has to read both ranges, and the filter slows down as more ranges overlap.
  • Data should be inserted IN ORDER
    • If rows arrive out of order on the indexed column, the min/max of neighbouring page ranges overlap in the same way as in the example above, and every overlapping range has to be scanned.
  • Searches generally return a large number of records, > 1000
    • If we are selecting a single value, BTree will generally outperform BRIN, but when we select thousands of records at a time BRIN starts to perform better.
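
One way to check the IN ORDER rule on an existing table is the correlation statistic that the planner keeps for each column. The sketch below assumes the calllog table and call_time column used in this series.

```sql
-- Refresh planner statistics so that pg_stats is up to date.
analyze calllog;

-- correlation is close to 1 (or -1) when the physical row order follows
-- the column order, and close to 0 when the values are scattered.
select tablename, attname, correlation
from pg_stats
where tablename = 'calllog'
  and attname = 'call_time';
```

A value near 1 or -1 suggests the page ranges will have narrow, non-overlapping min/max values, which is the situation where BRIN works well.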

All in all, you need to be very clear on the above points before deciding to use BRIN.

In the next blog we will move on to GIN indexes.

Subscribe to our blog for more in-depth tutorials.

Understanding Postgresql Indexes for Beginners – Part 3 (BRIN Index)

In the last blogs we covered the basics of BTree and multi column BTree indexes. In this blog we will look at Postgresql BRIN indexes.

BRIN Index

Block Range Index – a revolutionary idea contributed by Álvaro Herrera and first shipped in PostgreSQL 9.5.

The most fundamental difference is that, rather than storing the actual values and pointing to rows, this index stores range information about the pages in which the rows are stored.

For eg: Let's say we have a student table with 1000 roll nos. and these rows are stored in 100 pages, where pages 1-10 are adjacent, then 11-20, then 21-30 and so on. A BTree index would build a tree over all 1000 roll no. values, so it holds 1000 entries. BRIN instead stores only the min and max roll no. for each page range (let's say a page range is 10 pages): one entry (min and max) for pages 1-10, one entry for pages 11-20 and so on. Effectively it stores only 10 entries rather than 1000.
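
What BRIN keeps per page range can be imitated with a plain query. This is only an emulation for illustration, not how the index is built internally; the table name student_demo and the range size of 10 pages are assumptions.

```sql
-- Hypothetical demo table; the name and row count are made up for illustration.
create table student_demo as
select s as rollno, md5(s::text) as name
from generate_series(1, 100000) s;

-- Emulate BRIN: one (min, max) pair per range of 10 adjacent heap pages.
-- (ctid::text::point)[0] extracts the heap page number of each row;
-- page numbers start at 0, so page_range 0 covers pages 0-9.
select (ctid::text::point)[0]::int / 10 as page_range,
       min(rollno) as min_rollno,
       max(rollno) as max_rollno,
       count(*)    as rows_in_range
from student_demo
group by 1
order by 1;
```

Each output row corresponds to one summary entry, which is why the index stays so small compared to one entry per row.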

You can see the huge difference between the number of values stored in BRIN vs BTree. This makes for a huge saving in index storage, but I think you will have realized that this is a lossy index.

Lossy What?

What do we mean by lossy here? Let's go back to the example. The index stores only the min and max roll no. for pages 1-10 (which contain 100 roll nos.). Suppose you look for roll no. 10 and the min and max for pages 1-10 are 1 and 100. Because we only have the min and max, we still cannot be sure whether the value 10 exists in those pages. The system has to go through every row in pages 1-10 and check whether the roll no. exists. This is what we mean by lossy: the index can rule a page range out, but it cannot confirm that a value exists in it.
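
The recheck described above can be observed in a query plan. The following is a small sketch under assumptions: the table and index names are hypothetical, and the planner is nudged away from a sequential scan because the demo table is small.

```sql
-- Hypothetical demo table and index; both names are made up for illustration.
create table lossy_demo as
select s as rollno from generate_series(1, 100000) s;

create index lossy_demo_brin on lossy_demo
using brin (rollno) with (pages_per_range = 10);

analyze lossy_demo;

-- The table is small, so discourage a plain sequential scan for this session.
set enable_seqscan = off;

-- Look for a single value. The index can only point at a whole page range,
-- so every row in that range is fetched and rechecked against the filter.
explain analyze select * from lossy_demo where rollno = 10;

reset enable_seqscan;
```

In the resulting plan you would typically expect a Bitmap Heap Scan with a Recheck Cond line, plus a count of rows removed by the index recheck, which are the rows that sat in the matching page range but did not satisfy the filter.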

When to use BRIN

If your data is mostly insert only, like logs or history tables, and your business requirement is to query recent logs or logs in some date range, then BRIN is a great choice. It drastically reduces index size and index maintenance, and you would generally be searching for a range anyway.

Let's take a real life example. Create a table and load it with 8 million log rows:

CREATE TABLE calllog (call_time timestamp not null, call_result text , no_of_participant integer); 

INSERT INTO calllog (call_time, no_of_participant, call_result) SELECT CURRENT_TIMESTAMP + (g || ' minute')::interval, random() * 10, md5(g::text) FROM generate_series(1,8000000) as g;

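Now let's search for call logs from the last 5 hours. (As written below, the query puts the later bound first and every call_time we inserted lies in the future, so it matches no rows; that is why the plans show rows=0. The timings still show how much work each access path does to reach that answer.)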

explain analyze  select * from calllog  where call_time  between now()  and  now() - interval '5 hour';
                                                 QUERY PLAN                                                  
 Seq Scan on calllog  (cost=0.00..254764.68 rows=1 width=45) (actual time=1469.561..1469.561 rows=0 loops=1)
   Filter: ((call_time >= now()) AND (call_time <= (now() - '05:00:00'::interval)))
   Rows Removed by Filter: 8000000
 Planning time: 0.280 ms
 Execution time: 1469.596 ms

Let's see what happens if we use BTree here:

create index ct_idx on calllog(call_time);

explain analyze  select * from calllog  where call_time  between now()  and  now() - interval '5 hour';
                                                   QUERY PLAN                                                    
 Index Scan using ct_idx on calllog  (cost=0.44..8.46 rows=1 width=45) (actual time=0.040..0.041 rows=0 loops=1)
   Index Cond: ((call_time >= now()) AND (call_time <= (now() - '05:00:00'::interval)))
 Planning time: 2.214 ms
 Execution time: 0.086 ms

That is a huge difference: from 1469 ms down to 0.086 ms.

Now let's see what BRIN would do:

create index c_t_brin_idx  on calllog using brin(call_time) with (pages_per_range = 32,autosummarize = on);

drop  index ct_idx ;

 explain analyze  select * from calllog  where call_time  between now()  and  now() - interval '5 hour';
                                                       QUERY PLAN                                                        
 Bitmap Heap Scan on calllog  (cost=45.02..11385.57 rows=1 width=45) (actual time=2.259..2.260 rows=0 loops=1)
   Recheck Cond: ((call_time >= now()) AND (call_time <= (now() - '05:00:00'::interval)))
   ->  Bitmap Index Scan on c_t_brin_idx  (cost=0.00..45.02 rows=3423 width=0) (actual time=2.256..2.256 rows=0 loops=1)
         Index Cond: ((call_time >= now()) AND (call_time <= (now() - '05:00:00'::interval)))
 Planning time: 0.202 ms
 Execution time: 2.315 ms

You can see that 2.315 ms is still far less than 1469 ms, but more than 0.086 ms.

Why BRIN over BTree when BTree is faster?

1. Size of BRIN vs BTree
\di+ c_t_brin_idx 
                             List of relations
 Schema |     Name     | Type  |  Owner   |  Table  |  Size  | Description 
 public | c_t_brin_idx | index | postgres | calllog | 120 kB | 

 \di+ ct_idx 
                          List of relations
 Schema |  Name  | Type  |  Owner   |  Table  |  Size  | Description 
 public | ct_idx | index | postgres | calllog | 171 MB | 

You can see there is a huge difference in size. When there are a lot of tables and indexes in your system, so that you cannot keep this 171 MB in memory, BRIN becomes very powerful.

2. Index Maintenance – BTree indexes are costly to maintain because they change with every DML operation, and that work has to be done in the same transaction. BRIN offloads most of this work (summarizing new page ranges) to vacuuming.
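
For the size comparison in point 1, both indexes can also be compared in a single query instead of two \di+ calls. This sketch assumes both indexes from this post exist at the same time.

```sql
-- Compare the on-disk size of the BRIN and BTree indexes side by side.
select relname as index_name,
       pg_size_pretty(pg_relation_size(oid)) as index_size
from pg_class
where relname in ('c_t_brin_idx', 'ct_idx')
order by pg_relation_size(oid);
```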

By now I guess you understand much of the power of BRIN indexes and their benefits over BTree.

You can ask your questions in the comment section and we will try our best to answer for your specific case.

Next we will see the parameters to optimize BRIN and then move on to GIN indexes.

Please subscribe for more such in-depth blogs.

Understanding Postgresql Indexes for Beginners – Part 2 (BTree MultiColumn )

This post is a continuation of the Understanding Postgresql Indexes series.

In the last blog we learned about B-tree indexes and their use cases. Now we will look into multi column Postgresql indexes. Let's start:

Multi Column Index Btree :

Here we can include multiple columns in a single index. eg: you have a student table with columns class_id and roll_no. With a multi column index we can create an index on (class_id,roll_no) rather than on (class_id) or (roll_no) alone.

The benefit of a multi column index is that filtering on multiple columns works much faster. Think of it as if, for every value of class_id, there is another BTree of roll_no. When you query for class_id = 'CLASS 8' and roll_no > 10, it first goes to 'CLASS 8' in this index, and from there the sub BTree on roll_no lets it reach all roll_no > 10 very fast.

One important thing here is that it works best when we use an 'equal to' constraint on the left most columns. eg: in our case our query was:

where class_id='CLASS 8' and roll_no > 10

And our index was:

create index cl_ro_idx on student(class_id,roll_no) 

In this index the left most column is class_id and we have put an equal to condition on it, which is the ideal case. Now let's say we have an index:

create index cl_ro_mar on student(class_id,roll_no,marks)

And we write a query like:

where class_id='CLASS 8' and roll_no > 10 and marks < 100

Let's see what happens in this case. First the index is traversed looking for class_id = 'CLASS 8', same as before. But now we asked for roll_no > 10, so every entry with roll_no greater than 10 is traversed, and because the marks are ordered only within each roll_no, the marks < 100 condition has to be checked against every one of those entries instead of narrowing the scan.

In general, a multi column index works best when the left most columns are compared with equal to and only the last column used has a > or < condition.
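
The rule above can be tried out with a few EXPLAIN statements. This is a sketch on an empty, hypothetical table (student_idx_demo and cl_ro_mar_demo are made-up names), so the plans you get on real data will depend on table size and statistics.

```sql
-- Hypothetical table and index for illustration only.
create table student_idx_demo (class_id text, roll_no int, marks float8);
create index cl_ro_mar_demo on student_idx_demo (class_id, roll_no, marks);

-- Equality on the leading columns, range on the last one:
-- all three conditions together bound one contiguous slice of the index.
explain select * from student_idx_demo
where class_id = 'CLASS 8' and roll_no = 10 and marks < 100;

-- Range on the middle column: class_id and roll_no limit the slice that is
-- scanned; marks is checked inside that slice but does not shrink it.
explain select * from student_idx_demo
where class_id = 'CLASS 8' and roll_no > 10 and marks < 100;

-- No condition on the leading column: the index cannot be narrowed at all.
explain select * from student_idx_demo
where marks < 100;
```

Comparing the Index Cond and Filter lines of the three plans shows which conditions actually limit the part of the index that is read.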

How to use?

Let's create a table:

create table student as select s as rollno, MOD(s,2)::text as class_id , s*random() as marks   from generate_Series(1,1000000) s;

Now let's create an index on class_id, rollno, marks:

create index c_r_m_idx on student(class_id,rollno,marks);

Case 1 – Where all left columns use equal to

explain analyze select * from student where class_id ='0' and rollno='89' and marks < 1;
                                                       QUERY PLAN                                                        
 Index Only Scan using c_r_m_idx on student  (cost=0.42..8.45 rows=1 width=14) (actual time=0.081..0.082 rows=0 loops=1)
   Index Cond: ((class_id = '0'::text) AND (rollno = 89) AND (marks < '1'::double precision))
   Heap Fetches: 0
 Planning time: 0.502 ms
 Execution time: 0.123 ms

Case 2 – Where we do not put equal to on the left columns

explain analyze select * from student where class_id ='0' and rollno>89 and marks < 1;
                                                       QUERY PLAN                                                       
 Gather  (cost=1000.00..13702.57 rows=49 width=14) (actual time=3.115..56.885 rows=5 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on student  (cost=0.00..12697.67 rows=20 width=14) (actual time=23.998..45.041 rows=2 loops=3)
         Filter: ((rollno > 89) AND (marks < '1'::double precision) AND (class_id = '0'::text))
         Rows Removed by Filter: 333332
 Planning time: 0.201 ms
 Execution time: 56.927 ms

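Here you can see that when we do not put equal to conditions on the left columns we get a Sequential Scan. The range condition rollno > 89 matches almost every row, so only class_id = '0' narrows the index, and that still covers half of the table; the planner estimates that reading the whole table is cheaper.

So be very careful about what type of queries you run.

In the next blog we will work on BRIN indexes and when to use BRIN vs BTree.

Stay tuned. Please subscribe via email.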