Understanding Postgresql Indexes For Beginners- Part -6 (Postgresql Stats)

In the last blog we learned about partial indexes. In this blog let's see how we can take advantage of PostgreSQL stats to optimize our queries and find out which tables have no useful indexes.

PostgreSQL provides various statistics about the system, giving us lots of information on how we can optimize our databases.

In this blog we will discuss sequential scans on tables and how we should optimize them.

Query to find tables with sequential scans:



SELECT
    schemaname, relname,
    seq_scan, seq_tup_read,
    seq_tup_read / seq_scan as avg_seq_tup_read
FROM pg_stat_all_tables
WHERE seq_scan > 0
ORDER BY 5 DESC LIMIT 50;
 schemaname |       relname       | seq_scan | seq_tup_read | avg_seq_tup_read 
------------+---------------------+----------+--------------+------------------
 public     | tickets             |        3 |     24000000 |          8000000
 public     | student             |       18 |      5600010 |           311111
 pg_catalog | pg_class            |    22200 |      6789240 |              305
 pg_catalog | pg_index            |       17 |          503 |               29
 pg_catalog | pg_database         |    10553 |       168788 |               15
 public     | provider            |     1109 |         7763 |                7
 pg_catalog | pg_namespace        |       35 |          208 |                5
 pg_catalog | pg_attribute        |       22 |           96 |                4
 pg_catalog | pg_amproc           |        4 |            8 |                2
 pg_catalog | pg_am               |     6679 |         6679 |                1
 pg_catalog | pg_opclass          |        4 |            4 |                1
 pg_catalog | pg_authid           |        2 |            2 |                1
 pg_toast   | pg_toast_159046     |        1 |            0 |                0
 pg_toast   | pg_toast_159120     |        1 |            0 |                0
 pg_catalog | pg_publication      |     1109 |            0 |                0
 pg_catalog | pg_subscription_rel |        2 |            0 |                0

Now from the above view we can see that 3 sequential scans ran on the tickets table, each reading about 8,000,000 rows on average, which is also the row count of the table. This means there are bad queries or bad usage of the system, because the system is sequentially scanning the whole table.
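
If we also want to check whether such a table is simply missing (or not using) indexes, the same statistics views expose idx_scan alongside seq_scan. A minimal sketch (pg_stat_user_tables is a standard statistics view; the filter values here are just illustrative):

-- Tables that are read sequentially but whose indexes are never (or almost never) used
SELECT schemaname, relname, seq_scan, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
  AND coalesce(idx_scan, 0) = 0
ORDER BY seq_scan DESC;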

Now what we should do here is figure out which queries run against this table (via the PostgreSQL logs) and see what index we can create so that those queries don't do a sequential scan.
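
One way to find those queries is the pg_stat_statements extension (a sketch, assuming the extension is installed and loaded via shared_preload_libraries; on PostgreSQL versions before 13 the column is total_time instead of total_exec_time). Alternatively, set log_min_duration_statement and read the slow queries from the logs.

-- Statements that mention the tickets table, ordered by total execution time
SELECT query, calls, total_exec_time, rows
FROM pg_stat_statements
WHERE query ILIKE '%tickets%'
ORDER BY total_exec_time DESC
LIMIT 20;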


Understanding Postgresql Indexes For Beginners- Part -5 (Partial Indexes)

In the last blog we learned about GIN indexes. By now we have covered the basic index types we use in PostgreSQL.

Now let's look at one of the ways we can optimize our indexes for a business-specific use case.

Partial Indexing

Partial indexing means that we can create an index on a subset of the actual rows of the table.

This type of indexing is very useful in cases where queries hit that subset very often.

Let's take an example:

We have a ticketing table (subject text, status varchar) which contains the subject of each ticket and a status telling whether the ticket is OPEN or CLOSED. Now let's say in our case we primarily search open tickets only, while the number of CLOSED tickets keeps increasing constantly.

Let's assume our ticket table has 100 million rows, out of which 5 million are OPEN and 95 million are CLOSED.

Now if we create an index on the entire table (100 million rows), our index will also be large, and every time we want to search only OPEN tickets we are still loading that huge index.

With a partial index we can create a GIN (text search) index only on the OPEN ticket rows. The index then covers only OPEN tickets (just 5 million rows), so it is very small, efficient, and fast to load into memory.

How to use

Let's create the tickets table:

create table tickets(subject  tsvector , status varchar);


INSERT INTO tickets ( subject , status) SELECT md5(random()::text)::tsvector,CASE WHEN g < 500000 THEN 'OPEN' ELSE 'CLOSED' END FROM generate_series(1,8000000) as g;

Now let's create a complete index and a partial index:

create index t_s_idx on tickets using gin(subject);

Partial Index :

create index t_s_open_idx on tickets using gin(subject) where status='OPEN';

In general, the syntax is:

create index <index name> on <table> using gin(<column>) where <condition expression>;

Size difference between the complete and partial index

\di+ t_s_open_idx 
                            List of relations
 Schema |     Name     | Type  |  Owner   |  Table  | Size  | Description 
--------+--------------+-------+----------+---------+-------+-------------
 public | t_s_open_idx | index | postgres | tickets | 40 MB | 
(1 row)

saarthi=# \di+ t_s_idx 
                          List of relations
 Schema |  Name   | Type  |  Owner   |  Table  |  Size  | Description 
--------+---------+-------+----------+---------+--------+-------------
 public | t_s_idx | index | postgres | tickets | 662 MB | 
(1 row)

It is clearly visible that the index sizes are 662 MB vs 40 MB. And since the number of OPEN tickets will not keep growing the way CLOSED tickets do, the partial index stays roughly the same size and can mostly remain in memory.

When the partial index is used

A partial index is used only when the query also contains the condition that was specified while creating the index.

In our ticketing case the index has the condition status = 'OPEN', so the index is used only if the query also has the condition status = 'OPEN'.
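
For example (a sketch – 'invoice' is just an illustrative search term and the exact plans will differ on your data), a query that repeats the index condition can use the partial index, while one without it cannot:

-- Can use t_s_open_idx, because the WHERE clause also asserts status = 'OPEN'
explain select * from tickets where subject @@ to_tsquery('invoice') and status = 'OPEN';

-- Cannot use t_s_open_idx: nothing guarantees status = 'OPEN',
-- so the planner must use the full index t_s_idx (if it exists) or scan the table
explain select * from tickets where subject @@ to_tsquery('invoice');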

Most index types like B-tree, GIN and BRIN can be created as partial indexes. Using a partial index is one of the most powerful tools to speed up your system.

Understanding Postgresql Indexes For Beginners- Part -4 (Gin)

In the last blog we learned about BRIN index parameters and when to choose it. In this one we will learn the basics of the GIN index.

The GIN index is the one that can speed up your full-text search; let's understand how.

Gin Index

Gin Index (Generalized Inverted Index) – An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple words, it is a hashmap-like data structure that directs you from a word to a document or a web page. If you want more detail, read up on inverted indexes.

Now, in the case of PostgreSQL, we can create a GIN index on a tsvector column (a tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word), or on a plain text column using the pg_trgm operator class.

What GIN does is take all the distinct keys (the lexemes of a tsvector, or the trigrams of the text when pg_trgm is used) and map each key to the rows that contain it; a key points to multiple rows if multiple rows contain it.

In this way, when we search for a word (or, with trigrams, any three characters) in the column, the GIN index can return the result very fast.
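
Before the full demo below, here is a tiny sketch of what those index keys actually are: lexemes when the column is a tsvector, and trigrams when the pg_trgm operator class is used on plain text (this assumes the pg_trgm extension is available):

-- The lexemes a GIN index on a tsvector column would store as keys
select to_tsvector('english', 'The quick brown fox jumps');
-- result: 'brown':3 'fox':4 'jump':5 'quick':2   (the stop word "the" is dropped, "jumps" is normalized)

-- The trigrams a GIN index with gin_trgm_ops would store as keys
create extension if not exists pg_trgm;
select show_trgm('puneet');
-- returns the padded three-character substrings of 'puneet', e.g. '  p', ' pu', 'pun', 'une', 'nee', 'eet', 'et '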

Let’s see this in action:

First create a table:

create table student(name  text , address tsvector);

Now let’s add some data:

INSERT INTO student ( name , address) SELECT md5(random()::text), md5(random()::text)::tsvector FROM generate_series(1,800000) as g;

Now let's check searches on the name (text) and address (tsvector) fields:

 explain analyze select * from student where name ilike '%puneet%';
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..98897.03 rows=1887 width=64) (actual time=316.414..319.296 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on student  (cost=0.00..97708.33 rows=786 width=64) (actual time=307.882..307.883 rows=0 loops=3)
         Filter: (name ~~* '%puneet%'::text)
         Rows Removed by Filter: 266667
 Planning time: 0.089 ms
 Execution time: 319.353 ms



explain analyze select * from student where address @@ to_tsquery('gurgaon'); 
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..154940.76 rows=3956 width=74) (actual time=515.955..518.074 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on student  (cost=0.00..153545.16 rows=1648 width=74) (actual time=507.235..507.235 rows=0 loops=3)
         Filter: (address @@ to_tsquery('gurgaon'::text))
         Rows Removed by Filter: 266667
 Planning time: 0.125 ms
 Execution time: 518.102 ms
(8 rows)

Now you see it is taking 319 ms and 518 ms.

Let’s create indexes :

create index a_g_idx on student using gin(address);

CREATE EXTENSION IF NOT EXISTS pg_trgm;

create index name_g_idx on student using gin(name gin_trgm_ops);

Now let’s see the time:

explain analyze select * from student where address @@ to_tsquery('gurgaon'); 
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on student  (cost=55.25..13838.72 rows=4000 width=74) (actual time=0.067..0.068 rows=0 loops=1)
   Recheck Cond: (address @@ to_tsquery('gurgaon'::text))
   ->  Bitmap Index Scan on a_g_idx  (cost=0.00..54.25 rows=4000 width=0) (actual time=0.063..0.064 rows=0 loops=1)
         Index Cond: (address @@ to_tsquery('gurgaon'::text))
 Planning time: 0.472 ms
 Execution time: 0.140 ms


explain analyze select * from student where name ilike '%puneet%';
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on student  (cost=60.62..373.33 rows=80 width=74) (actual time=0.032..0.033 rows=0 loops=1)
   Recheck Cond: (name ~~* '%puneet%'::text)
   ->  Bitmap Index Scan on name_g_idx  (cost=0.00..60.60 rows=80 width=0) (actual time=0.029..0.030 rows=0 loops=1)
         Index Cond: (name ~~* '%puneet%'::text)
 Planning time: 6.547 ms
 Execution time: 0.083 ms

You see a huge difference in timings: from 518 ms to 0.14 ms, and from 319 ms to 0.083 ms.

Now we have seen the power of the GIN index in PostgreSQL and what its text search can provide.

Of course there are various parameters to configure and various other text-search use cases that can be handled with GIN.

Here our purpose was to understand the basics; in the future we will go into the details as well.

In the next blog we will work on partial indexes. Stay tuned and subscribe to our blog for more such posts.

Understanding Postgresql Indexes for Beginners – Part 4 (BRIN Indexes Optimization)

In the last blog we learned about BRIN indexes; in this one we will go into a little more detail on optimizing BRIN indexes and the parameters used for that.

As we learned, a BRIN index basically stores summary information (min and max) for a particular page range (preferably adjacent pages), so that it becomes easy to find which pages to cross-check for a particular filter.

In this post we will understand the BRIN index update mechanism and how to tune the page-range parameter.

When the BRIN index is updated

  • Index Creation
    • When the index is created, it creates the summary for all the pages of the table.
  • Table Update
    • For pages which are already summarized, if any entry is added / updated / removed, the index is updated within the same transaction.
    • For new pages which are not summarized
      • Manual Update
        • By default BRIN does not auto-update for new pages; for this one needs to run the function –
          • select brin_summarize_new_values('<index name>'::regclass);
      • Auto Update
        • On autovacuum the system will add the summary info for those blocks as well; for this we need to create the index with autosummarize on, like this –
          • create index c_t_brin_idx on calllog using brin(call_time) with (pages_per_range = 32,autosummarize = on);

How to optimize BRIN Indexes

The primary parameter for optimizing BRIN indexes is pages_per_range. It balances the number of pages which need to be scanned (because of the lossy nature) against the number of index entries created and the index maintenance cost.

This parameter needs to be tuned to individual needs. We will try to provide some guiding rules here, but before that let's do some math and calculate the number of pages and rows per page.

Calculate the number of pages:

select relpages from pg_class where relname = 'student';
 relpages 
----------
     5406

Now the number of pages is 5406

Now let's calculate the total number of records:

select count(*) from student;
  count  
---------
 1000000

Now we have 1 million rows in total; let's compute the number of records per page:

select 1000000/5406 as row_per_page;
 row_per_page 
--------------
          184

Now we see that the number of rows per page is 184 in our case. This will differ in your case, so before proceeding check it on your own table.

The default value of pages_per_range is 128, so the minimum number of rows our index needs to filter for every query is 184 * 128 = 23552.
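
With that math in hand, one way to experiment is simply to build the BRIN index with a smaller (or larger) pages_per_range and compare. A sketch (the index name is illustrative and we assume the rollno column from the earlier student table; the right value depends entirely on your data and queries):

-- Default is pages_per_range = 128, i.e. roughly 184 * 128 = 23552 rows per summary entry here.
-- A smaller range gives finer summaries (fewer rows to recheck) at the cost of a bigger index.
create index student_rollno_brin_idx on student using brin(rollno) with (pages_per_range = 32);

-- Then compare the index size with \di+ and run EXPLAIN ANALYZE on your typical range queries.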

Now let's look at the rules. They need to be balanced against each other for best performance, as in databases everything depends on your needs:

  • The table should be large – something > 5 million rows
    • With less data the benefit of the small index size is limited, and a B-tree would also be small.
  • Data should primarily be INSERT only
    • If we are constantly updating the data in such a way that a single value occurs in multiple page ranges (e.g. pages 1-10 cover the range 100-550 and pages 11-20 cover 250-800), the system needs to check both page ranges, which slows down the filter as the number of pages grows.
  • Data should be inserted in order
    • If we insert data such that, for the indexed column, a single value ends up in multiple page ranges (same example as above), the system again has to check both page ranges, which slows down the filter as the number of pages grows.
  • Searches generally return a large number of records (> 1000)
    • If we are selecting a single value a B-tree will generally outperform BRIN, but if we are selecting records in the thousands, BRIN starts to perform better as the count grows.

All in all, one needs to be very clear about the above points before deciding to use BRIN.

In the next blog we will move on to GIN indexes.

Subscribe to our blog for more in-depth tutorials.

Understanding Postgresql Indexes for Beginners – Part 3 (BRIN Index)

In the last blogs we understood the basics of B-tree and multi-column B-tree indexes. In this blog we will look at PostgreSQL BRIN indexes.

BRIN Index

Block Range Index – a revolutionary idea first proposed in 2015 by Álvaro Herrera.

The most fundamental difference in this index is that, rather than storing the actual values and pointing to rows, it stores range information about the pages in which the rows are stored.

For example: let's say we have a student table with 1000 roll numbers, and these rows are stored in 100 pages, with pages 1-10 adjacent, then 10-20, then 20-30 and so on. If we use a B-tree index it would build a tree over all 1000 roll number values, so the number of entries in the tree would be 1000. What BRIN does instead is store the minimum and maximum roll number per page range (let's say a page range is 10 pages), so the BRIN index stores one entry (min and max) for pages 1-10, then one entry for pages 10-20, and so on; effectively it stores only 10 entries rather than 1000.

You can see the huge difference between the number of values stored in BRIN vs B-tree. This makes for a huge improvement in index storage, but I think you will have realized that this is a lossy index.

Lossy What?

What do we mean by lossy here? Let's go back to the example: the index only stores the min and max roll number for pages 1-10 (which contain 100 roll numbers). Now if you search for, say, roll no. 10, and the min and max for pages 1-10 are 1 and 100, then because the index only has the min and max we still do not know whether the value 10 actually exists in those pages. For this the system needs to go through every row in pages 1-10 and check whether the roll number exists or not. This is what we mean by lossy: the index does not confirm whether the value exists or not.

When to use BRIN

If your data is mostly insert-only, like logs or history, and your business requirement is to query recent entries or some date range, then BRIN is a great index choice: it drastically reduces index size and index maintenance, and you will generally be searching for a range anyway.

Let's take a real-life example and create a table with 8 million rows of log data:

CREATE TABLE calllog (call_time timestamp not null, call_result text , no_of_participant integer); 

INSERT INTO calllog (call_time, no_of_participant, call_result) SELECT CURRENT_TIMESTAMP + (g || ' minute')::interval, (random() * 10)::integer, md5(g::text) FROM generate_series(1,8000000) as g;


Now let's search for some call logs between now() and now() - 5 hours:

explain analyze  select * from calllog  where call_time  between now()  and  now() - interval '5 hour';
                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Seq Scan on calllog  (cost=0.00..254764.68 rows=1 width=45) (actual time=1469.561..1469.561 rows=0 loops=1)
   Filter: ((call_time >= now()) AND (call_time <= (now() - '05:00:00'::interval)))
   Rows Removed by Filter: 8000000
 Planning time: 0.280 ms
 Execution time: 1469.596 ms

Let's see what happens if we use a B-tree here:

create index ct_idx on calllog(call_time);


explain analyze  select * from calllog  where call_time  between now()  and  now() - interval '5 hour';
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Index Scan using ct_idx on calllog  (cost=0.44..8.46 rows=1 width=45) (actual time=0.040..0.041 rows=0 loops=1)
   Index Cond: ((call_time >= now()) AND (call_time <= (now() - '05:00:00'::interval)))
 Planning time: 2.214 ms
 Execution time: 0.086 ms

You see a huge difference: from 1469 ms to 0.086 ms.

Now lets see what brin would do:

create index c_t_brin_idx  on calllog using brin(call_time) with (pages_per_range = 32,autosummarize = on);

drop  index ct_idx ;

 explain analyze  select * from calllog  where call_time  between now()  and  now() - interval '5 hour';
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on calllog  (cost=45.02..11385.57 rows=1 width=45) (actual time=2.259..2.260 rows=0 loops=1)
   Recheck Cond: ((call_time >= now()) AND (call_time <= (now() - '05:00:00'::interval)))
   ->  Bitmap Index Scan on c_t_brin_idx  (cost=0.00..45.02 rows=3423 width=0) (actual time=2.256..2.256 rows=0 loops=1)
         Index Cond: ((call_time >= now()) AND (call_time <= (now() - '05:00:00'::interval)))
 Planning time: 0.202 ms
 Execution time: 2.315 ms

You see 2.315 ms – still much less than 1469 ms, but greater than 0.086 ms.

Why BRIN over B-tree when B-tree is faster?

  1. Size of BRIN vs BTree
\di+ c_t_brin_idx 
                             List of relations
 Schema |     Name     | Type  |  Owner   |  Table  |  Size  | Description 
--------+--------------+-------+----------+---------+--------+-------------
 public | c_t_brin_idx | index | postgres | calllog | 120 kB | 


 \di+ ct_idx 
                          List of relations
 Schema |  Name  | Type  |  Owner   |  Table  |  Size  | Description 
--------+--------+-------+----------+---------+--------+-------------
 public | ct_idx | index | postgres | calllog | 171 MB | 


You can see there is a huge difference in size. When there are lots of tables and indexes in your system, such that you cannot keep that 171 MB in memory, BRIN becomes very powerful.

2. Index Maintenance – B-tree indexes are costly to maintain, as they change on every DML operation and that change has to happen in the same transaction, while BRIN offloads this work to vacuuming.

By now I guess you understand the power of BRIN indexes and their benefits over B-tree.

You can ask your questions in the comment section; we will try our best to answer for your specific cases.

Next we will see the parameters to optimize BRIN and then move on to GIN indexes.

Please subscribe for more such in-depth blogs.

Understanding Postgresql Indexes for Beginners – Part 2 (BTree MultiColumn )

This post is a continuation of the Understanding Postgresql Indexes series (see the previous blog).

In the last blog we learned about B-tree indexes and their use cases. Now we will look into multi-column PostgreSQL indexes. Let's start:

Multi Column Index Btree :

With a multi-column index we can include multiple columns in a single index. E.g. you have a student table with columns class_id and roll_no; with a multi-column index we can create an index on (class_id, roll_no) rather than just (class_id) or (roll_no).

The benefit of creating a multi-column index is that filtering on multiple columns becomes much faster. Think of it like this: for every value of class_id there is another B-tree of roll_no. Now when you query for class_id = 'CLASS 8' and roll_no > 10, it first goes to 'CLASS 8' in the index, and from there a sub-B-tree on roll_no helps us get to all roll_no > 10 very fast.

One important thing here is that it works best when we use an 'equal to' constraint on the leftmost columns. E.g. in our case the query was:

where class_id='CLASS 8' and roll_no > 10

Now in this case in our index :

create index cl_ro_idx on student(class_id,roll_no) 

In the index the left most column is class_id and we have put a equal to condition which is correct . Now lets say we have a index:

create index cl_ro_mar on student(class_id,roll_no,marks)

And we create a query which is like :

where class_id='CLASS 8' and roll_no > 10 and marks < 100

In this case let's see what happens. First the index is traversed looking for class_id = 'CLASS 8', same as before. But now we said roll_no > 10, so all nodes for which roll_no is greater than 10 are traversed, and for every roll_no node there is also a sub-tree of marks, so all those sub-trees are traversed as well.

In general it is preferred that the leftmost columns use 'equal to' conditions and only the last one uses > or <.

How to use?

Lets Create a Table:

create table student as select s as rollno, MOD(s,2)::text as class_id , s*random() as marks   from generate_Series(1,1000000) s;

Now lets create a index on class_id, roll_no,marks:

create index c_r_m_idx on student(class_id,rollno,marks);

Case 1 – where all the leading index columns use 'equal to'

explain analyze select * from student where class_id ='0' and rollno='89' and marks < 1;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using c_r_m_idx on student  (cost=0.42..8.45 rows=1 width=14) (actual time=0.081..0.082 rows=0 loops=1)
   Index Cond: ((class_id = '0'::text) AND (rollno = 89) AND (marks < '1'::double precision))
   Heap Fetches: 0
 Planning time: 0.502 ms
 Execution time: 0.123 ms

Case 2 – where not all the leading index columns use 'equal to'

saarthi=# explain analyze select * from student where class_id ='0' and rollno>89 and marks < 1;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..13702.57 rows=49 width=14) (actual time=3.115..56.885 rows=5 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on student  (cost=0.00..12697.67 rows=20 width=14) (actual time=23.998..45.041 rows=2 loops=3)
         Filter: ((rollno > 89) AND (marks < '1'::double precision) AND (class_id = '0'::text))
         Rows Removed by Filter: 333332
 Planning time: 0.201 ms
 Execution time: 56.927 ms

Here you can see that when we do not use 'equal to' conditions on the leading columns, we end up with sequential scans.

So be very careful about the type of queries you run.

In the next blog we will work on BRIN indexes and when to use BRIN vs B-tree.

Stay Tuned. Please Subscribe via email.

Understanding Postgresql Indexes For Beginners- Part -1 (BTree)

In this series we will understand PostgreSQL indexes and when to use them, with sample cases.

Also we will be comparing certain indexes to understand which one to use when.

Let’s Begin:

B-Tree

The most common and most widely used index in PostgreSQL, or in any SQL database.

What a B-tree does is create a self-balancing tree (with multiple child nodes, not just 2) in which the leaf nodes point to the exact row locations.

Also, in a B-tree all leaf node pointers point to adjacent nodes using a doubly linked list. (This primarily helps in getting sorted data very fast, as we just reach the first element and then traverse further; we will see this in action.)

Let's see a diagram to understand: the leaf nodes are marked in white, they point to each other and contain row addresses, and you can see they are in sorted order.

How to use?

First create a table with dummy data to understand; here I am creating 1 million rows:

create table student as select s as rollno, md5(random()::text) as name  from generate_Series(1,1000000) s;

Use Case 1 : Query for equal

Now let's query this to understand the B-tree benefits:

explain analyze select * from student where rollno=9090;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..15375.79 rows=5292 width=36) (actual time=10.074..98.045 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on student  (cost=0.00..13846.59 rows=2205 width=36) (actual time=54.934..83.723 rows=0 loops=3)
         Filter: (rollno = 9090)
         Rows Removed by Filter: 333333
 Planning time: 0.197 ms
 Execution time: 98.084 ms
(8 rows)

Now create index:

create index roll_idx on student(rollno);

Lets try again:

explain analyze select * from student where rollno=9090;
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Index Scan using roll_idx on student  (cost=0.42..8.44 rows=1 width=37) (actual time=0.106..0.108 rows=1 loops=1)
   Index Cond: (rollno = 9090)
 Planning time: 0.426 ms
 Execution time: 0.154 ms

Now you can see that the total time taken with the index is roughly 600 times less (0.154 ms vs 98 ms), compared to the parallel scan it was doing before (in older versions of Postgres it would have been a plain sequential scan).

Use Case 2: Query for range

explain analyze select * from student where rollno > 9090 and rollno < 10008;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Index Scan using roll_idx on student  (cost=0.42..41.52 rows=905 width=37) (actual time=0.017..0.605 rows=917 loops=1)
   Index Cond: ((rollno > 9090) AND (rollno < 10008))
 Planning time: 0.215 ms
 Execution time: 0.715 ms
(4 rows)

Now you can see this works well for ranges too.

No, it is not checking all the values in the range one by one; it is using the property of the leaf nodes pointing to each other in sorted order.
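
You can see the same leaf-chain property at work for sorted output: an ORDER BY on the indexed column can be satisfied by walking the leaves in order instead of doing a separate sort. A quick sketch (your exact plan and timings will differ):

-- Expected to show an Index Scan using roll_idx with no separate Sort node,
-- because the B-tree leaves are already linked in sorted order.
explain analyze select * from student order by rollno limit 10;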

Use Case 3: Query on a string column

For strings, first create an index and then check:

create index name_idx on student(name);
explain analyze select * from student  where name = 'b170e15823193100056e09725aec94c4';
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Index Scan using name_idx on student  (cost=0.42..8.44 rows=1 width=37) (actual time=0.115..0.116 rows=0 loops=1)
   Index Cond: (name = 'b170e15823193100056e09725aec94c4'::text)
 Planning time: 0.390 ms
 Execution time: 0.151 ms

See this works perfectly well.

When to use?

  • When you frequently filter a table on this column
  • When you also want data sorted in that column's order
  • Date, integer and string columns are good candidates for this
  • DO NOT USE it if you are searching with LIKE or ILIKE type queries. There are different indexes for that (in the next blogs)
  • DO NOT JUST CREATE AN INDEX ON EVERY COLUMN. This has a huge overhead for Postgres (we will discuss this later in the series)

Next – Multi Column B-Tree

Please subscribe for more such in-depth blogs.

How and when and what to log ?

We as developers understand the importance of coding, but most of us do not know the right way to log.

While reading others' code you must have seen various patterns of logging:

  • Log everything
    • on start of any function call
    • on every function call
    • before every return statement
  • Don’t log anything
    • just don’t log
  • Log only Exceptions
    • Log only when exceptions occur

I think everybody has seen all these types of developers. Let's understand when to log in different scenarios, but before that, let's look at the purpose of logging:

Purpose of Logging

  • Provides insight into the flow of your code: what happened, and what parameters were passed
  • You can virtually replay the entire flow of your code and understand the problem while debugging any unexpected behaviour in your code.

Let's go scenario by scenario.

In the following, I want you to see the reasoning behind when to log; that may help you decide in your own scenarios.

  • HTTP service call / any third party
    • To debug any service hit, you will require the following:
      • The parameters passed to it: log them at info, but always check whether your framework or library code (like REST filters or Tomcat logs) already does this in your environment; if so, do not log them again. Generally it is good practice to handle this at the framework level rather than having everybody take care of it.
      • If you are writing pure functions (not using global variables, not modifying the variables provided), then mostly you will not need to log anything inside those functions (a blog on pure functions is planned), as you can derive most of the behaviour from the input.
      • If a specific decision will completely change the flow of the code, we should generally log the decision and the reason at info.
      • Log any exception which happens in the system at error level.
      • Any parameter you read from the system: generally log it at fine so you can cross-check it later; do not log blindly, though this is not a hard line if the function is called a limited number of times.
      • Calling another HTTP service –
        • Generally you can log the parameters here at info; it is easier to debug the parameters passed than to look into the other service's logs, especially if it runs on another machine and there is no central logging.
  • Timer or scheduler jobs
    • For a timer, the first thing is: if it runs every few seconds, log the time at which it started.
    • The rest remains more or less the same as above.
  • Service-to-service calls (within the same process, e.g. Spring services)
    • When one service calls another, the calling service should generally not log, assuming the responsibility lies with the called service. The rest remains the same as in the HTTP case above.
  • Any HTTP connections
    • Generally we should log the parameters passed over HTTP at info, and also check whether some parameters should not be printed because they are internal and must not leak.
  • Microservice-to-microservice communication
    • Assuming you are using some trace id: if the other service's logs are not available, generally log the request and response, keeping in mind not to print business-critical parameters (like passwords of any sort).
  • DB queries
    • Before running a query, we should generally log the parameters we are passing at info.
  • When to log fatal
    • When something very unexpected happens in the code, like the DB configuration not being provided or another API URL not being configured, without which the use case cannot work at all – log it with fatal.
  • When to log fine
    • Generally, any time a decision is taken in the code you can log it at fine. The configuration read by the system can be logged at fine, as can internal data structures that are created or modified, or any lifecycle event like a DB connection being closed. But do not add lots and lots of logs; it can kill your system.

Some of the cross-cutting concerns we should know about

  • Flow id or trace id:
    • Any logs which are part of a single flow (i.e. a single thread doing it synchronously) should carry the same flow id or trace id.
    • Even for flows which start from logically the same flow but then call an async API (@Async in Spring), the same flow id should be passed along with some additional info.
    • For the distributed case you can check the link.
  • What to log:
    • Use some identifier like a user id, or the id of the object for which a set of API or function calls is being made, so that it becomes easy to figure out what the call was made for.

We can't cover every case, but I think the above examples help you choose the right mix.

Also, please be careful when you log, as too much logging can kill your system. Do read the link.

Understand the intuition for databases

This article is for beginners who are starting their journey in software development and for whom "database" still sounds like jargon.

In this post we will cover the intuition behind databases; the lingo and the types will be covered in future posts.

Let's begin by understanding through an example. Let's define our problem:

  • We are an online site which allows users to register on our website, and registered users can see other registered users and send messages to them.
  • Now let's assume we are building this in an era when no database software is available. So what do we do?

First understand our business requirement:

  • When a user comes to our site and registers with a user id and password, we should store this somewhere on persistent storage (a hard disk, available after restart).
  • We should be able to retrieve this user info from persistent storage so that when somebody comes to log in with a user id and password, we can check whether it is correct.

Naive Solution 1 :

We will write each user and their password in a file called "users" in a sequential manner; let's look at the file:

Simple, right? Now if somebody wants to check a user id and password, we just read the entire file, check each row, split it on the semicolon ';', and check whether the given user id matches the first part of the line and the given password matches the second part.

But there are certain problems in the above structure (I know you have already guessed them):

  • It will be very slow if the user count is large (as one needs to check all rows until the user matches)
  • What if somebody puts a semicolon in their user name? (Easy, we won't allow it – but still, that is an extra check.)

So lets discuss again :

What we can do is create multiple files, one per alphabet letter: user_a for user names starting with a, user_b for user names starting with b, and so on. This way we reduce the amount we have to read by a significant factor.

But if there are thousands of users starting with "a", it is still a problem. Hmm, still stuck.

Lets think again:

What if we build a binary-search-tree-like structure in a file, where each node points to the row in the users file (a pointer to the exact location of that user, so that we can read it directly)? I am showing a very basic version for understanding below:

Now when we do this, it solves the problem to some extent.

But there are many other use cases which you will come across:

  1. deleting a user – now the index also needs to be maintained
  2. backing up these files in case the hard disk gets corrupted
  3. storing the user's last name as well
  4. ……

These are only a few of the issues. The main point is that all companies need these things, so database software tries to solve the data persistence and query problems so that companies can focus on their business.

I think this gives you the basic intuition about databases. Now go read about the various types of databases that are available and the purposes they serve.

Want to be a better programmer – Read , Read , Read – But How?

If one wants to become a better programmer, one thing is for sure: one needs to understand others' code.

As beginners, when we start writing code and using very basic library classes like List or Set, we get stuck on basic questions like:

  • What this class do?
  • How to use this class?

Now this comes because we are missing one important trick:

  • First associate a purpose with the class by reading its definition and its name.
  • Once you have the purpose, you will automatically be able to work out what the functions in this class should be.
  • The same goes for functions: first associate a purpose.

Now if you start reading the code by understanding the purpose of the class, you will be able to understand and use that class.

The next thing you should do, once you have understood the purpose, is:

  • look at their internal implementations
  • understand the data structures or variables they have declared
  • try to reason about why they have been declared like that
  • think of any pros / cons and alternative implementations you can come up with.

I bet if one starts doing this in the initial part of their programming career, it will help them:

  • to reason about other libraries
  • to understand new libraries faster
  • to choose the right data structures while writing new libraries.

Lets see a demo of the above theory:

We all know there is the Collections framework in Java, which looks something like this:

Java Collection Structure

Now let's start by assigning a purpose to each type and see how things follow automatically:

  • Collection interface – it says: one can have some objects inside me.
    • so the functions in this interface should be
      • one function to add an object
      • one function to remove an object
  • List interface – it says it is a collection but has an ordering for objects, meaning you get them back in the order you put them in
    • so the extra functions in this interface should be
      • get the object at a particular position: get(int i)
      • add at a particular position
  • Set interface – it also says it is a collection, but it contains only unique objects and does not care about ordering
    • so the functions in this interface include
      • contains, to check whether an object already exists, as it will not add an object twice

You see, just by understanding the purpose of a class we could easily work out its functions and their purposes.
I am leaving the next part – looking at the internal implementations – to you. I will write the next blog about this.
I am leaving the next part to you guys on looking at the internal implementation. Will make a next blog for this.