Automatic indexes recommendations in PostgreSQL

In our last blog we learned about  Need and Usage of Hypothetical indexes in Postgresql  . Now we can check easily in live environment also if some particular index will be helpful or not , but  how we get to know which index to test  .  It requires in-depth knowledge of indexing and experience in Postgresql , But in PostgreSql we can get automatic recommendation of indexes for specific queries by using three extensions hypog,pg_stat_statements,pg_qualstats

Now Let’s move directly to the practical part how we can use  this feature in postgres  : 

i am doing experiments on Postgres10 installed on Centos7 .


yum install pg_qualstats10.x86_64
  • Change the following in postgresql.conf and restart postgresql 
shared_preload_libraries = 'pg_stat_statements, pg_qualstats' 
  • Create following extensions : 
testdb=# CREATE EXTENSION hypopg ;
testdb=# CREATE EXTENSION pg_stat_statements ;
testdb=# CREATE EXTENSION pg_qualstats;
  • Set sample rate of pgqual stats to 1 in postgresql.conf  . This rate define how frequently monitor  and analyze the queries . Value ‘1’ represents that keep track of all queries
pg_qualstats.sample_rate = '1'
  • Create the function which will be used to detect usable indexes : 
CREATE OR REPLACE FUNCTION find_usable_indexes()
    l_queries     record;
    l_querytext     text;
    l_idx_def       text;
    l_bef_exp       text;
    l_after_exp     text;
    hypo_idx      record;
    l_attr        record;
    /* l_err       int; */
    CREATE TABLE IF NOT EXISTS public.idx_recommendations (queryid bigint, 
    query text, current_plan jsonb, recmnded_index text, hypo_plan jsonb);
    FOR l_queries IN
    SELECT t.relid, t.relname, t.queryid, t.attnames, t.attnums, 
    pg_qualstats_example_query(t.queryid) as query
         SELECT qs.relid::regclass AS relname, qs.relid AS relid, qs.queryid, 
         string_agg(DISTINCT attnames.attnames,',') AS attnames, qs.attnums
         FROM pg_qualstats_all qs
         JOIN pg_qualstats q ON q.queryid = qs.queryid
         JOIN pg_stat_statements ps ON q.queryid = ps.queryid
         JOIN pg_amop amop ON amop.amopopr = qs.opno
         JOIN pg_am ON amop.amopmethod = pg_am.oid,
               SELECT pg_attribute.attname AS attnames
               FROM pg_attribute
               JOIN unnest(qs.attnums) a(a) ON a.a = pg_attribute.attnum 
               AND pg_attribute.attrelid = qs.relid
               ORDER BY pg_attribute.attnum) attnames,     
         LATERAL unnest(qs.attnums) attnum(attnum)
               WHERE NOT 
                       SELECT 1
                       FROM pg_index i
                       WHERE i.indrelid = qs.relid AND 
                       (arraycontains((i.indkey::integer[])[0:array_length(qs.attnums, 1) - 1], 
                        qs.attnums::integer[]) OR arraycontains(qs.attnums::integer[], 
                        (i.indkey::integer[])[0:array_length(i.indkey, 1) + 1]) AND i.indisunique)))
                       GROUP BY qs.relid, qs.queryid, qs.qualnodeid, qs.attnums) t
                       GROUP BY t.relid, t.relname, t.queryid, t.attnames, t.attnums                   
        /* RAISE NOTICE '% : is queryid',l_queries.queryid; */
        execute 'explain (FORMAT JSON) '||l_queries.query INTO l_bef_exp;
        execute 'select hypopg_reset()';
        execute 'SELECT indexrelid,indexname FROM hypopg_create_index(''CREATE INDEX on '||l_queries.relname||'('||l_queries.attnames||')'')' INTO hypo_idx;      
        execute 'explain (FORMAT JSON) '||l_queries.query INTO l_after_exp;
        execute 'select hypopg_get_indexdef('||hypo_idx.indexrelid||')' INTO l_idx_def;
        INSERT INTO public.idx_recommendations (queryid,query,current_plan,recmnded_index,hypo_plan) 
        VALUES (l_queries.queryid,l_querytext,l_bef_exp::jsonb,l_idx_def,l_after_exp::jsonb);        
    END LOOP;    
        execute 'select hypopg_reset()';
$$ LANGUAGE plpgsql;


  • Now Let’s make a table with 10 Crores rows as following : 
testdb=# create table orders as select s as orderno, md5(random()::text) as orderitem , now() as order_created from generate_Series(1,100000000) s;
SELECT 100000000
  • Now lets run a query on which we want to check if need to make any index  
testdb=# select * from orders where orderno = 80000  ;
-[ RECORD 1 ]-+---------------------------------
orderno       | 80000
orderitem     | 03b41c2f32d99e9a597010608946c4c6
order_created | 2021-05-22 17:52:21.435936+05:30
  • Now run following queries to find out  which indexes are recommend by this extension and what are the improvement percentage by applying these indexes hypothetically 

testdb=#           select find_usable_indexes();
(1 row)

testdb=# select b.query, a.recmnded_index,round((((a.current_plan->0->'Plan'->>'Total Cost')::numeric-(hypo_plan->0->'Plan'->>'Total Cost')::numeric)*100/(a.current_plan->0->'Plan'->>'Total Cost')::numeric),2) as percent_improvd FROM idx_recommendations a JOIN pg_stat_statements b ON a.queryid = b.queryid WHERE round((((current_plan->0->'Plan'->>'Total Cost')::numeric-(hypo_plan->0->'Plan'->>'Total Cost')::numeric)*100/(current_plan->0->'Plan'->>'Total Cost')::numeric),2) > 0 order by 3 desc ;
                        query                        |                          recmnded_index                          | percent_improvd 
 select * from orders where orderno = $1             | CREATE INDEX ON public.orders USING btree (orderno)              |          100.00

Above  analysis was internally done by creating the indexes hypothetically not by making real indexes .

Please Note here that do not fully rely on automatic index recommendation . Yes we have no doubt it is very very useful feature  , but please also check logically why these recommended indexes are useful and should you really create it or not. 

You can read PostgreSQL Index Tutorial Series for basic in depth understanding of indexes in postgresql . 

Now , You can try the explained feature with more complex queries and comment on this article explaining your result  with your queries . So let’s experiment and comment . 

Refrences :


Hypothetical Indexes in PostgreSQL – Need and Usage

In this Blog We will first cover what are hypothetical indexes and what is need of these type of indexes and then secondly  we shall see the practical usage of these indexes . 

Need of Hypothetical Indexes : 

As the name suggests   these are not real indexes,  these are hypothetical indexes i.e.. They are virtual indexes which PostgreSQL  query planner does not consider when running queries . 

 Now the question arises where these Hypothetical Indexes  are useful  ? ? ? 

First Let’s discuss one scenario  , we have a large table which is currently in the production environment  and we need to make some index on live db and we are not sure whether that index will be useful or not , we even don’t know if  by making that index our production environment may be down !!!

So , Solution of above problem will be following : 

  • Lets ignore the risk and make  the index on live table which can result the following scenario 
    • first of all it will take lots of time depending on data present in table 
    • Live queries may get affected badly if we are not sure if the index we are making will increase or decrease the cost .
    • We also do not know the size of the index it may be too large which can again impact the production database server . 
  • Another solution is to replicate the production database to the local dev environment and then apply all the hits and try there and then apply at the production environment .  it seems a very safe and effective approach in almost all cases but this will take too much of time in setting up the things and testing with actual scenario .  
  • Third Solution is Hypothetical Indexes  as this functionality will create imaginary indexes not real indexes .  But  there are some things to note about these indexes :
    •  it creates an index in our connection’s private memory. None of the catalog tables nor the original tables are actually touched. 
    • The only way to see if we can benefit from that index is by running an EXPLAIN <QUERY>.
    •  If you wish to run an EXPLAIN ANALYZE that runs that SQL and provides the run time stats, it would not be able to use that hypothetical index. This is because a hypothetical index does not exist in reality.
    • Currently these indexes will work on BTREE ONLY . However you can try if it works on other type of indexes .

Usage of Hypothetical indexes : 


I am using PostgreSQL10 on CentOS7 . 

  • Download hypopg by following command : 
  • Then install it on CentOS7 :
yum install hypopg_10-1.1.4-1.rhel7.x86_64.rpm
  • Now create extension using following query 
testdb=# CREATE EXTENSION hypopg ;
  • On Creating extension following functions will be created . 
testdb=#  select proname from pg_proc where proname ilike '%hyp%';


Now Let’s make a table with 10 Crores rows as following : 

testdb=# create table orders as select s as orderno, md5(random()::text) as orderitem , now() as order_created from generate_Series(1,100000000) s;
SELECT 100000000

Now check COST of a query by running explain : 

testdb=# explain select * from orders where orderno > 80000  order by order_created desc  limit 100 ;
                                   QUERY PLAN                                   
 Limit  (cost=3600088.98..3600089.23 rows=100 width=44)
   ->  Sort  (cost=3600088.98..3688095.27 rows=35202513 width=44)
         Sort Key: order_created DESC
         ->  Seq Scan on orders  (cost=0.00..2254674.25 rows=35202513 width=44)
               Filter: (orderno > 80000)
(5 rows)

Now create the Hypothetical Index : 

testdb=# SELECT * FROM hypopg_create_index('create index order_created_idx on orders(order_created)');
 indexrelid |             indexname             
      24797 | <24797>btree_orders_order_created
(1 row)

Now again do Explain to check if above index may be useful or not : 

testdb=# explain select * from orders where orderno > 80000  order by order_created desc  limit 100 ;
                                                          QUERY PLAN                                                           
 Limit  (cost=0.07..4.29 rows=100 width=45)
   ->  Index Scan Backward using "<24797>btree_orders_order_created" on orders  (cost=0.07..4215496.19 rows=99917459 width=45)
         Filter: (orderno > 80000)
(3 rows)

from both Explain command output we can clearly see the diffrence in cost and can also see that planner is using newly created hypothetical index .

We can Drop the index as follows : 

testdb=# select * from hypopg_drop_index(24797);
(1 row)

We can also check the estimated size of index created virtually as follows : 

testdb=# SELECT * FROM hypopg_create_index('create index order_created_idx on orders(order_created)');
 indexrelid |             indexname             
      24798 | <24798>btree_orders_order_created
(1 row)

testdb=# select * from pg_size_pretty(hypopg_relation_size(24798));
 2990 MB
(1 row)

Now lets create actual index and see what will be the actual size

testdb=# create index order_created_idx on orders(order_created);
testdb=# \di+ order_created_idx
                               List of relations
 Schema |       Name        | Type  |  Owner   | Table  |  Size   | Description 
 public | order_created_idx | index | postgres | orders | 2142 MB | 
(1 row)

As seen estimated and actual size is comparable .

I  Hope it clears the usage if hypothetical index in PostgreSQL .  In one of our blogs we learned about why index is not working  and also how to check on which tables index needed .  

In our future blogs we will talk about how you will get to know about exact index you need to make in database   . 

Stay Tuned to hello worlds . . . 

References :

Load Average in Linux Servers – Confusion Solved

Regarding Load Average  shown in Linux there are many confusion around the world  like 

  • What is load average show in top command ?
  • What this values represent ? 
  • When it will be high when low ?
  • When to consider it as critical ? 
  • In which scenarios it can increase ? 

In this Blog we will talk about the answers of all these  . 

What are these three values shown in above image  ? 

The three numbers represent averages over progressively longer periods of time (one, five, and fifteen-minute averages)  and that lower numbers are better. Higher numbers represent a problem or an overloaded machine . 

Now before getting into what is good value, what is bad value , what are the reasons which can affect these values  , We will understand these on  a machine with one single-core processor. 

The traffic analogy

A single-core CPU is like a single lane of traffic. Imagine you are a bridge operator … sometimes your bridge is so busy there are cars lined up to cross. You want to let folks know how traffic is moving on your bridge. A decent metric would be how many cars are waiting at a particular time. If no cars are waiting, incoming drivers know they can drive across right away. If cars are backed up, drivers know they’re in for delays.

So, Bridge Operator, what numbering system are you going to use? How about:

  • 0.00 means there’s no traffic on the bridge at all. In fact, between 0.00 and 1.00 means there’s no backup, and an arriving car will just go right on.
  • 1.00 means the bridge is exactly at capacity. All is still good, but if traffic gets a little heavier, things are going to slow down.
  • over 1.00 means there’s backup. How much? Well, 2.00 means that there are two lanes worth of cars total — one lane’s worth on the bridge, and one lane’s worth waiting. 3.00 means there are three lanes worth total — one lane’s worth on the bridge, and two lanes’ worth waiting. Etc.

Like the bridge operator, you’d like your cars/processes to never be waiting. So, your CPU load should ideally stay below 1.00. Also, like the bridge operator, you are still ok if you get some temporary spikes above 1.00 … but when you’re consistently above 1.00, you need to worry.

So you’re saying the ideal load is 1.00?

Well, not exactly. The problem with a load of 1.00 is that you have no headroom. In practice, many sysadmins will draw a line at 0.70:

But now a days we many multiple cores systems or multiple processors system  .

Got a quad-processor system? It’s still healthy with a load of 3.00.

On a multi-processor system, the load is relative to the number of processor cores available. The “100% utilization” mark is 1.00 on a single-core system, 2.00, on a dual-core, 4.00 on a quad-core, etc.

If we go back to the bridge analogy, the “1.00” really means “one lane’s worth of traffic”. On a one-lane bridge, that means it’s filled up. On a two-lane bridge, a load of 1.00 means it’s at 50% capacity — only one lane is full, so there’s another whole lane that can be filled.

Same with CPUs: a load of 1.00 is 100% CPU utilization on a single-core box. On a dual-core box, a load of 2.00 is 100% CPU utilization.

Which leads us to two new Rules of Thumb:

  • The “number of cores = max load” Rule of Thumb: on a multicore system, your load should not exceed the number of cores available.
  • The “cores is cores” Rule of Thumb: How the cores are spread out over CPUs doesn’t matter. Two quad-cores == four dual-cores == eight single-cores. It’s all eight cores for these purposes.

But What to extract from here that if load is going beyond number of Cores are we in crunch of Cores ? 

Well  Not exactly , For this We need to further debug or analyse TOP command data to come to a conclusion  . 

In above output , This coloured part shows CPU used by user process(us) and by system process(sy) . Now if these values are around 99-100%, it means there is crunch of cpu cores on your system or some process is consuming more CPU . So, in this case either increase cores or optimise you application which is consuming more CPU . 

Now let’s take another scenario : 

In above image  , coloured parts shows amount of time CPU is waiting in doing Input/Output(I/O) .  So say if this values is going above say 80% ,then also load average on server will increase  . It means either you disk read/write speed is slow  or your applications is reading/writing too much on your system beyond system capability . In this case either diagnose your hard disk read/write speed or check why your application is reading/writing so much . 

Now let’s take one more scenario : 

If values Above coloured output goes beyond certain limit , it means softirq(si) are consuming cpu  . it can be due to network/disk interrupts . Either they are not getting enough CPU   or there is some misconfiguration in you ethernet ports due to which  interrupts are not able to handle packets receiving or transmitting  . These types of problem occurs more on VM environment rather than physical machine  . 

Now , Lest take one last scenario : 

This above part will help you in case of Virtual Machine Environment  . 

If %st increases to certain limit say it is remaining more than 50% , it means that you are getting half of CPU time from base machine and someone else is consuming you CPU time as you may be on shared CPU infra  . In above case also Load can increase  . 

I hope it covers all major scenarios in which load average can increase  . 

Now there are many points open here like how to check read/write speed, how to check which application is consuming more CPU , How to check which interrupts are causing problem . For all of these stay tuned to hello worlds .

Refrences :

Checklists – System is Hacked – Part 2 – Preventive Steps for Infra (OS Hardening)

In last article we described List of Checks which can determine if system is compromised or hacked .  In this article we will talk about preventive steps (specially infra related) can be taken care to avoid hacking or to make system more secure  . There are many directions in which we can secure our application  as follows : 

  • OS hardening (Infra Level Security)
  • Secure Coding guidelines
  • Encryption Of Sensitive Data  . 
  • Ensure No Vulnerability exists in system . 

In this Blog we will be concerned about OS hardening (Infra Level Security) in Linux systems(CentOS/Redhat). We will Cover Other parts in Future Blogs .

Now Let’s go to the System Part. It has following things to be taken care of : 

  • SSH Configuration :  
    • In linux based system SSH default port is 22 . This Defaut port should be changed to some unused port to enhance security .  
    • Use SSH Protocol 2 Version
    • Ensure SSH X11 forwarding is disabled
  • Port Configuration at Firewall :  Generally , in any application there are many applications running on set of servers and each running on some different ports , Say for example : 
    • Application server at  8080 port
    • Database Server at 5432 port

So,  as in above Case Users need to login through 8080 port so only this port should be opened for public as Database needs to interact generally with application server so 5432 port should be allowed from Application Server’s IP  . 

  • Multi Factor Authentication for SSH should be enabled   —  For setting up Google Authentication on CentOS or Redhat you can follow the link
  • Root login for any server must be disabled 
  • Server Login Policies 
    • Ensure password expiration is 365 days or less 
    • Ensure minimum days between password changes is 7 or more 
    • Ensure password expiration warning days is 7 or more 
    • Ensure inactive password lock is 30 days or less 
    • Ensure Password should be strong enough when user change its password
  • Application and Database should be on different Servers  :  this is because of that if due to some vulnerability  application hacked than acces to database in that case is protected  . 
  • Regular package updates   :  Configure Auto update or regularly update packages on all configured servers .
  • Tune Network Kernel Parameters :
    • IP forwarding should be disabled on all servers  
      • Do the following entry in sysctl.conf 
        • net.ipv4.ip_forward = 0
    • Packets Redirecting  should be diabled on all servers . 
      • Do the following entry in sysctl.conf 
        • net.ipv4.conf.all.send_redirects = 0
        • net.ipv4.conf.default.send_redirects = 0
  • Selinux should be enabled and configured . 
  • Antivirus must  be installed on all servers . 

All Above are basic minimum checklists which should be applied to all the servers in any production environment . For implementing in-depth OS Hardening specially for CentOS based Systems , one need to follow the latest CIS CentOS Benchmarklatest CIS

You can also check the below benchmark list from CIS for CentOS hardening : Below doc also explain how to implement things on CentOS .

For Other Operating Systems/Technologies follow the CIS benchmark link.  

In Our Future blog we will explain other parts like Secure Code guidelines , Encryption , VAPT scan etc  to make system more secure . 

Stay tuned . 

Tune Linux Kernel Parameters For PostgreSQL Optimization and better System Performance


In my previous Article i explained  Tuning PostgreSQL Database Memory Configuration Parameters to Optimize Performance and as i said  Database performance does not only depend on Postgresql configurations but also on system parameters .Poorly configured OS kernel parameters can cause degradation in database server performance. Therefore, it is imperative that these parameters are configured according to the database server and its workload. In this article  i will be talking about centos/redhat  linux system specially . 


I will start the article with small story where on one of our client huge amount of writes were there and customer have provided us 200 GB of RAM for that dedicated database server , So there were no problem of resources. 

Now what was happening that after sometime system loads get increased so much and on debugging we found no special query  around the time when load increases . Somewhere over internet we found if we clear the system cache  regularly then issue will be resolved . 

We then schedule a cron to clear system cache after some regular interval and issue got resolved . 

Now  the question is why issue was not coming after this ? ? ?

And the Answer is  that due to large cache size as we have so much of ram available   lots of data is collected in RAM (in GB’s) and when this whole data flushes out on to the disk   ,  system load becomes high at that time 

So from that we came to know that we also need to tune some system parameters also to optimize system and database(postgresql) performance . 

In above case we tuned vm.dirty_background_ratio and vm.dirty_ratio , these two system(os) parameters to resolve the issue . 

Kernel parameters Tuning

Now what values we set for these above two parameters described in story and what are all other  important Linux kernel parameters that can affect database server performance which we can tune are described as follows : 

vm.dirty_background_ratio / vm.dirty_background_bytes

The vm.dirty_background_ratio is the percentage of memory filled with dirty pages that need to be flushed to disk. Flushing is done in the background. The value of this parameter ranges from 0 to 100; however, a value lower than 5 may not be effective and some kernels do not internally support it. The default value is 10 on most Linux systems. You can gain performance for write-intensive operations with a lower ratio, which means that Linux flushes dirty pages in the background.

You need to set a value of vm.dirty_background_bytes depending on your disk speed.

There are no “good” values for these two parameters since both depend on the hardware. However, setting vm.dirty_background_ratio to 5 and vm.dirty_background_bytes to 25% of your disk speed improves performance by up to ~25% in most cases.

vm.dirty_ratio / dirty_bytes

This is the same as vm.dirty_background_ratio / dirty_background_bytes except that the flushing is done in the foreground, blocking the application. So vm.dirty_ratio should be higher than vm.dirty_background_ratio. This will ensure that background processes kick in before the foreground processes to avoid blocking the application, as much as possible. You can tune the difference between the two ratios depending on your disk IO


vm.swappiness is another kernel parameter that can affect the performance of the database. This parameter is used to control the swappiness (swapping pages to and from swap memory into RAM) behavior on a Linux system. The value ranges from 0 to 100. It controls how much memory will be swapped or paged out. Zero means disable swap and 100 means aggressive swapping.

You may get good performance by setting lower values.

Setting a value of 0 in newer kernels may cause the OOM Killer (out of memory killer process in Linux) to kill the process. Therefore, you can be on the safe side and set the value to 1 if you want to minimize swapping. The default value on a Linux system is 60. A higher value causes the MMU (memory management unit) to utilize more swap space than RAM, whereas a lower value preserves more data/code in memory.

A smaller value is a good bet to improve performance in PostgreSQL.

vm.overcommit_memory / vm.overcommit_ratio

Applications acquire memory and free that memory when it is no longer needed. But in some cases, an application acquires too much memory and does not release it.  This can invoke the OOM killer. Here are the possible values for vm.overcommit_memory parameter with a description for each:

  1. Heuristic overcommit, Do it intelligently (default); based kernel heuristics
  2. Allow overcommit anyway
  3. Don’t over commit beyond the overcommit ratio.


vm.overcommit_ratio is the percentage of RAM that is available for overcommitment. A value of 50% on a system with 2 GB of RAM may commit up to 3 GB of RAM.

A value of 2 for vm.overcommit_memory yields better performance for PostgreSQL. This value maximizes RAM utilization by the server process without any significant risk of getting killed by the OOM killer process. An application will be able to overcommit, but only within the overcommit ratio, thus reducing the risk of having OOM killer kill the process. Hence a value to 2 gives better performance than the default 0 value. However, reliability can be improved by ensuring that memory beyond an allowable range is not overcommitted. It avoids the risk of the process being killed by OOM-killer.

On systems without swap, one may experience a problem when vm.overcommit_memory is 2.

Generally speaking almost all applications which uses more memory depends on this , For example  , In Redis setting this value 1 is best . 

Turn On Huge Pages

Linux, by default, uses 4K memory pages, BSD has Super Pages, whereas Windows has Large Pages. A page is a chunk of RAM that is allocated to a process. A process may own more than one page depending on its memory requirements. The more memory a process needs the more pages that are allocated to it. The OS maintains a table of page allocation to processes. The smaller the page size, the bigger the table, the more time required to look up a page in that page table. Therefore, huge pages make it possible to use a large amount of memory with reduced overheads; fewer page lookups, fewer page faults, faster read/write operations through larger buffers. This results in improved performance.

PostgreSQL has support for bigger pages on Linux only. By default, Linux uses 4K of memory pages, so in cases where there are too many memory operations, there is a need to set bigger pages. Performance gains have been observed by using huge pages with sizes 2 MB and up to 1 GB. The size of Huge Page can be set boot time. You can easily check the huge page settings and utilization on your Linux box using cat /proc/meminfo | grep -i huge command.

Get HugePage Info – On Linux (only)

Note: This is only for Linux, for other OS this operation is ignored
$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

In this example, although huge page size is set at 2,048 (2 MB), the total number of huge pages has a value of 0. which signifies that huge pages are disabled.

Script to quantify Huge Pages

This is a simple script which returns the number of Huge Pages required. Execute the script on your Linux box while your PostgreSQL is running. Ensure that $PGDATA environment variable is set to PostgreSQL’s data directory.

Get Number of Required HugePages

pid=head -1 $PGDATA/
echo “Pid:            $pid”
peak=grep ^VmPeak /proc/$pid/status | awk '{ print $2 }'
echo “VmPeak:            $peak kB”
hps=grep ^Hugepagesize /proc/meminfo | awk '{ print $2 }'
echo “Hugepagesize:   $hps kB”
echo Set Huge Pages:     $hp

The output of the script looks like this:

Script Output

Pid:            12737
VmPeak:         180932 kB
Hugepagesize:   2048 kB
Set Huge Pages: 88

The recommended huge pages are 88, therefore you should set the value to 88.

Set HugePages Command :

sysctl -w vm.nr_hugepages= 88

Check the huge pages now, you will see no huge page is in use (HugePages_Free = HugePages_Total).

Again Get HugePage Info – On Linux (only)

$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:      88
HugePages_Free:       88
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

Now set the parameter huge_pages “on” in $PGDATA/postgresql.conf and restart the server.

And Again Get HugePage Info – On Linux (only)

$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:      88
HugePages_Free:       81
HugePages_Rsvd:       64
HugePages_Surp:        0
Hugepagesize:       2048 kB

Now you can see that a very few of the huge pages are used. Let’s now try to add some data into the database.

Some DB Operations to Utilise HugePages

postgres=# CREATE TABLE foo(a INTEGER);
postgres=# INSERT INTO foo VALUES(generate_Series(1,10000000));
INSERT 0 10000000

Let’s see if we are now using more huge pages than before.

Once More Get HugePage Info – On Linux (only)

$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
HugePages_Total:      88
HugePages_Free:       18
HugePages_Rsvd:        1
HugePages_Surp:        0
Hugepagesize:       2048 kB

Now you can see that most of the huge pages are in use.

Note: The sample value for HugePages used here is very low, which is not a normal value for a big production machine. Please assess the required number of pages for your system and set those accordingly depending on your system’s workload and resources.

Now, Tuning Postgresql parameters and kernel parameters is not enough for good Postgresql performance there are many other things like

  • How you are making Query
  • Proper Indexing — For this you can follow indexing series on our blog
  • Proper partitioning and sharding accroding to business usecase
  • and many more .

Stay tuned to get more blogs on optimizing postgresql performance

Refrences :

Tuning PostgreSQL Database Memory Configuration Parameters to Optimize Performance


Out of the box, the default PostgreSQL configuration is not tuned for any particular workload. Default values are set to ensure that PostgreSQL runs everywhere, with the least resources it can consume and so that it doesn’t cause any vulnerabilities. It has default settings for all of the database parameters. It is primarily the responsibility of the database administrator or developer to tune PostgreSQL according to their system’s workload. In this blog, we will establish basic guidelines for setting PostgreSQL database parameters to improve database performance according to workload.

Bear in mind that while optimizing PostgreSQL server configuration improves performance, a database developer must also be diligent when writing queries for the application. If queries perform full table scans where an index could be used or perform heavy joins or expensive aggregate operations, then the system can still perform poorly even if the database parameters are tuned. It is important to pay attention to performance when writing database queries.

Nevertheless, database parameters are very important too, so let’s take a look at the eight that have the greatest potential to improve performance , But before that it is important to understand the memory architecture  basic of postgres  : 

Memory in PostgreSQL can be classified into two categories:

  1. Local Memory area: It is allocated by each backend process for its own use.
  2. Shared memory area: It is used by all processes of a PostgreSQL server.

So Now i will explain tunable parameters and what should be the value of these tunable parameters   : . 


PostgreSQL uses its own buffer and also uses kernel buffered IO. That means data is stored in memory twice, first in PostgreSQL buffer and then kernel buffer. Unlike other databases, PostgreSQL does not provide direct IO. This is called double buffering. The PostgreSQL buffer is called shared_buffer which is the most effective tunable parameter for most operating systems. This parameter sets how much dedicated memory will be used by PostgreSQL for cache.

The default value of shared_buffer is set very low and you will not get much benefit from that. It’s low because certain machines and operating systems do not support higher values. But in most modern machines, you need to increase this value for optimal performance.

The recommended value is 25% of your total machine RAM. You should try some lower and higher values because in some cases we achieve good performance with a setting over 25%. The configuration really depends on your machine and the working data set. If your working set of data can easily fit into your RAM, then you might want to increase the shared_buffer value to contain your entire database, so that the whole working set of data can reside in cache. That said, you obviously do not want to reserve all RAM for PostgreSQL.

In production environments, it is observed that a large value for shared_buffer gives really good performance, though you should always benchmark to find the right balance.

Alternatively, while a larger shared_buffers value can increase performance in ‘read heavy’ use cases, having a large shared_buffer value can be detrimental for ‘write heavy’ use cases, as the entire contents of shared_buffers must be processed during writes.

Please note that the database server needs to be restarted after this change.

testdb=# SHOW shared_buffers;
(1 row)


Write-Ahead Logging (WAL) is a standard method for ensuring integrity of data. Much like in the shared_buffers setting, PostgreSQL writes WAL records into buffers and then these buffers are flushed to disk.

The default size of the buffer is set by the  wal_buffers setting- initially at 16MB. If the system being tuned has a large number of concurrent connections, then a higher value for  wal_buffers can provide better performance.


effective_cache_size has the reputation of being a confusing PostgreSQL settings, and as such, many times the setting is left to the default value.

The effective_cache_size value provides a ‘rough estimate’ of the number of how much memory is available for disk caching by the operating system and within the database itself, after taking into account what’s used by the OS itself and other applications.

This value is used only by the PostgreSQL query planner to figure out whether plans it’s considering would be expected to fit in RAM or not. As such, it’s a bit of a fuzzy number to define for general use cases.

A conservative value for  effective_cache_size  would be ½(50%) of the total memory available on the system. Most commonly, the value is set to 75% of the total system memory on a dedicated DB server, but can vary depending on the specific discrete needs on a particular server workload.

If the value for effective_cache_size  is too low, then the query planner may decide not to use some indexes, even if they would help greatly increase query speed.

So Conclusively , General recommendation for effective_cache_size is as follows.

  • Set the value to the amount of file system cache available. On UNIX/Linux like systems, add the free+cached numbers from free or top commands to get an estimate
  • If you don’t know, set the value to the 50%  or 75%  of total system memory specially if dedicated DB server


This configuration is used for complex sorting. If you have to do complex sorting then increase the value of work_mem for good results. In-memory sorts are much faster than sorts spilling to disk. Setting a very high value can cause a memory bottleneck for your deployment environment because this parameter is per user sort operation. Therefore, if you have many users trying to execute sort operations, then the system will allocate work_mem * total sort operations  for all users. Setting this parameter globally can cause very high memory usage. So it is highly recommended to modify this at the session level.

12345678910testdb=# SET work_mem TO “2MB”;testdb=# EXPLAIN SELECT * FROM bar ORDER BY bar.b;                                    QUERY PLAN                                     ———————————————————————————–Gather Merge  (cost=509181.84..1706542.14 rows=10000116 width=24)   Workers Planned: 4   ->  Sort  (cost=508181.79..514431.86 rows=2500029 width=24)         Sort Key: b         ->  Parallel Seq Scan on bar  (cost=0.00..88695.29 rows=2500029 width=24)(5 rows)

The initial query’s sort node has an estimated cost of 514431.86. Cost is an arbitrary unit of computation. For the above query, we have a work_mem of only 2MB. For testing purposes, let’s increase this to 256MB and see if there is any impact on cost.

123456789testdb=# SET work_mem TO “256MB”;testdb=# EXPLAIN SELECT * FROM bar ORDER BY bar.b;                                    QUERY PLAN                                     ———————————————————————————–Gather Merge  (cost=355367.34..1552727.64 rows=10000116 width=24)   Workers Planned: 4   ->  Sort  (cost=354367.29..360617.36 rows=2500029 width=24)         Sort Key: b         ->  Parallel Seq Scan on bar  (cost=0.00..88695.29 rows=2500029 width=24)

The query cost is reduced to 360617.36 from 514431.86 — a 30% reduction.

So Conclusively , Setting the value Higher alway results beer sorting and hashing , But setting it in local queries is always recommended . Set value high in queries where you expect high sorting otherwise low global value is good . 


maintenance_work_mem is a memory setting used for maintenance tasks. The default value is 64MB. Setting a large value helps in tasks like VACUUM, RESTORE, CREATE INDEX, ADD FOREIGN KEY and ALTER TABLE.

It is necessary to remember that when autovacuum runs, up to autovacuum_max_workers times this memory may be allocated, so be careful not to set the default value too high.

The default value of maintenance_work_mem = 64MB.

General recommendation to set maintenance_work_mem is as follows.

  • Set the value 10% of system memory, up to 1GB
  • Maybe you can set it even higher if you are having VACUUM problems

We can also temporarily increase this memory while creating indexes or at the time of dump restores or while performing full vacuums . 


This is used to enforce that commit will wait for WAL to be written on disk before returning a success status to the client. This is a trade-off between performance and reliability. If your application is designed such that performance is more important than the reliability, then turn off synchronous_commit. This means that there will be a time gap between the success status and a guaranteed write to disk. In the case of a server crash, data might be lost even though the client received a success message on commit. In this case, a transaction commits very quickly because it will not wait for a WAL file to be flushed, but reliability is compromised.


This parameter sets the maximum number of temporary buffers used by each database session. The session local buffers are used only for access to temporary tables. The setting of this parameter can be changed within individual sessions but only before the first use of temporary tables within the session.

PostgreSQL database utilizes this memory area for holding the temporary tables of each session, these will be cleared when the connection is closed.

The default value of temp_buffer = 8MB.


There are more parameters that can be tuned to gain better performance but those have less impact than the ones highlighted here. In the end, we must always keep in mind that not all parameters are relevant for all applications types. Some applications perform better by tuning a parameter and some don’t. Tuning PostgreSQL Database Parameters must be done for the specific needs of an application and the OS it runs on.  

Also ,  Performance tuning does not only depends on postgres configuration parameters ,  there are many system parameters also on which postgresql performance depends So, in my next blog i will explain some system parameters which can affect postgresql performance .

Checklists – System is Compromised or Hacked – Part 1


As in my previous Blog where i explained how i came to know if my system is hacked or compromized (link here). Here in this blog i will explain what basic things we can check on our system when we have doubt if our system is compromized .

This Blogs have 3 parts

  • List of Checks which can determine if system is compromised or hacked – Part 1
  • List of checks which can give a direction how system is compromised or hacked – Part 2
  • What preventive steps (specially infra related) can be taken care to avoid hacking or to make system more secure – Part 3

Here , i am assuming system is Linux system with Centos installed .

List of Checks which can determine if system is compromised or hacked

  • Generally when hacker break into a linux system it is high chance that it will alter you main packages like openssh,kernel etc.. , So first if of please check if these packages are altered or there are some changes in the files or binaries provided by these packages . Following are commands to check on Centos
    • sudo rpm -qa | grep openssh | xargs -I '{}' sudo rpm -V '{}'
    • If therr are files shown by above command in which you did not change anything then it means there is high chance your system is compromised
  • Run rootkit Hunter to check if you system is compromised
    • Download rkhunter.tar.gz
    • copy it in /root and goto /root
    • tar zxvf rkhunter-1.4.2.tar.gz
    • cd rkhunter-1.4.2/
    • sh --layout default --install
    • changes in /etc/rkhunter.conf ENABLE_TESTS="all" DISABLE_TESTS="none" HASH_CMD=SHA1 HASH_FLD_IDX=4 PKGMGR=RPM 7
    • /usr/local/bin/rkhunter --propupd
    • /usr/local/bin/rkhunter --update
    • /usr/local/bin/rkhunter -c -sk 10.
    • note output or check and copy /var/log/rkhunter.log
    • you can also check the link for using rkhunter
  • Check /var/log/secure to check if there are many authentication failure requests and someone trying brute force to enter in to system
    • following will be the comand :
      • [root@localhost ~]# less /var/log/secure | grep 'authentication failures'
    • and output will be something like :
      • Apr 25 12:48:46 localhost sshd[2391]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost= user=root
      • Apr 25 12:49:33 localhost sshd[2575]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost= user=root
    • In above output you can see the rhost from where login attempt are made . If you see lots of entries like this then also check if at some point of time some login attempt will be successful from any of attempting rhosts . In secure logs accepted logs will looks something like as follows :
      • Apr 25 12:53:10 localhost sshd[3551]: Accepted password for root from port 36362 ssh2

  • Check in Processes if some unusual process is running and consuming high CPU using top and ps commands .
    • Command to list all process running in system : ps aux | less
    • Also check using top command if some unusual process trying to utiize high cpu
  • Check if there is some unusual entry in crontab of all users made on system
    • crontab -u <user> -l . by default user is root
  • Check if in , if some attacker has somehow made its entry in .ssh folder in every users’s home directory .

This was the Part 1 of the Blog , In later Parts i will explain some further checklist to ensure that you system will remain less hackable .

Thankyou .

Linux Machine Compromized/Broken -Power of Observation

Introduction :

In Debugging any issue or any dealing any problem or circumstance two things are important

  • Observation — Observation not only at the time of issue but in general times also .
  • Combining your general observations and Observations at the time of issue to conclude something .

In this Blog , I will explain the following :

  • What was happening on my machine
  • How i came to know my machine is broken into — Power of observation

What was happening on my machine

  • Load on my machine is going very high
  • On top command one process ./kwsapd0 is consuming around 3000% cpu

From here we get to know that kswapd is consuming process , The process kswapd0 is the process that manages virtual memory . So I thought that may be our some process is consuming more RAM and Virtual Memory is being used due to which kswapd process is doing its work but after hours of debugging we found no process is consuming RAM and around 80% RAM was free .

How i came to know my machine is broken into — Power of observation

There were two general observation which i observed and helped my geeting know what was the issue

  • 1st is kswapd process looks in top command like [kswapd] not ./kswapd
  • Kswapd0 can only consume 100% as it uses only one core in the machine .

From there I got to know that this kswapd0 is something unusual . On further debugging

I found ./.configrc/a/kswapd0 in root users directory .

Contents of this directory was :

$ find .configrc -type f

There was also an entry in cron to run this .

So , From all of this i got to know that my system was compromised .

Yet I was unable to find out how my system was broken into . But in my future Blog i will explain what things one can check if your system is compromised and how it is compromised and what all security we can apply to our system to make it less hackable .

Useful Postgresql Commands/Queries Part-1


There are many types of queries and of course all are important , but when it comes to monitoring and debugging , we are generally in a hurry and do not remember what to query to check the problem/issue though . We at that time thinks what’s can we find and what will be the query for this , we then search on internet all somehow make some queries and then after issue we forgot . Here in this Blog i will brief some queries and its uses specially in case of Monitoring and Debugging Postgresql Related issue .

In case of Monitoring and Debugging , We generally have following category :

  • Size Related Issues
  • Performance Related issues

Sometimes These issues are interlinked

Size Related Issues

Following will be basic postgres commands/queries which will be very helpful in monitoring or debugging size related issues :

  • How to check Size of Database — — many times we need to check that which database is culprit i.e… which database is consuming major space . Following is query to check size of database in decreasing order according to size consumed
SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) AS size FROM pg_database ORDER BY pg_database_size(pg_database.datname) desc;
  • After that We may need to Find Which Table/Index is consuming the most space for this we have the following query :
  nspname || '.' || relname AS "relation", pg_size_pretty(pg_relation_size(C.oid)) AS "size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(C.oid) DESC
  • Many Times There are Not much records in the Table But due to dead tuples size of table/index may increase , So to check which table has highest dead tuples following the the query :
  • First Goto the Database then :
SELECT schemaname,relname,n_live_tup,n_dead_tup,last_vacuum,last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC limit 10;

Sample Output:

schemaname | relname | n_live_tup | n_dead_tup | last_vacuum | last_autovacuum

— — — — — — + — — — — — — — — — — — — — — — — — — — — — — — — — + — — — — — — + — — — — — — + — — — — — — — — — — — — — — — — — + — — — — — — — — — — — — — — — — —

public | campaign_session_denormalized_data | 1123219 | 114268349 | 2021–01–10 18:27:34.050087+05:30 | 2021–01–19 14:08:58.062574+05:30

Here , from above output, we can also determine if auto vacuum is running properly or not i.e.. When last auto vacuum ran on this any particular table whose dead tuples are high .

Overtime due to MVCC, your table will grow in size (called table bloat) — this is why regular VACUUM is needed. This query will show you list of tables and indexes with the most bloats.

The value represents the number of “wasted bytes”, or the difference between what is actually used by the table and index, and what we compute that it should be.

The way it works is it estimates the optimized size of the table/index by a calculation from each row sizes times total rows, and compare that against the actual table size. Do note that this is an estimate, not an actual figure.

with foo as (
    schemaname, tablename, hdr, ma, bs,
    SUM((1-null_frac)*avg_width) AS datawidth,
    MAX(null_frac) AS maxfracsum,
      SELECT 1+COUNT(*)/8
      FROM pg_stats s2
      WHERE null_frac<>0 AND s2.schemaname = s.schemaname AND s2.tablename = s.tablename
    ) AS nullhdr
  FROM pg_stats s, (
      (SELECT current_setting('block_size')::NUMERIC) AS bs,
      CASE WHEN SUBSTRING(v,12,3) IN ('8.0','8.1','8.2') THEN 27 ELSE 23 END AS hdr,
      CASE WHEN v ~ 'mingw32' THEN 8 ELSE 4 END AS ma
    FROM (SELECT version() AS v) AS foo
  ) AS constants
  GROUP BY 1,2,3,4,5  
), rs as (
    (datawidth+(hdr+ma-(CASE WHEN hdr%ma=0 THEN ma ELSE hdr%ma END)))::NUMERIC AS datahdr,
    (maxfracsum*(nullhdr+ma-(CASE WHEN nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
  FROM foo  
), sml as (
    schemaname, tablename, cc.reltuples, cc.relpages, bs,
      (CASE WHEN datahdr%ma=0 THEN ma ELSE datahdr%ma END))+nullhdr2+4))/(bs-20::FLOAT)) AS otta,
    COALESCE(c2.relname,'?') AS iname, COALESCE(c2.reltuples,0) AS ituples, COALESCE(c2.relpages,0) AS ipages,
    COALESCE(CEIL((c2.reltuples*(datahdr-12))/(bs-20::FLOAT)),0) AS iotta -- very rough approximation, assumes all cols
  FROM rs
  JOIN pg_class cc ON cc.relname = rs.tablename
  JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = rs.schemaname AND nn.nspname <> 'information_schema'
  LEFT JOIN pg_index i ON indrelid = cc.oid
  LEFT JOIN pg_class c2 ON c2.oid = i.indexrelid
  current_database(), schemaname, tablename, /*reltuples::bigint, relpages::bigint, otta,*/
  ROUND((CASE WHEN otta=0 THEN 0.0 ELSE sml.relpages::FLOAT/otta END)::NUMERIC,1) AS tbloat,
  CASE WHEN relpages < otta THEN 0 ELSE bs*(sml.relpages-otta)::BIGINT END AS wastedbytes,
  iname, /*ituples::bigint, ipages::bigint, iotta,*/
  ROUND((CASE WHEN iotta=0 OR ipages=0 THEN 0.0 ELSE ipages::FLOAT/iotta END)::NUMERIC,1) AS ibloat,
  CASE WHEN ipages < iotta THEN 0 ELSE bs*(ipages-iotta) END AS wastedibytes
FROM sml
ORDER BY wastedbytes DESC

Query extracted but rewrote from checkpostgres

Sample Output:

current_database | schemaname | tablename | tbloat | wastedbytes | iname | ibloat | wastedibytes

— — — — — — — — — + — — — — — — + — — — — — — — — — — — — — — — — -+ — — — — + — — — — — — -+ — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — -+ — — — — + — — — — — — —

dashboard | public | job_logs | 1.1 | 4139507712 | job_logs_pkey | 0.2 | 0

dashboard | public | job_logs | 1.1 | 4139507712 | index_job_logs_on_job_id_and_created_at | 0.4 | 0

dashboard | public | events | 1.1 | 3571736576 | events_pkey | 0.1 | 0

dashboard | public | events | 1.1 | 3571736576 | index_events_on_tenant_id | 0.1 | 0

dashboard | public | events | 1.1 | 3571736576 | index_events_on_event_type | 0.2 | 0

dashboard | public | jobs | 1.1 | 2013282304 | index_jobs_on_status | 0.0 | 0

dashboard | public | jobs | 1.1 | 2013282304 | index_jobs_on_tag | 0.3 | 0

dashboard | public | jobs | 1.1 | 2013282304 | index_jobs_on_tenant_id | 0.2 | 0

dashboard | public | jobs | 1.1 | 2013282304 | index_jobs_on_created_at | 0.2 | 0

dashboard | public | jobs | 1.1 | 2013282304 | index_jobs_on_created_at_queued_or_running | 0.0 | 21086208

You should be looking at:

  • tbloat: table bloat, ratio between what it current is, and what it can be optimized to.
  • wastedbytes: number of bytes wasted
  • ibloat & wastedibytes: same as above, but for indexes.

When you see a table with high bloats, then consider running VACUUM ANALYZE on it,

Performance Related issues

For performance Related monitoring we need to run following queries too much in day to day scenario

  • Get Running Queries (And Lock statuses) in PostgreSQL
SELECT, age(clock_timestamp(), query_start),usename,query,L.mode,L.locktype,L.granted,s.datname FROM pg_stat_activity S inner join pg_locks L on = order by L.granted, DESC;

Above query many help if some particular query is running for hours and has taken some lock due to which others queries are stuck on lock . It also shows database name on which qurie is running . We can also determine if some connection are open and no query running so there may be some session leak in your database (for more for about session leak read following article)

  • To kill some particular query which is causing the issue following is the query :
SELECT pg_cancel_backend(pid);

This will send a SIGINT to the current process.

Hope It will help all of you in altleast some times in your debugging . Please comment and follow me.

A Database Session Leak Can Slow Down Your Database

In this article, I will explain some consequences of session leaks, which I faced when an issue came to me.

Session Leak is very common in developers’ lives. In this article, I will explain some consequences of session leaks, which I faced when an issue came to me and how I came to the root cause of the issue.

Issue and Analysis

IssueLoad is increasing on the server and Postgres queries are consuming CPU and taking time.

I had to resolve this issue and find the root cause of it.

Analysis: After all my debugging, I came to the following analysis:

There is a table that has 200 rows, but the number of live tuples showing there is more than that (around 60K). We are using Postgresql 9.3.

The following are the queries that I ran.

select count(*) from subscriber_offset_manager; 
200 (1 row) 

SELECT schemaname,relname,n_live_tup,n_dead_tup FROM pg_stat_user_tables where relname='subscriber_offset_manager' ORDER BY n_dead_tup ; 
schemaname | relname | n_live_tup | n_dead_tup 
public | subscriber_offset_manager | 61453 | 5 (1 row)

But as seen from pg_stat_activity and pg_locks, we are not able to track any open connection.

SELECT query, state,locktype,mode FROM pg_locks JOIN pg_stat_activity USING (pid) WHERE relation::regclass = 'subscriber_offset_manager'::regclass ; 
query | state | locktype | mode 
(0 rows)

I also tried full vacuum on this table. Below were the results:

  • All the times no rows were removed
  • Many times all the live tuples become dead tuples.

Here is the output of the running vacuum command:

vacuum FULL VERBOSE ANALYZE subscriber_offset_manager; 
INFO: vacuuming "public.subscriber_offset_manager" 
INFO: "subscriber_offset_manager": found 0 removable, 67920 nonremovable row versions in714 pages 
DETAIL: 67720 dead row versions cannot be removed yet. CPU 0.01s/0.06u sec elapsed 0.13 sec. 

INFO: analyzing "public.subscriber_offset_manager" 
INFO: "subscriber_offset_manager": scanned 710 of 710 pages, containing 200 live rows and67720 dead rows; 200 rows in sample, 200 estimated total rows VACUUM 
after that i checked for live and dead tuples for that table as follows : 

SELECT schemaname,relname,n_live_tup,n_dead_tup FROM pg_stat_user_tables where relname='subscriber_offset_manager' ORDER BY n_dead_tup 

schemaname | relname | n_live_tup | n_dead_tup 
public | subscriber_offset_manager | 200 | 67749

After 10 seconds:

SELECT schemaname,relname,n_live_tup,n_dead_tup FROM pg_stat_user_tables where relname='subscriber_offset_manager' ORDER BY n_dead_tup ;

schemaname | relname | n_live_tup | n_dead_tup
public | subscriber_offset_manager | 68325 | 132

All the dead tuples moved to live tuples instead of cleaning up.

One more interesting observation: When I stop my Java app and then do a full vacuum, it works fine (number of rows and live tuples become equal). So there is something wrong if we select and update continuously from the Java app.

After all the research and analysis and help for stack overflow and after following many links, I found the following root cause.

Root Cause:

When there is one long-running transaction or a database session leak, then dead tuples are created after the start time of that transaction and will not be cleaned up by the vacuum for all the tables for that database. This is due to the PostgreSQL vacuum process that checks for a transaction ID less than the transaction ID of the oldest transaction for cleaning dead rows. The transaction ID is generated globally.

When I checked, I found a transaction that was opened for too long and when I killed it, the vacuum worked fine.

Please read the below-given links for detailed info and effects of not ending database transaction and some Postgres internals. These links helped me a lot in solving the issue.

1.Question asked by me on StackOverFlow:

2. When Postgresql Vacuum does not Work:

3. Consequences of not ending a database transaction:

4.Some Postgres internals and how the vacuum works internally in Postgres