Elastic Stack Guide Part – 1

Nowadays many of our servers are deployed in the cloud, with numerous applications running on them, so it is impractical to monitor and analyze logs by logging in to each server individually. A central logging and monitoring solution is a must these days.

In this blog series, we will learn how to use the Elastic Stack, also known as ELK.

Overview:

Elastic Stack is a group of open source products from Elastic designed to help users take data from any type of source and in any format and search, analyze, and visualize that data in real time. The product group was formerly known as the ELK Stack, in which the letters stood for the products in the group: Elasticsearch, Logstash, and Kibana. A fourth product, Beats, was subsequently added to the stack, rendering the potential acronym unpronounceable. The Elastic Stack can be deployed on premises or made available as Software as a Service.

Architecture:

For a small development environment, the classic architecture looks as follows:

There are many different types of Beats; you can read about them at https://www.elastic.co/beats/ . Each Beat has a different set of use cases.

In this blog we will learn about two of them: Metricbeat and Filebeat.

Note – Logstash is an optional part of the architecture and is not needed in most cases. Read more about Logstash at https://www.elastic.co/logstash/

Using the Elastic Stack:

I am running these experiments on a CentOS 7 machine and using RPM packages to set up the Elastic Stack.

Elasticsearch Installation:

Commands to install Elasticsearch:

curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.14.0-x86_64.rpm
sudo rpm -i elasticsearch-7.14.0-x86_64.rpm
sudo service elasticsearch start

How to check whether Elasticsearch is running:

[root@localhost elk]# curl http://127.0.0.1:9200
{
  "name" : "localhost.localdomain",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "MxKYDoJAQRG9D6krdFThsQ",
  "version" : {
    "number" : "7.14.0",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "dd5a0a2acaa2045ff9624f3729fc8a6f40835aa1",
    "build_date" : "2021-07-29T20:49:32.864135063Z",
    "build_snapshot" : false,
    "lucene_version" : "8.9.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

If you get output like the above, Elasticsearch has been installed successfully.

Note: To change the listen address and port, edit the following file: /etc/elasticsearch/elasticsearch.yml
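For example, to make Elasticsearch listen on all interfaces, the settings below can be set in that file (a minimal sketch, assuming a single-node setup; the values are illustrative, not taken from my machine):

network.host: 0.0.0.0          # listen address; 0.0.0.0 exposes the node on the network
http.port: 9200                # HTTP listen port
discovery.type: single-node    # on a single node this avoids the production discovery checks once a non-loopback address is used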

Kibana:

Kibana is the front-end tool that communicates with Elasticsearch and lets anyone monitor and analyze logs.

Commands to install Kibana:

curl -L -O https://artifacts.elastic.co/downloads/kibana/kibana-7.14.0-linux-x86_64.tar.gz
tar xzvf kibana-7.14.0-linux-x86_64.tar.gz
cd kibana-7.14.0-linux-x86_64/
./bin/kibana

Access Kibana at the URL:

http://127.0.0.1:5601/app/home#/

Note: edit config/kibana.yml to configure the listen port and IP address.
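A minimal sketch of the relevant kibana.yml settings (illustrative values, assuming Elasticsearch runs on the same host):

server.port: 5601                              # port Kibana listens on
server.host: "0.0.0.0"                         # listen address; the default is localhost only
elasticsearch.hosts: ["http://127.0.0.1:9200"] # where Kibana reaches Elasticsearch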

Beats  

Beats are installed on all the servers from which we want to collect information. They act like agents that ship data to Elasticsearch.

Enabling Metricbeat:

Every Beat supports different modules, and it is up to the user which modules to enable in each Beat. Metricbeat, for example, has many modules such as System, PostgreSQL, Nginx, and so on. In this blog we will see the usage of Metricbeat's System module.

Commands to install Metricbeat:
curl -L -O https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-7.14.0-x86_64.rpm
sudo rpm -vi metricbeat-7.14.0-x86_64.rpm
Enabling the System module of Metricbeat:
sudo metricbeat modules enable system
sudo metricbeat setup -e
sudo service metricbeat start
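Once Metricbeat is started, a quick way to verify that data is reaching Elasticsearch is to check for metricbeat indices (a hedged check, assuming the default output to localhost:9200):

sudo systemctl status metricbeat                          # confirm the service is running
curl 'http://127.0.0.1:9200/_cat/indices/metricbeat-*?v'  # metricbeat-* indices should appear and keep growing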

Here we are only enabling the System module of Metricbeat; there are many other modules for basic monitoring of applications like PostgreSQL, Nginx, Tomcat, etc.

To list the modules available in Metricbeat, the command is:

metricbeat modules list  

Yippee! Now we can monitor system data in Kibana as follows.

Open the [Metricbeat System] Host overview ECS dashboard under Dashboards in the Kibana UI. There you can apply a host filter to see the data for the host you are interested in.

Uses of the Metricbeat System module: what analysis can be done with it?

Traditionally, we log in to Linux servers and gather system information using many different commands and tools, which takes time, especially when there is a live issue in production.

Following is the list of information we typically need:

  1. Size information of all partitions
  2. Read/write performance of the hard disk
  3. Inbound/outbound traffic analysis per Ethernet port
  4. Load average analysis of the system
  5. Top processes consuming high CPU and RAM

All of this information can now be seen in seconds for any particular host using the Kibana UI.

Following are some screenshots  : 

Enabling Filebeat

Whether you’re collecting from security devices, cloud, containers, hosts, or OT, Filebeat helps you keep the simple things simple by offering a lightweight way to forward and centralize logs and files.

Commands to install Filebeat : 
curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-7.14.0-x86_64.rpm
rpm -ivh filebeat-7.14.0-x86_64.rpm

Note: To configure where Filebeat sends its data (Elasticsearch or Logstash), edit /etc/filebeat/filebeat.yml. Currently I have only one machine, so I do not need to change anything; the default configuration works for me. You can check the following link: https://www.elastic.co/guide/en/beats/filebeat/7.14/configuring-howto-filebeat.html
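For reference, the relevant part of /etc/filebeat/filebeat.yml looks roughly like this when shipping straight to a local Elasticsearch (a sketch of the defaults; adjust the hosts for a remote setup):

output.elasticsearch:
  hosts: ["127.0.0.1:9200"]   # where Filebeat ships events
setup.kibana:
  host: "127.0.0.1:5601"      # used by 'filebeat setup' to load the dashboards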

Enabling the system logs module in Filebeat:
filebeat modules enable system
(If we want to set custom paths for the system logs, edit /etc/filebeat/modules.d/system.yml; see the sketch after these commands. Generally there is no need to change this configuration.)

filebeat setup -e
sudo service filebeat start
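If you do need custom paths, /etc/filebeat/modules.d/system.yml can be edited roughly as below (a sketch; the example paths are the CentOS defaults and only need changing if your logs live elsewhere):

- module: system
  syslog:
    enabled: true
    var.paths: ["/var/log/messages*"]   # syslog files to read
  auth:
    enabled: true
    var.paths: ["/var/log/secure*"]     # authentication log files to read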

Like Metricbeat, Filebeat also has a list of modules such as PostgreSQL and Nginx, and it also supports the log formats of popular frameworks like Spring, so it can collect the logs of these applications and provide ways to analyze them easily.

To check the list of modules available for Filebeat, use the following command:

[root@localhost elk]# filebeat modules list | less

Uses of the Filebeat System module:

Now you can use the Kibana UI to analyze system logs such as /var/log/messages, etc.

Open the [Filebeat System] Syslog dashboard ECS under the Dashboard tab in Kibana.

Following are some screenshots:

 

Configuring Filebeat for custom log files:

We may have a situation where none of the modules or framework-logging integrations in Filebeat work for our custom application logs. In that case you can configure an input manually, specify the path of the logs to read, and analyze them in the Logs stream section of the Kibana UI.

Follow this link to configure your log path: https://www.elastic.co/guide/en/beats/filebeat/current/configuration-filebeat-options.html
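A minimal sketch of such a manual input in /etc/filebeat/filebeat.yml, assuming a hypothetical application writing to /var/log/myapp/:

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/myapp/*.log   # hypothetical path; point this at your application's log files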

You can watch the logs at: http://127.0.0.1:5601/app/logs/stream

Here you can search the logs by hostname or file path, and you can also search within the whole message that was fetched.

By default only the message column is shown. You can configure the columns in the Settings tab of the Logs tab in Kibana.

Following are some screenshots:

By default each log line is a single column; if, for advanced debugging, we want to break a log line into columns, we need to use Logstash with a Grok filter.

In the next blog we will see how to use Logstash to break custom logs into columns for better understanding.

Thank you all.

Practical grep command examples useful in real-world debugging in Linux

In our daily debugging we need to analyze log files of various products. Reading those log files is not an easy task; it requires special debugging skills which can only be gained through experience or by god's grace. While debugging we might need to extract some data or play with a log file in ways that cannot be done just by reading it, so we need commands.

There are many commands in Linux that are used while debugging, such as grep, awk, sed, wc, taskset, ps, sort, uniq, cut, xargs, etc.

In this blog we will see practical grep command examples that are useful in real-world debugging in Linux. The examples are super basic but very useful in real life, and a beginner should read them to enhance their debugging skills.

Let's go to the practical part.

  • Grep the lines which contain a particular word
[root@localhost playground]# cat file1.log 
hello 
i am sahil 
i am software engineer 
Sahil is a software engineer
sahil is a software engineer

[root@localhost playground]# grep 'sahil' file1.log 
i am sahil 
sahil is a software engineer

  • Count the number of lines matching a particular word in a file
[root@localhost playground]# cat file1.log 
hello 
i am sahil 
i am software engineer 
Sahil is a software engineer
sahil is a software engineer

[root@localhost playground]# grep -c 'sahil' file1.log 
2

Another way :
[root@localhost playground]# grep 'sahil' file1.log  | wc -l
2

  • Grep all the lines containing a word in a file, case-insensitively
[root@localhost playground]# cat file1.log 
hello 
i am sahil 
i am software engineer 
Sahil is a software engineer
sahil is a software engineer

[root@localhost playground]# grep -i 'sahil' file1.log 
i am sahil 
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# 

  • Grep the lines in which either of two words is present in a file
[root@localhost playground]# cat file1.log 
hello 
i am sahil 
i am software engineer 
Sahil is a software engineer
sahil is a software engineer

[root@localhost playground]# grep  'sahil\|software' file1.log 
i am sahil 
i am software engineer 
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# 


  • Grep lines in which both of two words are present
[root@localhost playground]# cat file1.log 
hello 
i am sahil 
i am software engineer 
Sahil is a software engineer
sahil is a software engineer

[root@localhost playground]# grep  'sahil' file1.log | grep 'software' 
sahil is a software engineer
[root@localhost playground]# ^C
[root@localhost playground]# 


  • Eliminate lines which contain a word
[root@localhost playground]# cat file1.log
hello
i am sahil
i am software engineer
Sahil is a software engineer
sahil is a software engineer

[root@localhost playground]# grep -v 'sahil' file1.log
hello
i am software engineer
Sahil is a software engineer

Eliminate case-insensitively:
[root@localhost playground]# grep -iv 'sahil' file1.log
hello
i am software engineer
[root@localhost playground]#

  • Matching the lines that start with a string
[root@localhost playground]# cat file1.log 
hello 
i am sahil 
i am software engineer 
Sahil is a software engineer
sahil is a software engineer

[root@localhost playground]# grep  '^sahil' file1.log 
sahil is a software engineer


  • Matching the lines that end with a string
[root@localhost playground]# cat file1.log 
hello 
i am sahil 
i am software engineer
Sahil is a software engineer
sahil is a software engineer

[root@localhost playground]# grep  'engineer$' file1.log 
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]#

  • Getting n lines after each match
[root@localhost playground]# cat file1.log 
hello 
i am sahil 
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# 

[root@localhost playground]# grep  'hello' file1.log 
hello 

[root@localhost playground]# grep -A 1  'hello' file1.log 
hello 
i am sahil 

[root@localhost playground]# grep -A 2  'hello' file1.log 
hello 
i am sahil 
i am software engineer


  • Getting n lines before each match
[root@localhost playground]# cat file1.log 
hello 
i am sahil 
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# grep  'i am sahil' file1.log 
i am sahil 
[root@localhost playground]# grep  -B 1 'i am sahil' file1.log 
hello 
i am sahil 
[root@localhost playground]# grep  -B 2 'i am sahil' file1.log 
hello 
i am sahil 
[root@localhost playground]# 


  • Grep n lines after and m lines before every match
[root@localhost playground]# cat file1.log 
hello 
i am sahil 
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# grep -A 2  -B 1 'i am sahil' file1.log 
hello 
i am sahil 
i am software engineer
Sahil is a software engineer
[root@localhost playground]# 

  • Grep a word in more than one file in the current directory
[root@localhost playground]# cat file1.log 
hello 
i am sahil 
i am software engineer
Sahil is a software engineer
sahil is a software engineer

[root@localhost playground]# cat file2.log 
hello 
i am sahil  
i am tech blogger
Sahil is a tech blogger
sahil is a tech blogger

[root@localhost playground]# grep  'sahil' file1.log file2.log 
file1.log:i am sahil 
file1.log:sahil is a software engineer
file2.log:i am sahil  
file2.log:sahil is a tech blogger

  • Grep a word in all files in the current directory
[root@localhost playground]# grep  'sahil' *
file1.log:i am sahil 
file1.log:sahil is a software engineer
file2.log:i am sahil  
file2.log:sahil is a tech blogger
[root@localhost playground]# 

  • Check how many lines matched in each file
[root@localhost playground]# grep -c  'sahil' *
file1.log:2
file2.log:2
file.log:0

  • Grep using regular expressions —
    • Suppose the content of the file is as follows:
[root@localhost playground]# cat file3.log 
time taken by api is 1211 ms 
time taken by api is 2000 ms 
time taken by api is 3000 ms 
time taken by api is 4000 ms 
time taken by api is 50000 ms 
time taken by api is 123 ms 
time taken by api is 213 ms 
time taken by api is 456 ms 
time taken by api is 1000 ms 

Now suppose we want to grep all the lines in which the time taken by an API is 1 second (1000 ms) or more, i.e. the value should have at least 4 digits.

The grep command for this will be as follows:

[root@localhost playground]# grep -P '[0-9]{4} ms' file3.log
time taken by api is 1211 ms 
time taken by api is 2000 ms 
time taken by api is 3000 ms 
time taken by api is 4000 ms 
time taken by api is 50000 ms 
time taken by api is 1000 ms 

If we want to match 5-digit numbers:

[root@localhost playground]# grep -P '[0-9]{5} ms' file3.log
time taken by api is 50000 ms 

  • Recursively grep in a directory and its subdirectories
[root@localhost playground]# grep -R 'sahil' .
./dir1/file.log:i am sahil 
./dir1/file.log:sahil is a software engineer
./file1.log:i am sahil 
./file1.log:sahil is a software engineer
./file2.log:i am sahil  
./file2.log:sahil is a tech blogger
[root@localhost playground]#

All of the above are basic use cases of grep. One can combine grep's options, and also chain different grep commands using the pipe operator, to achieve more complex use cases, as in the sketch below.
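For instance, combining recursive, case-insensitive search with an exclusion and a count (a small sketch using the sample files above):

grep -Ri 'sahil' . | grep -v 'blogger' | wc -l   # count lines mentioning sahil that are not about the blogger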

In future blogs I will explain some complex use cases and examples of how to achieve them using Linux commands, which can ease log debugging.

Stay Tuned . . .

Tool for debugging Java Stack Traces – Untangle Threads

FastThreadAnalyzer is a tool which helps you debug your Java stack traces.

It provides various insights about the stacks, clubs similar stacks together, and points you in the right direction or, in some cases, to the exact issue.

Some screenshots of the tool:

Threads of the same thread pool grouped by thread state:

This provides a clear view from which you can easily deduce whether all threads of a particular pool are busy or not.

Here you see these threads grouped by similar stacks, which provides great insight into what type of flows are stuck:

Overall Thread State Wise Grouping:

Thread Dependencies:

You can very easily see which thread IDs these threads are blocked on.

There are various other types of stack grouping as well, such as by stack length, deadlocks, and so on.

If your org works with Java, this should be the default tool for debugging application issues.

Load Average in Linux Servers – Confusion Solved

Regarding the load average shown in Linux, there is a lot of confusion around the world, such as:

  • What is the load average shown in the top command?
  • What do these values represent?
  • When will it be high and when will it be low?
  • When should it be considered critical?
  • In which scenarios can it increase?

In this blog we will talk about the answers to all of these.

What are the three values shown in the image above?

The three numbers represent averages over progressively longer periods of time (one-, five-, and fifteen-minute averages), and lower numbers are better. Higher numbers indicate a problem or an overloaded machine.

Before getting into what is a good value, what is a bad value, and what factors can affect these values, let us understand them on a machine with a single-core processor.

The traffic analogy

A single-core CPU is like a single lane of traffic. Imagine you are a bridge operator … sometimes your bridge is so busy there are cars lined up to cross. You want to let folks know how traffic is moving on your bridge. A decent metric would be how many cars are waiting at a particular time. If no cars are waiting, incoming drivers know they can drive across right away. If cars are backed up, drivers know they’re in for delays.

So, Bridge Operator, what numbering system are you going to use? How about:

  • 0.00 means there’s no traffic on the bridge at all. In fact, between 0.00 and 1.00 means there’s no backup, and an arriving car will just go right on.
  • 1.00 means the bridge is exactly at capacity. All is still good, but if traffic gets a little heavier, things are going to slow down.
  • over 1.00 means there’s backup. How much? Well, 2.00 means that there are two lanes worth of cars total — one lane’s worth on the bridge, and one lane’s worth waiting. 3.00 means there are three lanes worth total — one lane’s worth on the bridge, and two lanes’ worth waiting. Etc.

Like the bridge operator, you’d like your cars/processes to never be waiting. So, your CPU load should ideally stay below 1.00. Also, like the bridge operator, you are still ok if you get some temporary spikes above 1.00 … but when you’re consistently above 1.00, you need to worry.

So you’re saying the ideal load is 1.00?

Well, not exactly. The problem with a load of 1.00 is that you have no headroom. In practice, many sysadmins will draw a line at 0.70:

But nowadays we have multi-core and multi-processor systems.

Got a quad-processor system? It’s still healthy with a load of 3.00.

On a multi-processor system, the load is relative to the number of processor cores available. The “100% utilization” mark is 1.00 on a single-core system, 2.00 on a dual-core, 4.00 on a quad-core, etc.

If we go back to the bridge analogy, the “1.00” really means “one lane’s worth of traffic”. On a one-lane bridge, that means it’s filled up. On a two-lane bridge, a load of 1.00 means it’s at 50% capacity — only one lane is full, so there’s another whole lane that can be filled.

Same with CPUs: a load of 1.00 is 100% CPU utilization on a single-core box. On a dual-core box, a load of 2.00 is 100% CPU utilization.

Which leads us to two new Rules of Thumb:

  • The “number of cores = max load” Rule of Thumb: on a multicore system, your load should not exceed the number of cores available.
  • The “cores is cores” Rule of Thumb: How the cores are spread out over CPUs doesn’t matter. Two quad-cores == four dual-cores == eight single-cores. It’s all eight cores for these purposes.

But what should we conclude from this: if the load is going beyond the number of cores, are we short of cores?

Well, not exactly. For this we need to further debug and analyze the top command's data to come to a conclusion.
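The raw numbers themselves can be pulled quickly from the shell (a minimal sketch; the exact top layout varies by version):

nproc                  # number of CPU cores available
uptime                 # prints the 1, 5 and 15 minute load averages
top -bn1 | head -n 5   # batch-mode top: load averages plus the %Cpu line (us, sy, wa, si, st)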

In the top output, the highlighted part shows the CPU used by user processes (us) and by system processes (sy). If these values are around 99-100%, it means there is a crunch of CPU cores on your system or some process is consuming too much CPU. In this case, either increase the cores or optimize the application that is consuming more CPU.

Now let’s take another scenario : 

In the image above, the highlighted part shows the amount of time the CPU spends waiting on input/output (I/O, the wa value). If this value goes above, say, 80%, the load average on the server will also increase. It means either your disk read/write speed is slow, or your applications are reading/writing more than the system can handle. In this case, either diagnose your hard disk read/write speed or check why your application is reading/writing so much.

Now let’s take one more scenario : 

If the highlighted value above goes beyond a certain limit, it means soft interrupts (si) are consuming CPU. This can be due to network/disk interrupts: either they are not getting enough CPU, or there is some misconfiguration in your Ethernet ports because of which interrupts cannot keep up with receiving or transmitting packets. These types of problems occur more in VM environments than on physical machines.

Now, let's take one last scenario:

The part above will help you in a virtual machine environment.

If %st (steal time) increases beyond a certain limit, say it stays above 50%, it means you are only getting half of the CPU time from the base machine and someone else is consuming your CPU time, as you may be on shared CPU infrastructure. In this case too, the load can increase.

I hope this covers all the major scenarios in which the load average can increase.

There are many points still open here, like how to check read/write speed, how to check which application is consuming more CPU, and how to check which interrupts are causing problems. For all of these, stay tuned to Hello Worlds.

References: https://scoutapm.com/blog/understanding-load-averages

Useful Postgresql Commands/Queries Part-1

Introduction

There are many types of queries and of course all are important, but when it comes to monitoring and debugging, we are generally in a hurry and do not remember what to query to check the problem. At that moment we wonder what we can find out and what the query for it would be; we search the internet, somehow put together some queries, and after the issue is over we forget them. In this blog I will describe some queries and their uses, especially for monitoring and debugging PostgreSQL-related issues.

For monitoring and debugging, we generally have the following categories:

  • Size Related Issues
  • Performance Related issues

Sometimes these issues are interlinked.

Size Related Issues

Following are basic Postgres commands/queries which are very helpful in monitoring or debugging size-related issues:

  • How to check the size of a database — many times we need to check which database is the culprit, i.e. which database is consuming the most space. Following is a query to list databases in decreasing order of size:
SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) AS size FROM pg_database ORDER BY pg_database_size(pg_database.datname) desc;
  • After that we may need to find which table/index is consuming the most space; for this we have the following query:
SELECT
  nspname || '.' || relname AS "relation", pg_size_pretty(pg_relation_size(C.oid)) AS "size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(C.oid) DESC
LIMIT 20;
  • Many times there are not many records in a table, but the size of the table/index may still grow due to dead tuples. To check which table has the highest number of dead tuples, the query is:
  • First connect to the database, then:
SELECT schemaname,relname,n_live_tup,n_dead_tup,last_vacuum,last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC limit 10;

Sample Output:

 schemaname |               relname               | n_live_tup | n_dead_tup |           last_vacuum            |         last_autovacuum
------------+-------------------------------------+------------+------------+----------------------------------+----------------------------------
 public     | campaign_session_denormalized_data  |    1123219 |  114268349 | 2021-01-10 18:27:34.050087+05:30 | 2021-01-19 14:08:58.062574+05:30

Here, from the above output, we can also determine whether autovacuum is running properly or not, i.e. when autovacuum last ran on any particular table whose dead tuples are high.

Over time, due to MVCC, your table will grow in size (called table bloat) — this is why regular VACUUM is needed. The following query will show you the list of tables and indexes with the most bloat.

The value represents the number of “wasted bytes”, or the difference between what is actually used by the table and index, and what we compute that it should be.

The way it works is that it estimates the optimized size of the table/index by calculating row size times total rows, and compares that against the actual table size. Do note that this is an estimate, not an exact figure.

with foo as (
  SELECT
    schemaname, tablename, hdr, ma, bs,
    SUM((1-null_frac)*avg_width) AS datawidth,
    MAX(null_frac) AS maxfracsum,
    hdr+(
      SELECT 1+COUNT(*)/8
      FROM pg_stats s2
      WHERE null_frac<>0 AND s2.schemaname = s.schemaname AND s2.tablename = s.tablename
    ) AS nullhdr
  FROM pg_stats s, (
    SELECT
      (SELECT current_setting('block_size')::NUMERIC) AS bs,
      CASE WHEN SUBSTRING(v,12,3) IN ('8.0','8.1','8.2') THEN 27 ELSE 23 END AS hdr,
      CASE WHEN v ~ 'mingw32' THEN 8 ELSE 4 END AS ma
    FROM (SELECT version() AS v) AS foo
  ) AS constants
  GROUP BY 1,2,3,4,5  
), rs as (
  SELECT
    ma,bs,schemaname,tablename,
    (datawidth+(hdr+ma-(CASE WHEN hdr%ma=0 THEN ma ELSE hdr%ma END)))::NUMERIC AS datahdr,
    (maxfracsum*(nullhdr+ma-(CASE WHEN nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
  FROM foo  
), sml as (
  SELECT
    schemaname, tablename, cc.reltuples, cc.relpages, bs,
    CEIL((cc.reltuples*((datahdr+ma-
      (CASE WHEN datahdr%ma=0 THEN ma ELSE datahdr%ma END))+nullhdr2+4))/(bs-20::FLOAT)) AS otta,
    COALESCE(c2.relname,'?') AS iname, COALESCE(c2.reltuples,0) AS ituples, COALESCE(c2.relpages,0) AS ipages,
    COALESCE(CEIL((c2.reltuples*(datahdr-12))/(bs-20::FLOAT)),0) AS iotta -- very rough approximation, assumes all cols
  FROM rs
  JOIN pg_class cc ON cc.relname = rs.tablename
  JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = rs.schemaname AND nn.nspname <> 'information_schema'
  LEFT JOIN pg_index i ON indrelid = cc.oid
  LEFT JOIN pg_class c2 ON c2.oid = i.indexrelid
)
SELECT
  current_database(), schemaname, tablename, /*reltuples::bigint, relpages::bigint, otta,*/
  ROUND((CASE WHEN otta=0 THEN 0.0 ELSE sml.relpages::FLOAT/otta END)::NUMERIC,1) AS tbloat,
  CASE WHEN relpages < otta THEN 0 ELSE bs*(sml.relpages-otta)::BIGINT END AS wastedbytes,
  iname, /*ituples::bigint, ipages::bigint, iotta,*/
  ROUND((CASE WHEN iotta=0 OR ipages=0 THEN 0.0 ELSE ipages::FLOAT/iotta END)::NUMERIC,1) AS ibloat,
  CASE WHEN ipages < iotta THEN 0 ELSE bs*(ipages-iotta) END AS wastedibytes
FROM sml
ORDER BY wastedbytes DESC

Query extracted and adapted from check_postgres.

Sample Output:

 current_database | schemaname | tablename | tbloat | wastedbytes |                   iname                    | ibloat | wastedibytes
------------------+------------+-----------+--------+-------------+--------------------------------------------+--------+--------------
 dashboard        | public     | job_logs  |    1.1 |  4139507712 | job_logs_pkey                              |    0.2 |            0
 dashboard        | public     | job_logs  |    1.1 |  4139507712 | index_job_logs_on_job_id_and_created_at    |    0.4 |            0
 dashboard        | public     | events    |    1.1 |  3571736576 | events_pkey                                |    0.1 |            0
 dashboard        | public     | events    |    1.1 |  3571736576 | index_events_on_tenant_id                  |    0.1 |            0
 dashboard        | public     | events    |    1.1 |  3571736576 | index_events_on_event_type                 |    0.2 |            0
 dashboard        | public     | jobs      |    1.1 |  2013282304 | index_jobs_on_status                       |    0.0 |            0
 dashboard        | public     | jobs      |    1.1 |  2013282304 | index_jobs_on_tag                          |    0.3 |            0
 dashboard        | public     | jobs      |    1.1 |  2013282304 | index_jobs_on_tenant_id                    |    0.2 |            0
 dashboard        | public     | jobs      |    1.1 |  2013282304 | index_jobs_on_created_at                   |    0.2 |            0
 dashboard        | public     | jobs      |    1.1 |  2013282304 | index_jobs_on_created_at_queued_or_running |    0.0 |     21086208

You should be looking at:

  • tbloat: table bloat, the ratio between what it currently is and what it can be optimized to.
  • wastedbytes: number of bytes wasted
  • ibloat & wastedibytes: same as above, but for indexes.

When you see a table with high bloat, consider running VACUUM ANALYZE on it, for example as sketched below.
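As a sketch, using the dashboard database and the job_logs table from the sample output above:

psql -d dashboard -c 'VACUUM (ANALYZE, VERBOSE) job_logs;'   # reclaims dead tuples and refreshes planner statistics

Note that a plain VACUUM makes the space reusable inside the table but usually does not shrink the file on disk; VACUUM FULL does, at the cost of an exclusive lock on the table.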

Performance Related issues

For performance-related monitoring, we need the following queries very often in day-to-day scenarios.

  • Get Running Queries (And Lock statuses) in PostgreSQL
SELECT S.pid, age(clock_timestamp(), query_start),usename,query,L.mode,L.locktype,L.granted,s.datname FROM pg_stat_activity S inner join pg_locks L on S.pid = L.pid order by L.granted, L.pid DESC;

The above query may help if some particular query has been running for hours and has taken a lock because of which other queries are stuck waiting. It also shows the database on which each query is running. We can also determine whether some connections are open with no query running, which may indicate a session leak in your database (for more about session leaks, read the following article).

  • To cancel a particular query which is causing the issue, the query is:
SELECT pg_cancel_backend(pid);

This sends a SIGINT to the backend process, cancelling its current query.
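If cancelling is not enough (for example the backend is idle in a transaction and still holding locks), pg_terminate_backend can be used to drop the whole connection. A sketch of both from the shell, with 12345 as a placeholder pid taken from the pg_stat_activity query above and mydb as a placeholder database name:

psql -d mydb -c 'SELECT pg_cancel_backend(12345);'      # SIGINT: cancels only the running query
psql -d mydb -c 'SELECT pg_terminate_backend(12345);'   # SIGTERM: terminates the whole backend/connection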

Hope this helps all of you at least some of the time in your debugging. Please comment and follow me.