In the last blog we learned about the different thread states in Java; in this blog we will understand how to use that knowledge to debug application issues using the fastthreadanalyzer tool.
Let's take one type of issue we see with Java applications:
Application Unresponsive Issue
When we say an application is unresponsive (it can mean different things to different people), here we mean that the application is not responding to external API calls.
Let's take the example of a Spring Boot application not responding to HTTP API calls. There can be several reasons for this:
All HTTP threads in Tomcat (or whichever container Spring is using) are consumed
Causes :
It could be that some CPU-intensive work is being done in those threads and all of them are stuck doing it (to connect this with thread states: threads consuming CPU are in the RUNNABLE state, so we should be looking for a lot of RUNNABLE threads in the jstack).
It could be that those threads are waiting on some external IO (to connect with thread states: threads logically stuck waiting for IO to complete will be in the WAITING/BLOCKED state, so we should be looking for threads in those states).
How to debug via jstack:
Take multiple jstacks
Now, to understand what the different threads are doing in each jstack and which states they are stuck in, we will use the tool https://fastthreadanalyzer.com/. Just upload your jstack to this tool.
It will show you a table like this: thread group (HTTP threads are generally part of a group named like http-bio-443-exec) vs. count (total number of threads in this group) vs. thread states (count of each thread state among these threads):
To make sense of the information above, we will first see whether our HTTP threads are exhausted or not:
If the thread count shown in the tool equals the max thread count configured in tomcat.conf (or your container's configuration), it means all HTTP threads are busy and new requests cannot be processed.
Yes:
If they are all consumed, look at their thread states to see what they are doing. If most of them are RUNNABLE,
it means something in your application is taking a long time to complete, or the system is being bombarded with many HTTP calls.
If you see these threads stuck in WAITING/TIMED_WAITING/BLOCKED, that most probably means they are doing some IO and waiting on it.
No:
The issue lies elsewhere and may be related to the JVM.
To dig further into exactly where the threads are waiting or stuck, you can click on the corresponding thread group and the tool will show what those threads are doing, grouped by similar stack and thread state, e.g.:
Now you can see the threads grouped by state and stack; using this information you can figure out which service in the application is actually consuming the threads.
There could be many other reasons, like the JVM being stuck or the machine being unresponsive; we are not going into detail on those here.
With the fast thread tool you can debug many such issues; we will cover more types of issues in future posts.
The purpose of this blog series is to learn how to debug your Java application issues; for that, we will first understand the different thread states.
Let's understand the different states of a Java thread:
In the above diagram you can see the 6 states of a Java thread:
New: When a new thread is created, it is in the new state. The thread has not yet started to run; its code has not begun executing.
Runnable/Running: A thread that is ready to run is in the runnable state. In this state, the thread might actually be running, or it might be ready to run at any instant. It is the responsibility of the thread scheduler to give the thread time to run. Each thread runs for a short while, then pauses and relinquishes the CPU so another thread can run; all the threads that are ready to run and waiting for the CPU, along with the currently running thread, are in the runnable state.
Timed Waiting: A thread is in the timed waiting state when it calls a method with a timeout parameter. It stays in this state until the timeout expires or a notification is received. For example, when a thread calls sleep or a conditional wait with a timeout, it moves to the timed waiting state.
Waiting: A thread is in the waiting state when it waits for another thread on a condition. When the condition is fulfilled, the scheduler is notified and the waiting thread moves back to the runnable state.
Blocked: A thread is in the blocked state when it tries to enter a protected section of code that is currently locked by another thread. When the section is unlocked, the scheduler picks one of the blocked threads and moves it to the runnable state. A thread in this state cannot continue until it becomes runnable. Threads in the waiting, timed waiting, and blocked states do not consume CPU cycles.
Terminated: A thread terminates for one of the following reasons:
It exits normally: the thread's code has finished executing.
Some unusual erroneous event occurred, like a segmentation fault or an unhandled exception.
Sample code for creating threads with different thread states:
public class ThreadStatesDemo {

    // Ends up in WAITING: waits indefinitely on a local object's monitor.
    public static class WaitingThread extends Thread {
        @Override
        public void run() {
            Object o = new Object();
            try {
                synchronized (o) {
                    o.wait();
                }
            } catch (InterruptedException e) {
            }
        }
    }

    // Ends up in TIMED_WAITING (sleeping): sleeps for a long time.
    public static class SleepingThread extends Thread {
        @Override
        public void run() {
            try {
                Thread.sleep(100000);
            } catch (InterruptedException e) {
            }
        }
    }

    // Stays RUNNABLE: spins forever in a busy loop.
    public static class RunningThread extends Thread {
        @Override
        public void run() {
            for (int i = 1; i > 0;) {
            }
        }
    }

    // Ends up in TIMED_WAITING (on object monitor): wait() with a timeout.
    public static class TimedWaitingThread extends Thread {
        @Override
        public void run() {
            Object o = new Object();
            try {
                synchronized (o) {
                    o.wait(100000);
                }
            } catch (InterruptedException e) {
            }
        }
    }

    public static Integer mutex = 0;

    // Ends up in BLOCKED: tries to take the lock held by BlockingThread.
    public static class BlockedThread extends Thread {
        @Override
        public void run() {
            try {
                Thread.sleep(1000);
                synchronized (mutex) {
                }
            } catch (InterruptedException e) {
            }
        }
    }

    // Takes the lock first (RUNNABLE) and never releases it.
    public static class BlockingThread extends Thread {
        @Override
        public void run() {
            synchronized (mutex) {
                for (int i = 1; i > 0;) {
                }
            }
        }
    }

    public static void main(String[] args) {
        Thread wTh = new WaitingThread();
        wTh.setName("waiting");
        wTh.start();
        Thread sTh = new SleepingThread();
        sTh.setName("sleeping");
        sTh.start();
        Thread rTh = new RunningThread();
        rTh.setName("running");
        rTh.start();
        Thread twTh = new TimedWaitingThread();
        twTh.setName("timed waiting");
        twTh.start();
        Thread bldTh = new BlockedThread();
        bldTh.setName("blocked");
        bldTh.start();
        Thread blcTh = new BlockingThread();
        blcTh.setName("blocking");
        blcTh.start();
        // Keep the main thread alive while you take a jstack.
        try {
            Thread.sleep(1000000);
        } catch (InterruptedException e) {
        }
    }
}
When you run the application and take a thread dump via the jstack command, you will get output like this:
#command to take jstack
jstack -l <pid>
2021-10-18 17:20:34
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.221-b11 mixed mode):
"blocking" #15 prio=5 os_prio=0 tid=0x00007f1ee411e800 nid=0xc99 runnable [0x00007f1eae09d000]
java.lang.Thread.State: RUNNABLE
at ThreadStatesDemo$BlockingThread.run(ThreadStatesDemo.java:69)
- locked <0x000000076e5c0bb0> (a java.lang.Integer)
Locked ownable synchronizers:
- None
"blocked" #14 prio=5 os_prio=0 tid=0x00007f1ee411c800 nid=0xc98 waiting for monitor entry [0x00007f1eae19e000]
java.lang.Thread.State: BLOCKED (on object monitor)
at ThreadStatesDemo$BlockedThread.run(ThreadStatesDemo.java:56)
- waiting to lock <0x000000076e5c0bb0> (a java.lang.Integer)
Locked ownable synchronizers:
- None
"timed waiting" #13 prio=5 os_prio=0 tid=0x00007f1ee411b000 nid=0xc97 in Object.wait() [0x00007f1eae29f000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x000000076e938550> (a java.lang.Object)
at ThreadStatesDemo$TimedWaitingThread.run(ThreadStatesDemo.java:41)
- locked <0x000000076e938550> (a java.lang.Object)
Locked ownable synchronizers:
- None
"running" #12 prio=5 os_prio=0 tid=0x00007f1ee4119000 nid=0xc96 runnable [0x00007f1eae3a0000]
java.lang.Thread.State: RUNNABLE
at ThreadStatesDemo$RunningThread.run(ThreadStatesDemo.java:29)
Locked ownable synchronizers:
- None
"sleeping" #11 prio=5 os_prio=0 tid=0x00007f1ee4117000 nid=0xc95 waiting on condition [0x00007f1eae4a1000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at ThreadStatesDemo$SleepingThread.run(ThreadStatesDemo.java:20)
Locked ownable synchronizers:
- None
"waiting" #10 prio=5 os_prio=0 tid=0x00007f1ee4115800 nid=0xc94 in Object.wait() [0x00007f1eae5a2000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x000000076e7fae38> (a java.lang.Object)
at java.lang.Object.wait(Object.java:502)
at ThreadStatesDemo$WaitingThread.run(ThreadStatesDemo.java:9)
- locked <0x000000076e7fae38> (a java.lang.Object)
Locked ownable synchronizers:
- None
In this stack trace you can see the thread state on the line:
java.lang.Thread.State: <THREAD STATE>
eg:
java.lang.Thread.State: WAITING (on object monitor)
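If you just want a quick tally of states without the tool, the state lines can be counted directly with standard shell utilities (a sketch; jstack.txt is assumed to be a dump saved earlier with jstack -l &lt;pid&gt; > jstack.txt):

```shell
# Count how many threads are in each state in a saved thread dump.
# The second field of each "java.lang.Thread.State:" line is the state name.
if [ -f jstack.txt ]; then
  grep 'java.lang.Thread.State' jstack.txt | awk '{print $2}' | sort | uniq -c | sort -rn
fi
```

A dump dominated by one state (for example, a large RUNNABLE count) is the same signal the table in the tool gives you.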
In the code above, a thread is created for each state (or multiple threads for the same state), representing how a particular thread state is reached.
Let's see:
Thread state RUNNABLE: a thread named running, whose code just spins in a loop.
Thread state WAITING: a thread named waiting, whose code calls wait() on an object.
Thread state BLOCKED: a thread named blocked, whose code tries to enter a synchronized block on an object whose lock is already taken by the thread named blocking (the blocking thread almost always takes the lock first because of the sleep in BlockedThread).
Thread state TIMED_WAITING: a thread named timed waiting, whose code calls wait() with a timeout (100000 ms) on an object, and a thread named sleeping, whose code calls sleep on the Thread class.
I think you now have an understanding of the Java thread states and how a thread can end up in each of them; there are also more ways to reach these states than the ones shown here.
In our previous blog we covered Filebeat and Metricbeat and explored the system module. In this blog we will see the usage of Heartbeat and how to monitor services using it.
Heartbeat should not be installed on each server you monitor; it should be installed on separate servers from which you can monitor all URLs/services. For example, if we have a service deployed at x.x.x.x:8000 on a server in the AWS north region, we can install Heartbeat on four servers, one in each AWS region (north, south, east, west), and monitor the service from all of them to check whether it is up from all of India.
From these four servers we can monitor all the service URLs.
For setting up Heartbeat, following is the link:
After that, on the Dashboard tab you can see the Heartbeat monitoring.
You can also use the Uptime app in Kibana to check status, TLS expiry time, and the history of all downtimes:
Following are some screenshots:
Configuration in heartbeat.yml for setting the name of the machine from which the URLs are pinged:
processors:
  - add_observer_metadata:
      # Optional, but recommended geo settings for the location Heartbeat is running in
      geo:
        # Token describing this location
        name: sahil-machine
        # Lat, Lon
        #location: "37.926868, -78.024902"
Configuration in heartbeat.yml for setting the monitored URLs:
heartbeat.config.monitors:
  # Directory + glob pattern to search for configuration files
  path: ${path.config}/monitors.d/*.yml
  # If enabled, heartbeat will periodically check the config.monitors path for changes
  reload.enabled: false
  # How often to check for changes
  reload.period: 5s

# Configure monitors inline
heartbeat.monitors:
- type: http
  # Set enabled to true (or delete the following line) to enable this example monitor
  enabled: false
  # ID used to uniquely identify this monitor in elasticsearch even if the config changes
  id: my-monitor
  # Human readable display name for this service in Uptime UI and elsewhere
  name: My Monitor
  # List of urls to query
  urls: ["http://localhost:9200"]
  # Configure task schedule
  schedule: '@every 10s'
  # Total test connection and data exchange timeout
  #timeout: 16s
  # Name of corresponding APM service, if Elastic APM is in use for the monitored service.
  #service.name: my-apm-service-name
- type: http
  # Set enabled to true (or delete the following line) to enable this example monitor
  enabled: true
  # ID used to uniquely identify this monitor in elasticsearch even if the config changes
  id: emerge-gurgaon
  # Human readable display name for this service in Uptime UI and elsewhere
  name: emerge-gurgaon
  # List of urls to query
  urls: ["https://app.ameyoemerge.in:8887/"]
  # Configure task schedule
  schedule: '@every 10s'
  # Total test connection and data exchange timeout
  #timeout: 16s
  # Name of corresponding APM service, if Elastic APM is in use for the monitored service.
  #service.name: my-apm-service-name
- type: http
  # Set enabled to true (or delete the following line) to enable this example monitor
  enabled: true
  # ID used to uniquely identify this monitor in elasticsearch even if the config changes
  id: emerge-banglore-app24
  # Human readable display name for this service in Uptime UI and elsewhere
  name: emerge-banglore-app24
  # List of urls to query
  urls: ["https://app24.ameyoemerge.in:8887/"]
  # Configure task schedule
  schedule: '@every 10s'
  # Total test connection and data exchange timeout
  #timeout: 16s
  # Name of corresponding APM service, if Elastic APM is in use for the monitored service.
  #service.name: my-apm-service-name
In the next blog we will explore Logstash with Filebeat. Happy debugging...
Nowadays many of our servers are deployed in the cloud and many applications run on those servers, so it is impossible to monitor and analyze logs by logging into each server. A central logging and monitoring solution is a must these days.
In this blog series, we will learn about the usage of the Elastic Stack, aka ELK.
Overview :
Elastic Stack is a group of open source products from Elastic designed to help users take data from any type of source and in any format and search, analyze, and visualize that data in real time. The product group was formerly known as ELK Stack, in which the letters in the name stood for the products in the group: ElasticSearch, Logstash and Kibana. A fourth product, Beats, was subsequently added to the stack, rendering the potential acronym unpronounceable. Elastic Stack can be deployed on premises or made available as Software as a Service
Architecture:
For a small-sized development environment, the classic architecture will look as follows :
There are many different types of Beats; you can read about them at https://www.elastic.co/beats/ . Each Beat has a different set of use cases.
In this blog we will learn about two Beats: Metricbeat and Filebeat.
Note: Logstash is an optional part of the architecture and should not be needed in most cases. Read more about Logstash at https://www.elastic.co/logstash/
Usage of the Elastic Stack:
I am running these experiments on a CentOS 7 machine and using rpm to set up the Elastic Stack.
If you are getting output like the above, it means Elasticsearch is installed successfully.
Note: To change the listen address and port, edit the following file: /etc/elasticsearch/elasticsearch.yml
Kibana :
Kibana is the front-end tool which communicates with Elasticsearch, and through which anyone can monitor and analyze logs.
Commands to install Kibana:
curl -L -O https://artifacts.elastic.co/downloads/kibana/kibana-7.14.0-linux-x86_64.tar.gz
tar xzvf kibana-7.14.0-linux-x86_64.tar.gz
cd kibana-7.14.0-linux-x86_64/
./bin/kibana
Note: edit config/kibana.yml for the listen port and IP address settings.
Beats
These will be installed on all servers from which we want to collect information. They are like agents which send data to Elasticsearch.
Enabling Metricbeat:
Every Beat supports different modules, and it is up to the user which modules to enable in each Beat. Metricbeat has many modules like System, Postgres, Nginx, and so on. In this blog we will see the usage of the System module of Metricbeat.
sudo metricbeat modules enable system
sudo metricbeat setup -e
sudo service metricbeat start
Here we are only enabling the system module of Metricbeat; there are many modules for basic monitoring of applications like PostgreSQL, Nginx, Tomcat, etc.
For the list of modules available in Metricbeat, the command is:
metricbeat modules list
Yippee! Now we can monitor system data in Kibana as follows.
Open the [Metricbeat System] Host overview ECS dashboard in the Dashboards tab of the Kibana UI. There you can apply a filter on the host whose data you want to see.
System module of Metricbeat: what analysis can be done with it?
Traditionally, after accessing Linux servers, we gather system information using many different commands and tools, which also takes time, especially when there is a live issue in production.
Following is the list of information it provides:
Size information of all partitions
Read/write performance of the hard disk
Inbound/outbound traffic analysis per Ethernet port
Load average analysis of the system
Top processes consuming high CPU and RAM
All this information can now be seen in seconds for a particular host using the Kibana UI.
Following are some screenshots :
Enabling Filebeat
Whether you're collecting from security devices, cloud, containers, hosts, or OT, Filebeat helps you keep the simple things simple by offering a lightweight way to forward and centralize logs and files.
Note: To configure where Filebeat sends data (Elasticsearch or Logstash), edit /etc/filebeat/filebeat.yml. Currently, as I have only one machine, there is no need to change the configuration; the default works for me. You can check the following link: https://www.elastic.co/guide/en/beats/filebeat/7.14/configuring-howto-filebeat.html
Enabling the system logs module in Filebeat:
filebeat modules enable system
(For system logs, if we want to set custom paths, edit the file /etc/filebeat/modules.d/system.yml.) Generally there is no need to change this config.
filebeat setup -e
sudo service filebeat start
Like Metricbeat, Filebeat also has a list of modules like Postgres and Nginx, and it also supports the logging of popular frameworks like Spring; it can collect the logs of these applications and provides ways to analyze them easily.
To check the list of modules available for Filebeat, use the following command:
[root@localhost elk]# filebeat modules list | less
System module Filebeat uses:
Now you can use the Kibana UI to analyze system logs like messages, etc.
Open the [Filebeat System] Syslog dashboard ECS in the Dashboard tab in Kibana.
Following are some screenshots:
Configure Filebeat for custom log files:
We may have a situation where none of the modules or framework logging integrations in Filebeat work for our custom application logs. In that case, you can configure an input manually with the path of the logs to read, and analyze them in the Logs and Stream section of the Kibana UI.
Here you can search the logs by hostname and file path, and can also search within the whole fetched message.
By default only the message column is shown; one can configure this in the settings tab of the Logs tab in Kibana.
Following are some screenshots:
By default log lines appear as a single column; if for advanced debugging we want to break a log line into columns, we need to use Logstash with the Grok filter.
In the next blog we will see the usage of Logstash to break custom logs into columns for better understanding.
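A manually configured input in /etc/filebeat/filebeat.yml might look like the sketch below (the path is a hypothetical placeholder; point it at your application's actual log location):

```yaml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    # Hypothetical path to your custom application's log files
    - /var/log/myapp/*.log
```

After restarting Filebeat, these lines show up in the Logs/Stream view in Kibana just like the module-collected logs.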
In our daily debugging we need to analyze log files of various products. Reading those log files is not an easy task; it requires special debugging skills, which can only be gained through experience or by god's grace. While debugging we might also need to extract data or play with a log file in ways that cannot be done by just reading it; for that, we need commands.
There are many commands in Linux used by debuggers, like grep, awk, sed, wc, taskset, ps, sort, uniq, cut, xargs, etc.
In this blog we will see practical grep command examples that are useful in real-world debugging on Linux. The examples in this blog are quite basic, but very useful in real life, and a beginner should read them to enhance their debugging skills.
Let's go to the practical part.
Grep the lines which contain some particular word
[root@localhost playground]# cat file1.log
hello
i am sahil
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# grep 'sahil' file1.log
i am sahil
sahil is a software engineer
Grep the number of lines matched for a particular word in a file
[root@localhost playground]# cat file1.log
hello
i am sahil
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# grep -c 'sahil' file1.log
2
Another way :
[root@localhost playground]# grep 'sahil' file1.log | wc -l
2
Grep all the lines which contain some word in a file, case-insensitively
[root@localhost playground]# cat file1.log
hello
i am sahil
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# grep -i 'sahil' file1.log
i am sahil
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]#
Grep the lines in which either of two words is present in a file
[root@localhost playground]# cat file1.log
hello
i am sahil
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# grep 'sahil\|software' file1.log
i am sahil
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]#
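The same either/or match can also be written with extended regular expressions via grep -E, which avoids escaping the | operator (shown here recreating the same sample file):

```shell
# Recreate the sample file used above, then match lines containing either word.
printf 'hello\ni am sahil\ni am software engineer\nSahil is a software engineer\nsahil is a software engineer\n' > file1.log
grep -E 'sahil|software' file1.log   # matches the same 4 lines as the \| form
```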
Grep lines in which two words are present
[root@localhost playground]# cat file1.log
hello
i am sahil
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# grep 'sahil' file1.log | grep 'software'
sahil is a software engineer
[root@localhost playground]# ^C
[root@localhost playground]#
Eliminate lines which contain some word
[root@localhost playground]# cat file1.log
hello
i am sahil
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# grep -v 'sahil' file1.log
hello
i am software engineer
Sahil is a software engineer
Eliminate case insensitively
[root@localhost playground]# grep -iv 'sahil' file1.log
hello
i am software engineer
[root@localhost playground]#
Matching the lines that start with a string
[root@localhost playground]# cat file1.log
hello
i am sahil
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# grep '^sahil' file1.log
sahil is a software engineer
Matching the lines that end with a string
[root@localhost playground]# cat file1.log
hello
i am sahil
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# grep 'engineer$' file1.log
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]#
Getting n lines after each match
[root@localhost playground]# cat file1.log
hello
i am sahil
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]#
[root@localhost playground]# grep 'hello' file1.log
hello
[root@localhost playground]# grep -A 1 'hello' file1.log
hello
i am sahil
[root@localhost playground]# grep -A 2 'hello' file1.log
hello
i am sahil
i am software engineer
Getting n lines before each match
[root@localhost playground]# cat file1.log
hello
i am sahil
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# grep 'i am sahil' file1.log
i am sahil
[root@localhost playground]# grep -B 1 'i am sahil' file1.log
hello
i am sahil
[root@localhost playground]# grep -B 2 'i am sahil' file1.log
hello
i am sahil
[root@localhost playground]#
Grep n lines after and m lines before every match
[root@localhost playground]# cat file1.log
hello
i am sahil
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# grep -A 2 -B 1 'i am sahil' file1.log
hello
i am sahil
i am software engineer
Sahil is a software engineer
[root@localhost playground]#
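When you want the same amount of context on both sides of a match, grep offers -C as a shorthand for combining -A and -B:

```shell
# Recreate the sample file, then show one line of context before and after each match.
printf 'hello\ni am sahil\ni am software engineer\nSahil is a software engineer\nsahil is a software engineer\n' > file1.log
grep -C 1 'i am sahil' file1.log   # same as: grep -A 1 -B 1 'i am sahil' file1.log
```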
Grep some word in more than one file in the current directory
[root@localhost playground]# cat file1.log
hello
i am sahil
i am software engineer
Sahil is a software engineer
sahil is a software engineer
[root@localhost playground]# cat file2.log
hello
i am sahil
i am tech blogger
Sahil is a tech blogger
sahil is a tech blogger
[root@localhost playground]# grep 'sahil' file1.log file2.log
file1.log:i am sahil
file1.log:sahil is a software engineer
file2.log:i am sahil
file2.log:sahil is a tech blogger
Grep some word in all files in the current directory
[root@localhost playground]# grep 'sahil' *
file1.log:i am sahil
file1.log:sahil is a software engineer
file2.log:i am sahil
file2.log:sahil is a tech blogger
[root@localhost playground]#
[root@localhost playground]# cat file3.log
time taken by api is 1211 ms
time taken by api is 2000 ms
time taken by api is 3000 ms
time taken by api is 4000 ms
time taken by api is 50000 ms
time taken by api is 123 ms
time taken by api is 213 ms
time taken by api is 456 ms
time taken by api is 1000 ms
Now suppose we want to grep all the lines in which the time taken by any API is more than 1 second, i.e. more than 1000 ms; that means the line should contain a number with at least 4 digits.
The grep command for this will be as follows:
[root@localhost playground]# grep -P '[0-9]{4} ms' file3.log
time taken by api is 1211 ms
time taken by api is 2000 ms
time taken by api is 3000 ms
time taken by api is 4000 ms
time taken by api is 50000 ms
time taken by api is 1000 ms
If we want to get only 5-digit numbers:
[root@localhost playground]# grep -P '[0-9]{5} ms' file3.log
time taken by api is 50000 ms
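Rather than running separate 4-digit and 5-digit greps, a single open-ended quantifier matches any duration of 1000 ms or more (shown here with grep -E; -P accepts the same quantifier):

```shell
# A small sample like file3.log above, then match durations with 4 or more digits.
printf 'time taken by api is 1211 ms\ntime taken by api is 123 ms\ntime taken by api is 50000 ms\ntime taken by api is 1000 ms\n' > file3.log
grep -E '[0-9]{4,} ms' file3.log   # matches 1211, 50000 and 1000, skips 123
```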
Recursively grep in a directory and its subdirectories
[root@localhost playground]# grep -R 'sahil' .
./dir1/file.log:i am sahil
./dir1/file.log:sahil is a software engineer
./file1.log:i am sahil
./file1.log:sahil is a software engineer
./file2.log:i am sahil
./file2.log:sahil is a tech blogger
[root@localhost playground]#
All the above are basic use cases of grep. One can mix grep's command options to achieve more complex use cases, and can also chain grep commands using the pipe operator.
In future blogs I will explain some complex use cases and show how to achieve them using Linux commands, which can ease log debugging.
Regarding the load average shown in Linux, there is a lot of confusion around questions like:
What is the load average shown in the top command?
What do these values represent?
When is it high and when is it low?
When should it be considered critical?
In which scenarios can it increase?
In this blog we will talk about the answers to all of these.
What are the three values shown in the above image?
The three numbers represent averages over progressively longer periods of time (one-, five-, and fifteen-minute averages); lower numbers are better. Higher numbers represent a problem or an overloaded machine.
Before getting into what is a good value, what is a bad value, and which factors can affect these values, we will understand them on a machine with a single-core processor.
The traffic analogy
A single-core CPU is like a single lane of traffic. Imagine you are a bridge operator … sometimes your bridge is so busy there are cars lined up to cross. You want to let folks know how traffic is moving on your bridge. A decent metric would be how many cars are waiting at a particular time. If no cars are waiting, incoming drivers know they can drive across right away. If cars are backed up, drivers know they’re in for delays.
So, Bridge Operator, what numbering system are you going to use? How about:
0.00 means there’s no traffic on the bridge at all. In fact, between 0.00 and 1.00 means there’s no backup, and an arriving car will just go right on.
1.00 means the bridge is exactly at capacity. All is still good, but if traffic gets a little heavier, things are going to slow down.
over 1.00 means there’s backup. How much? Well, 2.00 means that there are two lanes worth of cars total — one lane’s worth on the bridge, and one lane’s worth waiting. 3.00 means there are three lanes worth total — one lane’s worth on the bridge, and two lanes’ worth waiting. Etc.
Like the bridge operator, you’d like your cars/processes to never be waiting. So, your CPU load should ideally stay below 1.00. Also, like the bridge operator, you are still ok if you get some temporary spikes above 1.00 … but when you’re consistently above 1.00, you need to worry.
So you’re saying the ideal load is 1.00?
Well, not exactly. The problem with a load of 1.00 is that you have no headroom. In practice, many sysadmins will draw a line at 0.70:
But nowadays we have systems with multiple cores or multiple processors.
Got a quad-processor system? It’s still healthy with a load of 3.00.
On a multi-processor system, the load is relative to the number of processor cores available. The “100% utilization” mark is 1.00 on a single-core system, 2.00 on a dual-core, 4.00 on a quad-core, etc.
If we go back to the bridge analogy, the “1.00” really means “one lane’s worth of traffic”. On a one-lane bridge, that means it’s filled up. On a two-lane bridge, a load of 1.00 means it’s at 50% capacity — only one lane is full, so there’s another whole lane that can be filled.
Same with CPUs: a load of 1.00 is 100% CPU utilization on a single-core box. On a dual-core box, a load of 2.00 is 100% CPU utilization.
Which leads us to two new Rules of Thumb:
The “number of cores = max load” Rule of Thumb: on a multicore system, your load should not exceed the number of cores available.
The “cores is cores” Rule of Thumb: How the cores are spread out over CPUs doesn’t matter. Two quad-cores == four dual-cores == eight single-cores. It’s all eight cores for these purposes.
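The "number of cores = max load" rule can be checked directly from the shell on Linux (a sketch; nproc and /proc/loadavg are standard on Linux systems):

```shell
# Compare the 1-minute load average against the number of available cores.
cores=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)   # first field is the 1-minute average
echo "1-min load: $load1 / cores: $cores"
# Print a warning only when the 1-minute load exceeds the core count.
awk -v l="$load1" -v c="$cores" 'BEGIN { if (l > c) print "load exceeds core count" }'
```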
But what should we conclude if the load goes beyond the number of cores: are we actually short of cores?
Not exactly. For that we need to further analyze the top command output to reach a conclusion.
In the above output, the coloured part shows the CPU used by user processes (us) and by system processes (sy). If these values are around 99-100%, it means there is a crunch of CPU cores on your system, or some process is consuming too much CPU. In that case, either increase the cores or optimize the application that is consuming more CPU.
Now let’s take another scenario :
In the above image, the coloured part shows the amount of time the CPU is waiting on input/output (I/O). If this value goes above, say, 80%, the load average on the server will also increase. It means either your disk read/write speed is slow, or your applications are reading/writing more than the system can handle. In that case, either diagnose your hard disk read/write speed or check why your application is reading/writing so much.
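The wa number shown by top comes from the kernel's cumulative iowait counter, which you can read directly from /proc/stat on Linux:

```shell
# On the aggregate "cpu" line of /proc/stat the fields are:
# user nice system idle iowait irq softirq steal ...
# Field 6 after the "cpu" label is the cumulative iowait tick count.
awk '/^cpu /{print "cumulative iowait ticks:", $6}' /proc/stat
```

top computes the wa percentage from the change in this counter between samples.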
Now let’s take one more scenario :
If values Above coloured output goes beyond certain limit , it means softirq(si) are consuming cpu . it can be due to network/disk interrupts . Either they are not getting enough CPU or there is some misconfiguration in you ethernet ports due to which interrupts are not able to handle packets receiving or transmitting . These types of problem occurs more on VM environment rather than physical machine .
Now, let’s take one last scenario:
This last part will help you in a Virtual Machine environment. If %st (steal time) rises beyond a certain limit, say it stays above 50%, it means you are getting only half of the CPU time from the base machine; someone else is consuming your CPU time, as you may be on shared CPU infrastructure. In this case too, the load can increase.
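All of the scenarios above (us, sy, wa, si, st) come from the same per-mode CPU counters that top samples from /proc/stat. As a rough sketch, assuming the Linux field order documented in proc(5) (user, nice, system, idle, iowait, irq, softirq, steal), you can compute the same percentages yourself:

```python
import time

# /proc/stat "cpu" line fields, in the order documented in proc(5):
# user nice system idle iowait irq softirq steal ...
NAMES = ["us", "ni", "sy", "id", "wa", "hi", "si", "st"]

def cpu_snapshot():
    """Read the cumulative jiffy counters for the first 8 CPU modes."""
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:9]]

def cpu_percentages(interval=0.5):
    """Sample twice and return each mode's share of CPU time over the
    interval, like the %Cpu(s) row that top prints."""
    before = cpu_snapshot()
    time.sleep(interval)
    after = cpu_snapshot()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    return {n: 100.0 * d / total for n, d in zip(NAMES, deltas)}

if __name__ == "__main__":
    pct = cpu_percentages()
    print("  ".join(f"{v:.1f} {n}" for n, v in pct.items()))
```

If "wa", "si" or "st" dominates this output, you are in one of the I/O, softirq or steal-time scenarios described above rather than a plain CPU crunch.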
I hope this covers the major scenarios in which load average can increase.
There are still many open points here, like how to check read/write speed, how to check which application is consuming the most CPU, and how to check which interrupts are causing problems. For all of these, stay tuned to hello worlds.
There are many types of queries, and of course all are important, but when it comes to monitoring and debugging we are generally in a hurry and do not remember what to query to check the problem. At that moment we wonder what we can look for and what the query for it would be; we search the internet, somehow put together some queries, and after the issue is over we forget them. In this blog I will describe some queries and their uses, especially for monitoring and debugging PostgreSQL-related issues.
In case of monitoring and debugging, we generally have the following categories:
Size Related Issues
Performance Related issues
Sometimes these issues are interlinked.
Size Related Issues
The following are basic Postgres commands/queries which are very helpful in monitoring or debugging size-related issues:
How to check the size of each database: many times we need to find which database is the culprit, i.e. which database is consuming the most space. The following query lists databases in decreasing order of the size they consume:
SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) AS size FROM pg_database ORDER BY pg_database_size(pg_database.datname) desc;
After that, we may need to find which table/index is consuming the most space. For this we have the following query:
SELECT
nspname || '.' || relname AS "relation", pg_size_pretty(pg_relation_size(C.oid)) AS "size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(C.oid) DESC
LIMIT 20;
Many times there are not many records in a table, yet the size of the table/index keeps increasing due to dead tuples. To check which tables have the most dead tuples, the following is the query:
First connect to the database, then:
SELECT schemaname,relname,n_live_tup,n_dead_tup,last_vacuum,last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC limit 10;
From the output above, we can also determine whether autovacuum is running properly, i.e. when autovacuum last ran on any particular table whose dead-tuple count is high.
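For context, with default settings autovacuum considers a table due for vacuuming once dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × live tuples (50 and 0.2 by default). A tiny Python sketch of that check, fed with the n_live_tup/n_dead_tup columns from the query above (the sample numbers are made up):

```python
def autovacuum_due(n_live_tup, n_dead_tup, threshold=50, scale_factor=0.2):
    """True once dead tuples exceed the default autovacuum trigger:
    autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * live."""
    return n_dead_tup > threshold + scale_factor * n_live_tup

# e.g. with 1000 live rows, vacuum becomes due once dead rows pass 50 + 200 = 250
print(autovacuum_due(1000, 500))  # a table autovacuum should already have handled
print(autovacuum_due(1000, 100))  # still under the trigger
```

So a table sitting far above this trigger with an old (or NULL) last_autovacuum timestamp is a sign that autovacuum is not keeping up.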
Over time, due to MVCC, your table will grow in size (called table bloat); this is why a regular VACUUM is needed. The following query shows the tables and indexes with the most bloat.
The value represents the number of “wasted bytes”: the difference between what is actually used by the table and index, and what we compute it should use.
It works by estimating the optimal size of the table/index from the row sizes times the total number of rows, and comparing that against the actual table size. Do note that this is an estimate, not an exact figure.
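To make the estimate concrete, here is a toy Python version of the core arithmetic (the real query derives row widths from pg_stats; the table numbers below are hypothetical): optimal pages = ceil(rows × avg row bytes / (block size − page overhead)), and wasted bytes = (actual − optimal) pages × block size.

```python
import math

BLOCK_SIZE = 8192    # default Postgres page size
PAGE_OVERHEAD = 20   # per-page overhead assumed by the bloat query

def bloat_estimate(actual_pages, reltuples, avg_row_bytes):
    """Toy version of the bloat query's arithmetic: how many pages the
    live rows *should* need, versus how many pages the table really has."""
    optimal_pages = math.ceil(
        reltuples * avg_row_bytes / (BLOCK_SIZE - PAGE_OVERHEAD)
    )
    tbloat = actual_pages / optimal_pages if optimal_pages else 0.0
    wasted_bytes = max(0, actual_pages - optimal_pages) * BLOCK_SIZE
    return tbloat, wasted_bytes

# hypothetical table: 10k rows of ~100 bytes sitting in 1000 pages
tbloat, wasted = bloat_estimate(actual_pages=1000, reltuples=10_000,
                                avg_row_bytes=100)
print(f"tbloat={tbloat:.1f}, wastedbytes={wasted}")
```

The SQL below does this same comparison, just with row widths and header sizes computed from catalog statistics instead of hard-coded guesses.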
with foo as (
SELECT
schemaname, tablename, hdr, ma, bs,
SUM((1-null_frac)*avg_width) AS datawidth,
MAX(null_frac) AS maxfracsum,
hdr+(
SELECT 1+COUNT(*)/8
FROM pg_stats s2
WHERE null_frac<>0 AND s2.schemaname = s.schemaname AND s2.tablename = s.tablename
) AS nullhdr
FROM pg_stats s, (
SELECT
(SELECT current_setting('block_size')::NUMERIC) AS bs,
CASE WHEN SUBSTRING(v,12,3) IN ('8.0','8.1','8.2') THEN 27 ELSE 23 END AS hdr,
CASE WHEN v ~ 'mingw32' THEN 8 ELSE 4 END AS ma
FROM (SELECT version() AS v) AS foo
) AS constants
GROUP BY 1,2,3,4,5
), rs as (
SELECT
ma,bs,schemaname,tablename,
(datawidth+(hdr+ma-(CASE WHEN hdr%ma=0 THEN ma ELSE hdr%ma END)))::NUMERIC AS datahdr,
(maxfracsum*(nullhdr+ma-(CASE WHEN nullhdr%ma=0 THEN ma ELSE nullhdr%ma END))) AS nullhdr2
FROM foo
), sml as (
SELECT
schemaname, tablename, cc.reltuples, cc.relpages, bs,
CEIL((cc.reltuples*((datahdr+ma-
(CASE WHEN datahdr%ma=0 THEN ma ELSE datahdr%ma END))+nullhdr2+4))/(bs-20::FLOAT)) AS otta,
COALESCE(c2.relname,'?') AS iname, COALESCE(c2.reltuples,0) AS ituples, COALESCE(c2.relpages,0) AS ipages,
COALESCE(CEIL((c2.reltuples*(datahdr-12))/(bs-20::FLOAT)),0) AS iotta -- very rough approximation, assumes all cols
FROM rs
JOIN pg_class cc ON cc.relname = rs.tablename
JOIN pg_namespace nn ON cc.relnamespace = nn.oid AND nn.nspname = rs.schemaname AND nn.nspname <> 'information_schema'
LEFT JOIN pg_index i ON indrelid = cc.oid
LEFT JOIN pg_class c2 ON c2.oid = i.indexrelid
)
SELECT
current_database(), schemaname, tablename, /*reltuples::bigint, relpages::bigint, otta,*/
ROUND((CASE WHEN otta=0 THEN 0.0 ELSE sml.relpages::FLOAT/otta END)::NUMERIC,1) AS tbloat,
CASE WHEN relpages < otta THEN 0 ELSE bs*(sml.relpages-otta)::BIGINT END AS wastedbytes,
iname, /*ituples::bigint, ipages::bigint, iotta,*/
ROUND((CASE WHEN iotta=0 OR ipages=0 THEN 0.0 ELSE ipages::FLOAT/iotta END)::NUMERIC,1) AS ibloat,
CASE WHEN ipages < iotta THEN 0 ELSE bs*(ipages-iotta) END AS wastedibytes
FROM sml
ORDER BY wastedbytes DESC;
Sample output:
current_database | schemaname | tablename | tbloat | wastedbytes | iname | ibloat | wastedibytes
dashboard | public | jobs | 1.1 | 2013282304 | index_jobs_on_status | 0.0 | 0
dashboard | public | jobs | 1.1 | 2013282304 | index_jobs_on_tag | 0.3 | 0
dashboard | public | jobs | 1.1 | 2013282304 | index_jobs_on_tenant_id | 0.2 | 0
dashboard | public | jobs | 1.1 | 2013282304 | index_jobs_on_created_at | 0.2 | 0
dashboard | public | jobs | 1.1 | 2013282304 | index_jobs_on_created_at_queued_or_running | 0.0 | 21086208
You should be looking at:
tbloat: table bloat, the ratio between the table’s current size and what it can be optimized to.
wastedbytes: number of bytes wasted
ibloat & wastedibytes: same as above, but for indexes.
When you see a table with high bloat, consider running VACUUM ANALYZE on it.
Performance Related issues
For performance-related monitoring, we need to run the following queries very often in day-to-day scenarios.
Get Running Queries (And Lock statuses) in PostgreSQL
SELECT S.pid, age(clock_timestamp(), query_start),usename,query,L.mode,L.locktype,L.granted,s.datname FROM pg_stat_activity S inner join pg_locks L on S.pid = L.pid order by L.granted, L.pid DESC;
The query above may help if a particular query has been running for hours and has taken a lock due to which other queries are stuck waiting on that lock. It also shows the database on which each query is running. We can also spot connections that are open with no query running, which may indicate a session leak in your database (for more about session leaks, read the following article).
To cancel a particular query that is causing the issue, the following is the query:
SELECT pg_cancel_backend(pid);
This sends a SIGINT to the backend process with that pid, cancelling its current query but leaving the connection open. To terminate the connection itself, use pg_terminate_backend(pid), which sends a SIGTERM instead.
I hope this helps you at least some of the time in your debugging. Please comment and follow me.