# Load Average in Linux Servers – Confusion Solved

Regarding the load average shown in Linux, there is a lot of confusion, such as:

• What is the load average shown in the top command?
• What do these values represent?
• When is it high, and when is it low?
• When should it be considered critical?
• In which scenarios can it increase?

In this blog we will answer all of these questions.

What are the three values shown in the image above?

The three numbers represent averages over progressively longer periods of time (one-, five-, and fifteen-minute averages). Lower numbers are better; higher numbers indicate a problem or an overloaded machine.

Before getting into what counts as a good value or a bad value, and what factors can affect these values, let us first understand them on a machine with a single-core processor.

## The traffic analogy

A single-core CPU is like a single lane of traffic. Imagine you are a bridge operator … sometimes your bridge is so busy there are cars lined up to cross. You want to let folks know how traffic is moving on your bridge. A decent metric would be how many cars are waiting at a particular time. If no cars are waiting, incoming drivers know they can drive across right away. If cars are backed up, drivers know they’re in for delays.

So, Bridge Operator, what numbering system are you going to use? How about:

• 0.00 means there’s no traffic on the bridge at all. In fact, between 0.00 and 1.00 means there’s no backup, and an arriving car will just go right on.
• 1.00 means the bridge is exactly at capacity. All is still good, but if traffic gets a little heavier, things are going to slow down.
• over 1.00 means there’s backup. How much? Well, 2.00 means that there are two lanes worth of cars total — one lane’s worth on the bridge, and one lane’s worth waiting. 3.00 means there are three lanes worth total — one lane’s worth on the bridge, and two lanes’ worth waiting. Etc.

Like the bridge operator, you’d like your cars/processes to never be waiting. So, your CPU load should ideally stay below 1.00. Also, like the bridge operator, you are still ok if you get some temporary spikes above 1.00 … but when you’re consistently above 1.00, you need to worry.

## So you’re saying the ideal load is 1.00?

Well, not exactly. The problem with a load of 1.00 is that you have no headroom. In practice, many sysadmins draw the line at 0.70.

But nowadays we have many multi-core or multi-processor systems.

Got a quad-processor system? It’s still healthy with a load of 3.00.

On a multi-processor system, the load is relative to the number of processor cores available. The “100% utilization” mark is 1.00 on a single-core system, 2.00 on a dual-core, 4.00 on a quad-core, etc.

If we go back to the bridge analogy, the “1.00” really means “one lane’s worth of traffic”. On a one-lane bridge, that means it’s filled up. On a two-lane bridge, a load of 1.00 means it’s at 50% capacity — only one lane is full, so there’s another whole lane that can be filled.

Same with CPUs: a load of 1.00 is 100% CPU utilization on a single-core box. On a dual-core box, a load of 2.00 is 100% CPU utilization.

Which leads us to two new Rules of Thumb:

• The “number of cores = max load” Rule of Thumb: on a multicore system, your load should not exceed the number of cores available.
• The “cores is cores” Rule of Thumb: How the cores are spread out over CPUs doesn’t matter. Two quad-cores == four dual-cores == eight single-cores. It’s all eight cores for these purposes.
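The two rules of thumb boil down to dividing the load by the core count. As a quick sketch (assuming a Linux box with `/proc/loadavg` and `nproc` available), you can normalise the 1-minute load average per core:

```shell
#!/bin/sh
# Sketch, assuming Linux with /proc/loadavg and nproc available:
# normalise the 1-minute load average by the number of CPU cores.
# A per-core value above 1.00 means more work than the cores can handle.
per_core_load() {  # $1 = load average, $2 = number of cores
  awk -v l="$1" -v c="$2" 'BEGIN { printf "%.2f", l / c }'
}

cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
echo "1-min load $load1 over $cores core(s) = $(per_core_load "$load1" "$cores") per core"
```

So a load of 3.00 on a quad-core box normalises to 0.75 per core, which is healthy.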

But what should we conclude if the load goes beyond the number of cores? Are we actually short of cores?

Well, not exactly. To reach a conclusion, we need to dig further into the top command's output.

In the output above, the highlighted part shows the CPU used by user processes (us) and by system processes (sy). If these values are around 99-100%, there is a crunch of CPU cores on your system, or some process is consuming too much CPU. In that case, either add cores or optimise the application that is consuming the CPU.
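As a sketch (the field layout is assumed from procps-ng top's `%Cpu(s):` summary line), you can pull these values out of a batch run of top:

```shell
#!/bin/sh
# Sketch, assuming procps-ng top's "%Cpu(s):" summary line layout:
# extract a named field (us, sy, wa, si, st, ...) from that line.
cpu_field() {  # $1 = the %Cpu(s) line, $2 = field name
  printf '%s\n' "$1" | awk -v f="$2" '{
    for (i = 2; i <= NF; i++)
      if ($i == f || $i == f",") print $(i-1)  # value precedes its label
  }'
}

# Live usage would be: line=$(top -bn1 | grep -m1 'Cpu(s)')
# Demo on a captured sample line:
line='%Cpu(s):  2.3 us,  1.1 sy,  0.0 ni, 96.4 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st'
echo "user: $(cpu_field "$line" us)%  system: $(cpu_field "$line" sy)%"
# prints: user: 2.3%  system: 1.1%
```

The same helper reads the wa, si, and st fields discussed in the other scenarios.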

Now let’s take another scenario :

In the image above, the highlighted part shows the amount of time the CPU spends waiting on input/output (I/O). If this value goes above, say, 80%, the load average on the server will also increase. It means either your disk read/write speed is slow, or your applications are reading/writing more than the system can handle. In this case, either diagnose your hard disk's read/write speed or check why your application is reading/writing so much.

Now let’s take one more scenario :

If the values in the highlighted output above go beyond a certain limit, it means software interrupts (si) are consuming CPU. This can be due to network or disk interrupts: either they are not getting enough CPU, or there is some misconfiguration in your ethernet ports because of which the interrupts cannot keep up with packets being received or transmitted. These problems occur more often in VM environments than on physical machines.

Now, let's take one last scenario:

This part is relevant in virtual machine environments.

If %st (steal time) stays above a certain limit, say more than 50%, it means you are getting only half of the CPU time from the host machine and someone else is consuming your CPU time, as you may be on shared CPU infrastructure. In this case too, the load can increase.

I hope this covers all the major scenarios in which the load average can increase.

There are still many open questions here, like how to check read/write speed, how to check which application is consuming more CPU, and how to check which interrupts are causing problems. For all of these, stay tuned to Hello Worlds.

# Checklists – System is Hacked – Part 2 – Preventive Steps for Infra (OS Hardening)

In the last article we described a list of checks that can determine whether a system is compromised or hacked. In this article we will talk about the preventive steps (especially infra-related ones) that can be taken to avoid hacking and to make a system more secure. There are many directions in which we can secure our application, as follows:

• OS hardening (infra-level security)
• Secure coding guidelines
• Encryption of sensitive data
• Ensuring no vulnerabilities exist in the system

In this blog we will be concerned with OS hardening (infra-level security) on Linux systems (CentOS/RedHat). We will cover the other parts in future blogs.

Now let's get to the system part. The following things need to be taken care of:

• SSH configuration:
• On Linux systems the default SSH port is 22. This default port should be changed to some unused port to enhance security.
• Use SSH protocol version 2.
• Ensure SSH X11 forwarding is disabled.
• Port configuration at the firewall: generally, an application consists of many services running on a set of servers, each on a different port. For example:
• Application server on port 8080
• Database server on port 5432

So, as in the above case, users need to log in through port 8080, so only this port should be opened to the public. Since the database generally needs to interact only with the application server, port 5432 should be allowed only from the application server's IP.
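With firewalld (the CentOS/RedHat default firewall), this policy could look roughly like the following sketch; the application server IP 10.0.0.5 is purely illustrative:

```shell
# Sketch of the port policy above using firewalld (CentOS/RedHat default).
# The source address 10.0.0.5 is an illustrative assumption.

# Open the application port 8080 to everyone:
firewall-cmd --permanent --add-port=8080/tcp

# Allow the database port 5432 only from the app server's IP:
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.0.0.5" port port="5432" protocol="tcp" accept'

# Reload so the permanent rules take effect:
firewall-cmd --reload
```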

• Multi-factor authentication for SSH should be enabled: for setting up Google Authenticator on CentOS or RedHat you can follow the link
• Root login must be disabled on every server
• Ensure password expiration is 365 days or less
• Ensure the minimum number of days between password changes is 7 or more
• Ensure the password expiration warning period is 7 days or more
• Ensure the inactive password lock is 30 days or less
• Ensure passwords are strong enough when users change their passwords
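On CentOS/RedHat, the aging rules above map to /etc/login.defs (for newly created users) and the chage command (for existing ones). A sketch, with an illustrative username:

```shell
# Defaults for newly created users, in /etc/login.defs:
#   PASS_MAX_DAYS   365
#   PASS_MIN_DAYS   7
#   PASS_WARN_AGE   7

# Apply the same policy to an existing user; "deploy" is illustrative:
chage --maxdays 365 --mindays 7 --warndays 7 deploy

# Lock new accounts 30 days after password expiry (default INACTIVE value):
useradd -D -f 30
```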
• Application and database should be on different servers: if the application is hacked due to some vulnerability, access to the database is still protected in that case.
• Regular package updates: configure auto-updates or regularly update packages on all configured servers.
• Tune network kernel parameters:
• IP forwarding should be disabled on all servers. Add the following entry to sysctl.conf:
• net.ipv4.ip_forward = 0
• Packet redirecting should be disabled on all servers. Add the following entries to sysctl.conf:
• net.ipv4.conf.all.send_redirects = 0
• net.ipv4.conf.default.send_redirects = 0
• SELinux should be enabled and configured.
• Antivirus must be installed on all servers.
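The ip_forward and send_redirects entries above can be applied without a reboot; a sketch, assuming root on a CentOS/RedHat box:

```shell
# Append the hardening entries to /etc/sysctl.conf and reload them.
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.ip_forward = 0
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
EOF

sysctl -p                      # reload settings from /etc/sysctl.conf
sysctl net.ipv4.ip_forward     # verify the value took effect
```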

All of the above are the basic minimum checklists that should be applied to every server in any production environment. For implementing in-depth OS hardening, especially for CentOS-based systems, one needs to follow the latest CIS CentOS Benchmark.

You can also check the benchmark list below from CIS for CentOS hardening; the document also explains how to implement these things on CentOS.

In future blogs we will explain the other parts, like secure coding guidelines, encryption, VAPT scans, etc., to make the system more secure.

Stay tuned .

# Tune Linux Kernel Parameters For PostgreSQL Optimization and better System Performance

## Introduction

In my previous article I explained Tuning PostgreSQL Database Memory Configuration Parameters to Optimize Performance, and as I said there, database performance does not depend only on PostgreSQL configuration but also on system parameters. Poorly configured OS kernel parameters can cause degradation in database server performance. Therefore, it is imperative that these parameters are configured according to the database server and its workload. In this article I will be talking specifically about CentOS/RedHat Linux systems.

## Story

I will start the article with a small story. One of our clients had a huge amount of writes, and the customer had provided us 200 GB of RAM for that dedicated database server, so there was no shortage of resources.

What was happening was that after some time the system load would increase a lot, and on debugging we found no special query running around the time the load increased. Somewhere on the internet we found that if we cleared the system cache regularly, the issue would be resolved.

We then scheduled a cron job to clear the system cache at regular intervals, and the issue was resolved.

Now, the question is: why did the issue stop occurring after this?

The answer is that, due to the large cache size (as we had so much RAM available), lots of data accumulated in RAM (many GBs), and when this whole data was flushed out to disk, the system load became high at that time.

So from that we came to know that we also need to tune some system parameters to optimize system and database (PostgreSQL) performance.

In the above case we tuned two OS parameters, vm.dirty_background_ratio and vm.dirty_ratio, to resolve the issue.

## Kernel parameters Tuning

Now, what values did we set for the two parameters described in the story above, and what are the other important Linux kernel parameters that can affect database server performance? They are described as follows:

### vm.dirty_background_ratio / vm.dirty_background_bytes

The vm.dirty_background_ratio is the percentage of memory filled with dirty pages that need to be flushed to disk. Flushing is done in the background. The value of this parameter ranges from 0 to 100; however, a value lower than 5 may not be effective and some kernels do not internally support it. The default value is 10 on most Linux systems. You can gain performance for write-intensive operations with a lower ratio, which means that Linux flushes dirty pages in the background.

You need to set a value of vm.dirty_background_bytes depending on your disk speed.

There are no “good” values for these two parameters since both depend on the hardware. However, setting vm.dirty_background_ratio to 5 and vm.dirty_background_bytes to 25% of your disk speed improves performance by up to ~25% in most cases.

### vm.dirty_ratio / vm.dirty_bytes

This is the same as vm.dirty_background_ratio / vm.dirty_background_bytes, except that the flushing is done in the foreground, blocking the application. So vm.dirty_ratio should be higher than vm.dirty_background_ratio; this ensures that background flushing kicks in before foreground flushing, avoiding blocking the application as much as possible. You can tune the difference between the two ratios depending on your disk I/O.
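There are no universal values here; as a sketch only (the numbers are illustrative and workload-dependent, not the exact ones from the story), a write-heavy box with lots of RAM might be tuned like this:

```shell
# Illustrative values only - tune against your own disks and workload.
# Start background flushing early (5% dirty) and cap the foreground
# (blocking) limit well above it (15%).
cat >> /etc/sysctl.conf <<'EOF'
vm.dirty_background_ratio = 5
vm.dirty_ratio = 15
EOF
sysctl -p
```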

### vm.swappiness

vm.swappiness is another kernel parameter that can affect database performance. This parameter controls the swappiness behavior (moving pages between RAM and swap space) on a Linux system. The value ranges from 0 to 100. It controls how much memory will be swapped or paged out: 0 effectively disables swapping, and 100 means aggressive swapping.

You may get good performance by setting lower values.

Setting a value of 0 in newer kernels may cause the OOM Killer (out of memory killer process in Linux) to kill the process. Therefore, you can be on the safe side and set the value to 1 if you want to minimize swapping. The default value on a Linux system is 60. A higher value causes the MMU (memory management unit) to utilize more swap space than RAM, whereas a lower value preserves more data/code in memory.

A smaller value is a good bet to improve performance in PostgreSQL.

### vm.overcommit_memory / vm.overcommit_ratio

Applications acquire memory and free it when it is no longer needed. But in some cases an application acquires too much memory and does not release it. This can invoke the OOM killer. Here are the possible values of the vm.overcommit_memory parameter, with a description of each:

• 0: Heuristic overcommit (the default); the kernel decides based on its heuristics.
• 1: Always allow overcommit.
• 2: Don't overcommit beyond the overcommit ratio.

vm.overcommit_ratio is the percentage of RAM that is available for overcommitment; when vm.overcommit_memory is 2, the commit limit is the swap space plus this percentage of RAM. A value of 50% on a system with 2 GB of RAM (and 2 GB of swap) may commit up to 3 GB of RAM.
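That arithmetic comes from the commit limit formula enforced when vm.overcommit_memory is 2: CommitLimit = SwapTotal + MemTotal × overcommit_ratio / 100. A quick sketch, working in kB:

```shell
#!/bin/sh
# CommitLimit (kB) = SwapTotal + MemTotal * overcommit_ratio / 100,
# the limit enforced when vm.overcommit_memory = 2.
commit_limit_kb() {  # $1 = MemTotal kB, $2 = SwapTotal kB, $3 = overcommit_ratio
  echo $(( $2 + $1 * $3 / 100 ))
}

# 2 GB RAM + 2 GB swap at a ratio of 50 -> 3 GB commit limit:
echo "$(commit_limit_kb 2097152 2097152 50) kB"
# prints: 3145728 kB
```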

A value of 2 for vm.overcommit_memory yields better performance for PostgreSQL. This value maximizes RAM utilization by the server process without any significant risk of it being killed by the OOM killer. An application will still be able to overcommit, but only within the overcommit ratio, thus reducing the risk of the OOM killer killing the process. Hence a value of 2 gives better performance than the default of 0, and it improves reliability by ensuring that memory beyond the allowable range is not overcommitted.

On systems without swap, one may experience problems when vm.overcommit_memory is set to 2.

https://www.postgresql.org/docs/current/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

Generally speaking, almost all applications that use a lot of memory depend on this parameter. For example, for Redis, setting this value to 1 is best.

### Turn On Huge Pages

Linux, by default, uses 4K memory pages, BSD has Super Pages, whereas Windows has Large Pages. A page is a chunk of RAM that is allocated to a process. A process may own more than one page depending on its memory requirements. The more memory a process needs the more pages that are allocated to it. The OS maintains a table of page allocation to processes. The smaller the page size, the bigger the table, the more time required to look up a page in that page table. Therefore, huge pages make it possible to use a large amount of memory with reduced overheads; fewer page lookups, fewer page faults, faster read/write operations through larger buffers. This results in improved performance.

PostgreSQL has support for bigger pages on Linux only. By default, Linux uses 4K memory pages, so in cases where there are too many memory operations, there is a need to set bigger pages. Performance gains have been observed by using huge pages with sizes from 2 MB up to 1 GB. The huge page size can be set at boot time. You can easily check the huge page settings and utilization on your Linux box using the `cat /proc/meminfo | grep -i huge` command.

Get HugePage Info – On Linux (only)

In this example, although the huge page size is set to 2,048 kB (2 MB), the total number of huge pages is 0, which signifies that huge pages are disabled.

#### Script to quantify Huge Pages

This is a simple script which returns the number of huge pages required. Execute the script on your Linux box while PostgreSQL is running. Ensure that the \$PGDATA environment variable is set to PostgreSQL's data directory.

Get Number of Required HugePages
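The script in the referenced Percona post essentially divides PostgreSQL's peak memory use (VmPeak of the postmaster) by the huge page size; a rough sketch (not the verbatim script):

```shell
#!/bin/sh
# Sketch after the script in the referenced Percona post: estimate the
# number of huge pages from the postmaster's VmPeak and the page size.
pages_needed() {  # $1 = VmPeak in kB, $2 = huge page size in kB
  echo $(( $1 / $2 ))
}

if [ -n "$PGDATA" ] && [ -f "$PGDATA/postmaster.pid" ]; then
  pid=$(head -1 "$PGDATA/postmaster.pid")
  peak=$(awk '/^VmPeak/ { print $2 }' "/proc/$pid/status")
  hps=$(awk '/^Hugepagesize/ { print $2 }' /proc/meminfo)
  echo "Set vm.nr_hugepages to: $(pages_needed "$peak" "$hps")"
else
  echo "PGDATA not set or PostgreSQL not running; skipping live check."
fi
```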

The output of the script looks like this:

Script Output

The recommended number of huge pages here is 88; therefore, you should set the value to 88.

Command to set HugePages:
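With the value from the script (88 in this example), the setting could be applied like this (a sketch; run as root):

```shell
# Set the number of huge pages at runtime...
sysctl -w vm.nr_hugepages=88

# ...and persist it across reboots:
echo 'vm.nr_hugepages = 88' >> /etc/sysctl.conf
```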

Check the huge pages now; you will see that no huge pages are in use yet (HugePages_Free = HugePages_Total).

Again Get HugePage Info – On Linux (only)

Now set the parameter huge_pages to 'on' in \$PGDATA/postgresql.conf and restart the server.

And Again Get HugePage Info – On Linux (only)

Now you can see that a few of the huge pages are in use. Let's now try to add some data to the database.

Some DB Operations to Utilise HugePages

Let’s see if we are now using more huge pages than before.

Once More Get HugePage Info – On Linux (only)

Now you can see that most of the huge pages are in use.

Note: The sample value for HugePages used here is very low, which is not a normal value for a big production machine. Please assess the required number of pages for your system and set those accordingly depending on your system’s workload and resources.

Now, tuning PostgreSQL parameters and kernel parameters is not enough for good PostgreSQL performance; there are many other things, like:

• How you write your queries
• Proper indexing: for this you can follow the indexing series on our blog
• Proper partitioning and sharding according to the business use case
• and many more.

Stay tuned for more blogs on optimizing PostgreSQL performance.

References: https://www.percona.com/blog/2018/08/29/tune-linux-kernel-parameters-for-postgresql-optimization/

# Checklists – System is Compromised or Hacked – Part 1

## Introduction

In my previous blog I explained how I came to know that my system was hacked or compromised (link here). Here in this blog I will explain the basic things we can check on our system when we suspect that it is compromised.

This blog has 3 parts:

• List of checks which can determine if a system is compromised or hacked – Part 1
• List of checks which can give a direction as to how the system was compromised or hacked – Part 2
• Preventive steps (especially infra-related) that can be taken to avoid hacking or to make the system more secure – Part 3

Here I am assuming the system is a Linux system with CentOS installed.

## List of checks which can determine if a system is compromised or hacked

• Generally, when a hacker breaks into a Linux system, there is a high chance they will alter your main packages like openssh, the kernel, etc. So first, please check whether these packages have been altered or whether there are changes in the files or binaries provided by these packages. The following is the command to check on CentOS:
• `sudo rpm -qa | grep openssh | xargs -I '{}' sudo rpm -V '{}'`
• If the above command shows files in which you did not change anything, there is a high chance your system is compromised.
• Run Rootkit Hunter (rkhunter) to check whether your system is compromised:
• Download rkhunter.tar.gz
• Copy it to /root and go to /root
• `tar zxvf rkhunter-1.4.2.tar.gz`
• `cd rkhunter-1.4.2/`
• `sh installer.sh --layout default --install`
• Make the following changes in /etc/rkhunter.conf: `ENABLE_TESTS="all" DISABLE_TESTS="none" HASH_CMD=SHA1 HASH_FLD_IDX=4 PKGMGR=RPM`
• `/usr/local/bin/rkhunter --propupd`
• `/usr/local/bin/rkhunter --update`
• `/usr/local/bin/rkhunter -c -sk`
• Note the output, or check and copy /var/log/rkhunter.log
• You can also check the link for using rkhunter
• Check /var/log/secure to see whether there are many authentication failure requests, i.e. someone trying to brute-force their way into the system.
• The command will be the following:
• `[root@localhost ~]# less /var/log/secure | grep 'authentication failures'`
• and the output will be something like:
• `Apr 25 12:48:46 localhost sshd[2391]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=192.168.29.14 user=root`
• `Apr 25 12:49:33 localhost sshd[2575]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=192.168.29.14 user=root`
• In the output above you can see the rhost from which the login attempts were made. If you see lots of entries like this, also check whether at some point a login attempt from any of the attacking rhosts succeeded. In the secure logs, accepted logins will look something like this:
• `Apr 25 12:53:10 localhost sshd[3551]: Accepted password for root from 192.168.29.14 port 36362 ssh2`
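To see at a glance which hosts are hammering you, here is a sketch that counts failures per source host (it assumes the PAM `rhost=` log format shown in the sample lines above):

```shell
#!/bin/sh
# Sketch: count authentication failures per rhost in a secure log.
# Assumes the PAM "rhost=<ip>" format shown in the sample lines above.
count_by_rhost() {
  awk '/authentication failure/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^rhost=/) { sub(/^rhost=/, "", $i); c[$i]++ }
  } END { for (h in c) print c[h], h }' "$@" | sort -rn
}

# Live usage: count_by_rhost /var/log/secure
# Demo on a sample line:
sample='Apr 25 12:48:46 localhost sshd[2391]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=192.168.29.14 user=root'
printf '%s\n' "$sample" | count_by_rhost
# prints: 1 192.168.29.14
```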

• Check the processes to see whether some unusual process is running and consuming high CPU, using the top and ps commands.
• Command to list all processes running on the system: `ps aux | less`
• Also check, using the top command, whether some unusual process is trying to utilize high CPU.
• Check whether there is an unusual entry in the crontab of any user on the system:
• `crontab -u <user> -l` (by default the user is root)
• Check the .ssh folder in every user's home directory and verify that no attacker has somehow added their own public key to authorized_keys.
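The `.ssh` check above can be scripted; a sketch (attackers typically add their key to `authorized_keys`; a standard `/root` plus `/home/*` layout is assumed):

```shell
#!/bin/sh
# Sketch: print every authorized_keys file under a base directory of home
# directories, so unexpected attacker-added keys stand out.
list_authorized_keys() {  # $1 = directory containing home directories
  for dir in "$1"/*; do
    f="$dir/.ssh/authorized_keys"
    if [ -f "$f" ]; then
      echo "== $f"
      cat "$f"
    fi
  done
}

# Live usage (plus root's own file):
list_authorized_keys /home
if [ -f /root/.ssh/authorized_keys ]; then
  echo "== /root/.ssh/authorized_keys"
  cat /root/.ssh/authorized_keys
fi
```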

This was Part 1 of the blog. In later parts I will explain some further checklists to ensure that your system remains less hackable.

Thank you.