Understanding Postgresql Indexes For Beginners- Part -1 (BTree)

In this series we will understand postgresql indexes and also when to use them with sample cases.

Also we will be comparing certain indexes to understand which one to use when.

Let’s Begin:

B-Tree

The most common , highly used index on Postgresql or in any Sql DB.

What btree does is create a self balancing tree (with multiple child nodes not 2) in which the leaf nodes point to the exact row location.

Also the Btree all leaf nodes pointers point to adjacent node using doubly linked list. (this primarily help in getting the sorted data in very fast as we just reach to first element and then traverse further will see this in action).

Lets see a diagram to understand: here the leaf nodes are marked in white and they points to each other and contain row address and you can see they are in sorted order.

How to use?

First create a table with dummy data to understand, here i am creating 1 million rows:

create table student as select s as rollno, md5(random()::text) as name  from generate_Series(1,1000000) s;

Use Case 1 : Query for equal

Now lets query this to understand Btree benefits:

explain analyze select * from student where rollno=9090;
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..15375.79 rows=5292 width=36) (actual time=10.074..98.045 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on student  (cost=0.00..13846.59 rows=2205 width=36) (actual time=54.934..83.723 rows=0 loops=3)
         Filter: (rollno = 9090)
         Rows Removed by Filter: 333333
 Planning time: 0.197 ms
 Execution time: 98.084 ms
(8 rows)

Now create index:

create index roll_idx on student(rollno);

Lets try again:

explain analyze select * from student where rollno=9090;
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Index Scan using roll_idx on student  (cost=0.42..8.44 rows=1 width=37) (actual time=0.106..0.108 rows=1 loops=1)
   Index Cond: (rollno = 9090)
 Planning time: 0.426 ms
 Execution time: 0.154 ms

Now you can see that total time taken after index is ~500 times faster than normal as in normal case it was doing a parallel scan(in older version of postgres it was also sequential scan).

Use Case 2: Query for range

explain analyze select * from student where rollno > 9090 and rollno < 10008;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Index Scan using roll_idx on student  (cost=0.42..41.52 rows=905 width=37) (actual time=0.017..0.605 rows=917 loops=1)
   Index Cond: ((rollno > 9090) AND (rollno < 10008))
 Planning time: 0.215 ms
 Execution time: 0.715 ms
(4 rows)

Now you could see this works well in case of range.

No no it is not checking for all the range value one by one , it is using the leaf property of pointing to each other in sorted order.

Use Case in case of string:

in case of string first create index and check :

create index name_idx on student(name);
explain analyze select * from student  where name = 'b170e15823193100056e09725aec94c4';
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Index Scan using name_idx on student  (cost=0.42..8.44 rows=1 width=37) (actual time=0.115..0.116 rows=0 loops=1)
   Index Cond: (name = 'b170e15823193100056e09725aec94c4'::text)
 Planning time: 0.390 ms
 Execution time: 0.151 ms

See this works perfectly well.

When to use?

  • When you are querying a table frequently on this column by filtering
  • If you want data sorted also in that order
  • Date , Integer , String are good candidates for this
  • DO NOT USE if you are searching ilike or like type of queries. There are diff. indexes for that(in next blogs)
  • DO NOT JUST CREATE A INDEX ON EACH COLUMN . This has a huge overhead for postgres (will discuss in series)

Next – Multi Column B-Tree

please subscribe for more such indepth blogs.

How and when and what to log ?

We as a developers understand the importance of coding but most of us do not know the right way to log.

While seeing others code you must have seen various patterns of logging:

  • Log everything
    • on start of any function call
    • on every function call
    • before every return statement
  • Don’t log anything
    • just don’t log
  • Log only Exceptions
    • Log only when exceptions occur

I think everybody must have seen all type of developers, lets understand when to log and also in diff. scenarios , but before that first see the purpose of logging:

Purpose of Logging

  • Provides insights about the flow of your code, what happen , what are the parameters which were passed
  • You could virtually run the entire flow of your code and understand the problems while debugging any unexpected behavior in your code.

Lets go scenario by scenario

In the following i want you see the reasoning to when to log that may help you to decide in your scenarios.

  • Http Service Call / Any third party
    • to debug any service hit , you will require the following:
      • parameters which are passed to it log using info, always check whether in your environment it is already done by framework library code (like in rest filters , tomcat logs) then do not log, generally its a good practice to handle this at framework level rather than everybody taking care of this
      • Now if you are writing pure functions (not using global variables, not modifying any variable provided) then mostly you would not require to log any thing in those functions(will be writing a blog on the pure functions), as you could drive most of the things from the input.
      • If any specific decision that will completely change the flow of the code , we should generally log the decision and reason in info
      • Log any exception which happens in to the system with type error
      • any parameter you are reading from system , generally log this with fine as you can cross check this , do not just log , not a hardline if some function is called limited number of times
      • calling another http service –
        • generally you can log the parameters here with info , it becomes easy to debug the parameters passed rather than looking into other service log if it is on other machine only and no central logging.
  • Timer or Scheduler Jobs
    • In case of timer first thing is , if timer is every some seconds log the time in which it started
    • rest more or less remain the same as in above
  • Service to Service(within same process, Spring Services) Calls
    • When service call to another service the calling service should not log generally assuming the responsibility is with the called service. rest remain same as above in http one
  • Any Http Connections
    • generally we should log the parameters passed to http with info , and also should check if some parameters not to print as they are internal and should not leak
  • Microservice to Microservice Communications
    • Assuming you would be using some trace id , if that Service logs are not available generally the request and response , keeping in mind do not print business critical parameters(like password of some sort).
  • DB Queries
    • before calling a query we should generally log with info the parameters we are passing
  • Now when to log Fatal
    • In code is something very unexpected which happens like db configuration not provided log with fatal or other api url is not configured but that should be without it use case will not work the log with fatal
  • Now when to log fine
    • Generally anytime when there are decisions taken in the code one can log on fine. The configurations read by the system we can log with fine, Some internal data structures created or modified or any lifecycle event like DB connection closed we can log such things with fine. But do not add lots and lots of log , it can kill your system.

Some of the cross cutting concerns we should know

  • Flow Id or Trace Id :
    • for any logs which are part of a single flow (ie a single thread doing it in sync ) there should be a flow id or trace id in all logs.
    • Even if the flow which are started from logically the same flow but then one async api is called (@Async in Spring) in that also one should have the same flow id passed with some additional info.
    • For distributed we you can check the link .
  • What to Log:
    • Use some trace id like user id , or object id for which a set of API or function call is done so that it becomes easy to figure out for which function was called

As we can’t cover all of them i think you the above examples help to choose the right mix.

Also please be sure when you log as lots of logging could kill your system. Do read the link.

Understand the intuition for databases

This article is for beginners who are starting their journey in software development and database looks like a jargon to them .

In this we will cover the intuition for databases , the lingo and types all will be covered in future posts.

Let’s begin by understanding through an example. Lets define our problem:

  • We are a online site which allows users to register on our website and registered users can see other registered users and send messages to them.
  • Now lets assume we are building this in an era where there are no database software available. So what will we do.

First understand our business requirement:

  • When user come on our site and register with user id and password , we should store this somewhere on a persistent storage (hard disk , available after restart).
  • We should be able to retrieve this user info from our persistent storage to validate when somebody comes for login with user id and password to check whether its correct or not.

Naive Solution 1 :

We will write user and their password in a file called "users" in the system in a sequential manner , lets look at the file:

Simple right, now if somebody want to read the userid and password , we just read the entire file check for each row , break it by semicolon ';' and check whether user id given matches the first part of the line (after breaking through semicolon) and second part matches the password provided and one.

But hey there are certain problems in the above structure (i know you have already guessed):

  • It will be very slow if user count is large (as one need to check all rows till the point user matches)
  • if somebody put semicolon in their user name( ahhh easy we wont allow it , but still there is an extra check)

So lets discuss again :

What we will do we will create multiple files for each alphabet ie user_a for user name starting with a , user_b for user name starting b and so on. this way we will reduce the size by a significant amount.

Hey but still if for a there are thousands of users still it will be a problem. Hmmm still stuck.

Lets think again:

Hey we will create a Binary Search tree type structure in File where each node points to the row on the user file (a pointer to exact location of that user in that file so that we can read that file directly), i am just showing a very very basic version for understanding down below:

Now when we do this , this solves problem to some level.

But there are so many other use cases which you will come through:

  1. delete some user – now the index is also needs to be managed
  2. backup of these files if hard disk corrupt
  3. if now i wants to store the user last name as well in this
  4. ……

There are very very few which i have written , the main point here is that all these things will be require by all companies , so database software try to solve the data persistence and query problems , so companies can focus on their business.

I think by this you got the basic intuition about the databases. Done – now go read about various type of databases which are available and what all purpose they do.

Want to be a better programmer – Read , Read , Read – But How?

If one wants to become a better programmer one thing is for sure that one needs to understand others code.

For beginners when we start writing the code, when using libraries very basic like List , Set , we get stuck in basic things like:

  • What this class do?
  • How to use this class?

Now this comes because we are missing one important trick:

  • First associate a purpose for the class be reading its definition and by name.
  • Now if you have the purpose you will be able to automatically make out what the functions in this class should be
  • Same goes for the functions first associate a purpose.

Now if you start reading the code by understanding the purpose of the class, you will be able to understand and use that class.

Now the next thing you should do is when you understood the purpose is :

  • look at their internal implementations
  • understand the data structure or variables they have declared
  • try to reason about the purpose why they have declared like that
  • Any pors / cons – alternative implementation you can think of.

I bet if one starts doing this in initial part of their programming career, this would help them:

  • to reason lots about other libraries
  • understand new libraries faster
  • also while writing new libraries one will be able to choose right data structures.

Lets see a demo of the above theory:

We all know there is Collections in java which looks somethings like this:

Java Collection Structure

Now lets start first by assigning purpose and see how things follow automatically :

  • Collection Interface – it says one can have some objects inside me.
    • so the functions should be in this class
      • one function to add a object
      • one function to remove a object
  • List Interface – it says its a collection but has a ordering for objects means you get in the order you put
    • so the extra functions in this class should be
      • get on a particular position get(int i)
      • add on a particular position
  • Set Interface – it also says it is a collection but it contains only unique objects and does not care about ordering
    • so the functions in this class
      • one is contains to check whether this already exist as it will not add a object twice

You see by just understanding the class purpose we could easily make out the functions , their purpose.

I am leaving the next part to you guys on looking at the internal implementation. Will make a next blog for this.

Naming is the most important thing

There are only two hard things in Computer Science: cache invalidation and naming things.

This quote is one of my favorite programming quotes and we will discuss the reason why here

We will discuss the naming things part here.

In your entire career in coding what you will do most is reading others code or reading others libraries or new technologies and understanding them, solving bugs in the existing code. If you agree lets move forward.

Now a small story, there was a bug in our existing system and one of my fellow programmer was working on that. The bug was in the core libraries written years back and to add to his woes there was no documentation neither external and nor on code.

He was working on it for days with no result and we were discussing the problem and he said i was unable to understand the code, now i start debugging with him and the bug was solved in minutes (not to brag about my skills), and he said how you are able to go to the problem class and function so easily. I said its just intuition from the naming of the classes and function.

Here i understood its still not natural or important for people the naming convention.

Story part over.

Lets discuss that – what is this intuition thing all about . What i learnt in my coding career that before naming a class first associate a purpose with that class and name the class such that one can understand most part of the purpose with that name and then start adding function in that class and any function if not in alignment to that class purpose then that function should not be there and also through the function name one should be able to make out its implementation without looking at the code.

Now people who read lots of code will realize that’s how most of the good code are written and most of the libraries you will come around will follow this.

Now with the intuition you got from the name without reading the code one can make out the functions and so on. This thing happen in real life as well, if i say something a screen you know mostly what it does (if you think in right context).

Similarly if we start naming our classes or functions correctly this not just help us but help other programmers who will be working on it later.

So be very careful about when you name anything (class , function ) it should convey the purpose.

Also as a developer you should and must build the intution from naming and read code.

Start naming things correctly and read code with the right intuition – dont just jump into the code.

Akka (Actor) – Hello World

Theory

In this our main aim is to a very basic hello world in Akka Actor and also realize the basic diff. between on how the actor arch. is diff.

Let’s write the code first

First add the following in the maven file:

<dependency>
	        <groupId>com.typesafe.akka</groupId>
	        <artifactId>akka-actor_2.11</artifactId>
	        <version>2.4.4</version>
</dependency>

Now write the main HelloWorldActor.java file:

import akka.actor.AbstractActor;
import akka.actor.Props;
import akka.japi.pf.ReceiveBuilder;

public class HelloWorldActor extends AbstractActor {
public HelloWorldActor() {
    receive(ReceiveBuilder
        .match(PrintMessage.class, this::sayHello)
        .build());
}

private void sayHello(final PrintMessage message) {
    System.out.println(message.getMessage());
}


public static Props props() {
    return Props.create(HelloWorldActor.class);
}

public static class PrintMessage{
    private String message;
    public PrintMessage(String message) {
        this.message = message;
    }

    public String getMessage() {
        return this.message;
    }
}
}

and in HelloWorldActorMain.java file write :


import akka.actor.ActorRef;
import akka.actor.ActorSystem;

public class HelloWorldActorMain {
	
	public static void main(String[] args) {
		ActorSystem as = ActorSystem.create("hello-world");
		ActorRef helloWorldActor = as.actorOf(HelloWorldActor.props());
		helloWorldActor.tell(new HelloWorldActor.PrintMessage("Hello World"), null);
	}
}

Let’s Understand Actor Perspective

  • We created two files here one is
    • HelloWorldActor – this represents the actor class
      • on constructur we create the behaviour for the actor ie for eg: when actor recieves an Object of PrintMessage class as a Message it will call the sayHello function
      • every actor must describe its behavior on what all messages it can process.
    • HelloWorldActorMain – this is for the main function
  • In main we first created a actor system which we generally create single only per JVM
  • Then we created a ActorRef (it does not represent the HelloWorldActor class object). This is a generic object of ActorRef which takes your messages , queues it and process the actual message on HelloWorldActor object.
  • Now sent a message to Actor – saying “Hello World” in PrintMessage object form.
  • and internally actor call a function sayHello and passing the PrintMessage object as an argument
  • we must realize here that normally we directly call function sayHello of a class and here we are calling function of ActorRef of HelloWorldActor to pass a Message
  • Also one very important thing is diff. here which is normally when we call sayHello of a class same thread go and does the work, but here diff./same thread actually does the work of processing the message
  • When we call the sayHello function our message go into a Queue type object and some other thread read and works on this.

Summary

In this hello world – we write a basic Hello World in Akka (Actor based System)

Also we realized how Actor based development is diff. from the normal function call between classes.

Java – hello world

Here we will write Hello World in java and try to understand basics of how it works.

public class HelloWorld{
    public static void main(String[] args){
         System.out.println("Hello World");
    }

}
  • How to run?
    • Save the above code in a file with name – HelloWorld.java
    • if you are using an IDE like Eclipse/Intellij , right click on file and click run
    • or from command line
      • first run javac HelloWorld.java
      • then run java HelloWorld
  • Now Let’s understand
    • First of all we have described a class here with Name HelloWorld public class HelloWorld
    • In this we described a function with name main as public static void main
    • Main function with this signature is special to java as it calls this main function when your program runs.
    • Now in this main function we have written System.out.println("Hello World");
    • With this line we are telling the System to print the text passed(which is “Hello World” in this case) and system prints this on the command line.