Tag Archives: bitmap

How do succinct data types work in libcds?

About libcds

Libcds is a library which implements low-level succinct data structures to provide the building blocks of most compressed/succinct solutions. Libcds is written in C++.

There are 4 kind of succinct data structures in libcds which are based on bitmaps:

  • DArray
  • RG
  • RRR
  • SDArray

What is common in these structures?

Each structure can be built from a BitString. BitString is the data container of libcds which consists of the following:

size_t length;
size_t uintLength;
uint * data;

“length” is the count of bits (of the whole data).
“uintLength” is the count of uints at the memory where the “data” pointer points.
“data” is pointer to uints which hold the data.

size_t is at least 32 bit on 32 bit systems and at least 64 bit on 64 bit machines (because it’s the type of the size parameter of malloc). So you can hold a maximum of 64^2 bits in this structure on a 64 bit system.

How does RG work?

Basics

RG uses a superblock table where each element tells how many ones present till it’s corresponding data group. When you create this structure you have to define a factor which tells how many data blocks must be grouped together.

libcds-rg-basics

Building the RG structure

When you build an RG structure you have to define the bitstring and a factor.

BitSequenceRG::BitSequenceRG(const BitString & bs, uint _factor);

bs is the source data
_factor means that how many uints of bs will make a superblock

This class holds the following variables:

size_t n = bitcount of bs
size_t integers = number of uints in bs
size_t factor = _factor
size_t s = 32 * factor
uint *data;
uint *Rs;

s is the number of bits in a superblock.
Rs is the array of superblocks.

The constructor function creates an uint array named “data” with length “integers”. The uints from bs copied here. The remaining space is filled with zeros.

Then it builds the array of superblocks by calling “void BitSequenceRG::BuildRank()”.

The superblock array (Rs) is filled up in this way:

  1. The first element is 0 because rank is obviously 0 before the first bit of data.
  2. The further elements are calculated as: the previous element + the number of ones in the previous block of “data” (with block size of “s” bits).

How does rank work on RG?

size_t BitSequenceRG::rank1(const size_t i1);

i1 is the index where we want the rank.

First it gets the element with index i1/s from the superblock array (Rs).
This will tell the number of ones till the block’s start (where the index maps).

libcds-rg-rank1

Second it iterates through all uints in data from block’s start to just before the uint comes where the index maps. It counts ones in these uints and sums them. The uint where the index maps has to be masked before it can count the ones in it (because the remaining part must be dismissed). Than number of ones in it are counted and summed with the previous result.

libcds-rg-rank2

This result is the rank.

How does select work on RG?

size_t BitSequenceRG::select1(const size_t x1);

We are looking for the index of the x1-th one in data.
x1 parameter is stored as x variable in this function.

First it does a binary search in the superblocks array (Rg) until it finds the greatest element which is less than x. This means that where this superblock maps (start of uint block in “data”) there are fewer ones than x.

libcds-rg-select1

Second it subtracts the found superblock element from x. So x will equal to the remaining ones which need to be counted from this block.

( in the example the found element is 0, so  x won’t change )

Third it iterates the uints from this block and subtracts the number of ones in them from x. This goes until a block which has more ones than the current x. It means that the select result will be in this uint block. This block’s value is saved and future work is done on it.

libcds-rg-select3

Fourth it iterating through 8 bit parts of this uint (by shifting and substracing number of ones of these parts) until a 8 bit part has more or equal ones than the current x.

libcds-rg-select4

Fifth it iterates through this 8 bit part (bit by bit) by shifting and subtracts the current bit value by masking from x.

It returns the bit index where the last one has been found.

Note:
RG uses popcount and popcount8 function where it counts the number of ones in a 8/32 bit variable. These functions use array “const unsigned char __popcount_tab[]”, which is a lookup table with 256 elements.

For a 8 bit value it gets the number of ones in step (be using the value as the index of the table). So this table contains the number of ones of every possible 8 bit value.

For a 32 bit value it gets the number of ones by making 4 x 8 bits from it (by masking) and summing their value in the lookup table.

Time and space requirement

n – data length in bits
factor – how many blocks to group together (for superblocks)

Building the structure:
time(n, factor) = O(n)
space(n, factor) = O(n/factor)*

* It’s O(n) in the current implementation because it makes a copy of the original data, but it can be easily fixed.

Rank:
time(n, factor) = O(factor)

Select:
time(n, factor) = O(log(n/factor)) + O(factor)

Problems and inefficiencies in RG

  • Every array elements are copied one by one. Zeroing is also done one by one. A standard library call should be used instead!
  • There are two elements which have the same meaning: constant “W” in cds-utils namespace and “b” a class variable in BitSequenceRG. They are set to 32 (the bit-length of uint). It is a bad practice to use two variables for the same purpose (maybe you forget to change both). In my notes I changed them to constant 32.
  • uint is used everywhere in the code. Because uint is not a standard C++ data type the code will not work consistently (or at all) across different platforms. On my 64 bit machine with Linux and GCC 4.8.2/Clang 3.3 uint is defined as unsigned int and it is a 32 bit data type.
  • On 64 bit platforms a 64 bit storage type (like uint64_t) should be used for performance
  • popcount functions are great however on modern processors it could be optimized to one opcode by using assembly.

Go to How do succinct data types work in libcds? – Part II

Access, rank and select queries on bitmaps

What is access?

Access tells you the value at a given index in a bitmap.

Example 1

bitmap_ex1

Example 2

bitmap_ex2

In programming sense it is a function with 2 arguments:

int access(void *bitmap, int index)
{
 // ...
 return value;
}

What is rank?

Rank tells you how many 1s are in a bitmap till a given index.

Example 1

bitmap_ex1_1

Example 2

bitmap_ex1_2

In programming sense it is a function with 2 arguments:

int rank(void *bitmap, int index)
{
 // ...
 return numofones;
}

What is select?

Select tells you at which index is the n-th 1 in a bitmap.

Example 1

bitmap_select_1

Example 2

bitmap_select_2

In programming sense it is a function with 2 arguments:

int select(void *bitmap, int numofones)
{
 // ...
 return index;
}

What is the use of these queries?

They are used in implementations of succinct data structures. With these structures data can be compressed in a way to be able to use in-place, without decompressing it first. By using these queries on these structures you can solve a lot of problems.

In these examples bitmap should be compressed of course by using a succinct data structure (like RRR) for size and an efficient rank/select implementation.

Example 1

You have a text which consists of words.

  1. You want to get the nth word of this text.
  2. You want to know how many words are in a range.

How can you accomplish this?

Make a bitmap which maps the words’ start position.
In this bitmap:
– 1 means a word’s start
– 0 means other

bitmap_text_words

With select you can solve the first problem:

index = select(bitmap, n);

Now you can read the nth word from this index.

With rank you can determine the words number in a range (a,b):

numofwords = rank(bitmap, b) - rank(bitmap, a);

Notice:
There should be a check here for the last word’s end (it can be out of range).

Example 2

There is a long line of containers at customs authority. Some are empty, others are filled with goods.

There are not enough officiers to check for all the containers (verifing if it contains what is written on it).
Check every nth occupied container!

You have a bitmap which maps containers.
In this bitmap:
– 1 means a container is filled with goods
– 0 means a container is empty

bitmap_containers

With select you can solve this problem:

index1 = select(n);
index2 = select(2*n);
// ..

Example 3

There is a large prison with lot of cells. A cell is empty or occupied by one prisoner. The cells are indexed in lines. The security system failed on one passage and all cells on this line accidentally opened.

How many prisoners did escape?

You have a bitmap which maps cells.
In this bitmap:
– 1 means a cell is occupied by a prisoner
– 0 means a cell is empty

With rank you can solve the problem:

number = rank(index_end) - rank(index_start);

PBM P4 image file format

What is PBM P4?

PBM P4 is a hybrid monochrome file format, a mixture of ASCII and binary data.
It can be opened by several imaging software (eg. Gimp).

PBM P4 structure

Source of example.pbm :

P4
# CREATOR: bitmap2pbm Version 1.0.0
8 4 @0@0

The file starts with the ASCII section and ends with the binary section.

ASCII section

“P4” is the magic word which identifies the file format
The line starts with “#” is a comment (optional).
“8 4” is the dimension of the binary data
The ASCII section ends with exactly one whitespace after dimension

A character is a whitespace if isspace() returns true for it.
A comment line must end with CR of LF.
Any number of whitespaces could appear between the tokens.

Binary section

Contains the image data, where every bit corresponds to one pixel.
The example file consists of “@0@0” which is 4 bytes => 32 bits total.
Character code of “@” is 64, which is 01000000 in binary.
Character code of “0” is 48 which is 00110000 in binary.

“8 4” dimension means 4 lines of 8 bit binary data.
pbm8x4

“16 2” dimension means 2 lines of 16 bit binary data.
pbm16x2

BUT!

“5 4” dimension means 4 lines of 8 bit binary data, where only the most significant 5 bits are used.
pbm5x4

“15 2” dimension means 2 lines of 16 bit binary data, where only the most significant 15 bits are used.
pbm15x2

With lines there are no problems, they can be anything > 0.

Bitmap manipulation programs

bitmapdd

This program creates a bitmap from a file (or device). It’s mainly used for creating a usage map of input but it can also do conversions.

You can download bitmapdd from GitHub:

$ git clone https://github.com/andmaj/bitmapdd.git

Scenario 1:

You have a file which consists of blocks of data. If a block is full of zeros than it’s free, otherwise it’s used. You want to make a bitmap from it where a bit in the file is 0 if the corresponding block is zero, otherwise it’s 1.

For example with block size set to 4:

$ bitmapdd --bs 4 --if input.dat --of output.dat

bitmapdd1

Scenerio 2:

Converting a text of zeros and ones to a binary file where every bit corresponds to one character in the original file.

$ bitmapdd --bs 1 --null 48 --if usagemap.txt --of usagemap.dat

Note: null byte has been set to 48 which is the code of character “0” in the ASCII character table. 

usagemap.txt contains text:
001100000100000001000001

usagemap.dat will contain text:
0@A

Character Decimal code Binary code
0 48 00110000
@ 64 01000000
A 65 01000001

bitmap2pbm

Creates a P4 type PBM image from a binary file. With this program you can visualize your binary (for example a usage map).

You can download bitmap2pbm from GitHub:

$ git clone https://github.com/andmaj/bitmap2pbm.git

How to use

For example creating an image of the first 10000 bytes of memtest binary:

$ head -c 10000 /boot/memtest86+-4.20 | bitmap2pbm --of memtest.pbm

You can view the image in Gimp.

bitmap2pbm

 

fat2bitmap

Creates a bitmap from FAT file system free/used clusters. The bitmap is in text format so contains zero (character 48) and one (character 49) bytes.

A zero means that the cluster is free, a one means that the cluster is used.

You can download fat2bitmap from GitHub:

$ git clone https://github.com/andmaj/fat2bitmap.git

How to use

Create a usage map of FAT file system (from filesys.iso image file)

$ fat2bitmap --if filesys.iso --of usagemap.txt

Convert the result to a binary image map with bitmapdd:

$ bitmapdd --bs 1 --null 48 --if usagemap.txt --of usagemap.dat

And finally display the usage map:

$ bitmap2pbm --if usagemap.dat --of usagemap.pbm
$ gimp usagemap.pbm &

You can also do these steps with one command:

$ fat2bitmap --if filesys.iso | bitmapdd --bs 1 --null 48 | bitmap2pbm --of usagem
ap.pbm

Note:
To create a FAT file system image file follow my guide:
http://fejlesztek.hu/create-a-fat-file-system-image-on-linux/