Showing posts with label DBMS Tutorial (Database Management System [DBMS] Tutorial). Show all posts

Saturday, 14 June 2014

DBMS Hashing

Post By: Hanan Mannan
Contact Number: Pak (+92)-321-59-95-634
-------------------------------------------------------

DBMS Hashing

For a huge database structure, it is sometimes not feasible to search through all the index levels and then reach the destination data block to retrieve the desired data. Hashing is an effective technique for computing the direct location of a data record on the disk without using an index structure.
Hashing uses a function, called a hash function, that generates an address when called with a search key as its parameter. The hash function computes the location of the desired data on the disk.

Hash Organization

  • Bucket: A hash file stores data in buckets. A bucket is considered a unit of storage; it typically holds one complete disk block, which in turn can store one or more records.
  • Hash Function: A hash function h is a mapping function that maps the set of all search keys K to the addresses where the actual records are placed; that is, it is a function from search keys to bucket addresses.

Static Hashing

In static hashing, the hash function always computes the same address for a given search-key value. For example, a mod-4 hash function generates only 4 possible values (0 through 3), and the output address for a given key is always the same. The number of buckets remains fixed at all times.
[Image: Static Hashing]
Operation:
  • Insertion: When a record needs to be stored using static hashing, the hash function h computes the bucket address for search key K, and the record is stored there.
    Bucket address = h(K)
  • Search: When a record needs to be retrieved, the same hash function is used to compute the address of the bucket where the data is stored.
  • Delete: This is simply a search followed by a deletion operation.
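The three operations above can be sketched in Python (the keys and record values here are purely illustrative):

```python
# A minimal sketch of static hashing with a fixed mod-4 bucket count.
NUM_BUCKETS = 4
buckets = [[] for _ in range(NUM_BUCKETS)]  # each bucket models one disk block

def h(key):
    """Hash function: maps a search key to a bucket address."""
    return key % NUM_BUCKETS

def insert(key, record):
    # Bucket address = h(K); the record goes into that bucket.
    buckets[h(key)].append((key, record))

def search(key):
    # The same hash function locates the bucket on retrieval.
    for k, record in buckets[h(key)]:
        if k == key:
            return record
    return None

def delete(key):
    # Deletion is simply a search followed by removal.
    b = buckets[h(key)]
    b[:] = [(k, r) for k, r in b if k != key]

insert(101, "Alex")
insert(105, "Maria")   # 105 % 4 == 1, same bucket as 101: a collision
```

Because the bucket count is fixed, keys 101 and 105 collide in bucket 1, which motivates the overflow handling discussed next.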

BUCKET OVERFLOW:

The condition of bucket overflow is known as a collision. Collisions are unavoidable for any static hash function, so an overflow-handling scheme is needed.
  • Overflow Chaining: When a bucket is full, a new bucket is allocated for the same hash result and linked after the previous one. This mechanism is also known as separate chaining (closed addressing).
  • [Image: Overflow chaining]
  • Linear Probing: When the hash function generates an address at which data is already stored, the next free bucket is allocated instead. This mechanism is also known as open addressing.
  • [Image: Linear Probing]
For a hash function to work efficiently and effectively, the following should hold:
  • The distribution of records over buckets should be uniform.
  • The distribution should be random rather than dependent on any ordering of the keys.
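Linear probing can be sketched as follows; the 8-slot table size and the keys are illustrative:

```python
# Linear probing (open addressing): when the hashed slot is taken,
# scan forward for the next free slot.
SIZE = 8
table = [None] * SIZE

def insert(key, record):
    i = key % SIZE
    for step in range(SIZE):
        slot = (i + step) % SIZE
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, record)
            return slot
    raise RuntimeError("table full")

def search(key):
    i = key % SIZE
    for step in range(SIZE):
        slot = (i + step) % SIZE
        if table[slot] is None:
            return None          # an empty slot ends the probe sequence
        if table[slot][0] == key:
            return table[slot][1]
    return None

insert(3, "A")
insert(11, "B")   # 11 % 8 == 3: collision, so it lands in slot 4
```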

Dynamic Hashing

The problem with static hashing is that the bucket set does not expand or shrink dynamically as the database grows or shrinks. Dynamic hashing provides a mechanism in which data buckets are added and removed dynamically, on demand. Dynamic hashing is also known as extendible hashing.
In dynamic hashing, the hash function is made to produce a large number of values, of which only a few are used initially.
[Image: Dynamic Hashing]

ORGANIZATION

A prefix of the entire hash value is used as the hash index; that is, only a portion of the hash value is used for computing bucket addresses. The hash index has a depth value n, which tells how many bits of the hash value are currently in use. These bits can address 2^n buckets. When all these buckets are full, the depth value is incremented by one and the number of addressable buckets doubles.

OPERATION

  • Querying: Look at the depth value of hash index and use those bits to compute the bucket address.
  • Update: Perform a query as above and update data.
  • Deletion: Perform a query to locate desired data and delete data.
  • Insertion: Compute the address of the bucket.
    • If the bucket is already full:
      • Add more buckets.
      • Use an additional bit of the hash value.
      • Redistribute the records using the new bit.
    • Else:
      • Add the record to the bucket.
    • If no further bits of the hash value are available, fall back to the overflow remedies of static hashing.
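The insertion procedure above can be sketched as a simplified extendible-hashing directory in Python; the bucket capacity of 2 and the use of the low-order bits of the key as the hash value are illustrative choices:

```python
# Simplified extendible hashing: the directory is indexed by the low
# `global_depth` bits of the key; an overflowing bucket is split, and the
# directory doubles when the bucket's local depth equals the global depth.
BUCKET_CAPACITY = 2

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = {}

global_depth = 1
directory = [Bucket(1), Bucket(1)]

def insert(key, value):
    global global_depth, directory
    idx = key & ((1 << global_depth) - 1)
    bucket = directory[idx]
    if key in bucket.items or len(bucket.items) < BUCKET_CAPACITY:
        bucket.items[key] = value
        return
    # Overflow: split the bucket, doubling the directory if needed.
    if bucket.local_depth == global_depth:
        directory = directory + directory   # twice the directory entries
        global_depth += 1
    bucket.local_depth += 1
    new_bucket = Bucket(bucket.local_depth)
    # Directory entries whose distinguishing bit is 1 move to the new bucket.
    for i in range(len(directory)):
        if directory[i] is bucket and (i >> (bucket.local_depth - 1)) & 1:
            directory[i] = new_bucket
    # Redistribute the old bucket's entries, then retry the insert.
    old_items = bucket.items
    bucket.items = {}
    for k, v in old_items.items():
        insert(k, v)
    insert(key, value)

def search(key):
    return directory[key & ((1 << global_depth) - 1)].items.get(key)

insert(0, "a")
insert(1, "b")
insert(2, "c")
insert(4, "d")   # the bucket for suffix 0 overflows, so the directory doubles
```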
Hashing is not favorable when the data is organized with some ordering and queries require ranges of data; hashing performs best when the data is discrete and random.
Hashing algorithms and implementations have higher complexity than indexing, but all hash operations are done in constant time.

Posted By Mirza Abdul Hannan at 3:24:00 pm

DBMS Indexing


DBMS Indexing

We know that information in DBMS files is stored in the form of records. Every record is equipped with some key field, which allows it to be recognized uniquely.
Indexing is a data structure technique to efficiently retrieve records from database files based on some attributes on which the indexing has been done. Indexing in database systems is similar to the one we see in books.
Indexing is defined based on its indexing attributes. Indexing can be one of the following types:
  • Primary Index: If the index is built on the ordering key field of the file, it is called a primary index. Generally this is the primary key of the relation.
  • Secondary Index: If the index is built on a non-ordering field of the file, it is called a secondary index.
  • Clustering Index: If the index is built on an ordering, non-key field of the file, it is called a clustering index.
The ordering field is the field on which the records of the file are ordered; it can be different from the primary or candidate key of the file.
Ordered Indexing is of two types:
  • Dense Index
  • Sparse Index

Dense Index

In a dense index, there is an index record for every search-key value in the database. This makes searching faster but requires more space to store the index records themselves. Each index record contains a search-key value and a pointer to the actual record on the disk.
[Image: Dense Index]

Sparse Index

In a sparse index, index records are not created for every search key. An index record here contains a search key and a pointer to the data on the disk. To search for a record, we first follow the index to the location it points to; if the record we are looking for is not there, the system performs a sequential search from that point until the desired data is found.
[Image: Sparse Index]
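The sparse-index lookup (binary-search the index, then scan the block sequentially) can be sketched as follows; the block layout and keys are made up for illustration:

```python
# Sparse-index lookup sketch: one index entry per block (its first key),
# binary-search the index, then scan sequentially inside the block.
import bisect

# Sorted data file, modeled as blocks of (key, record) pairs.
blocks = [
    [(1, "A"), (3, "B")],
    [(5, "C"), (7, "D")],
    [(9, "E"), (11, "F")],
]
# Sparse index: the first search key of each block.
index_keys = [b[0][0] for b in blocks]   # [1, 5, 9]

def lookup(key):
    # Find the last index entry <= key, i.e. the block that may hold it.
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None
    for k, v in blocks[i]:               # sequential scan inside the block
        if k == key:
            return v
    return None
```

A dense index would instead keep one `index_keys` entry per record, trading space for a direct hit on every lookup.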

Multilevel Index

Index records comprise search-key values and data pointers. The index itself is stored on the disk along with the actual database files, and as the size of the database grows, so does the size of the index. There is an immense need to keep index records in main memory so that searches can be fast; but if a single-level index is used, a large index cannot be kept in memory as a whole, and this leads to multiple disk accesses.
[Image: Multi-level Index]
A multi-level index breaks the index down into several smaller indices, making the outermost level so small that it fits in a single disk block, which can easily be accommodated anywhere in main memory.

B+ Tree

A B+ tree is a multi-level index in the form of a balanced search tree (not a binary tree: each node can have many children). As mentioned earlier, a single-level index becomes large as the database grows, which degrades performance.
All leaf nodes of a B+ tree hold the actual data pointers. The B+ tree ensures that all leaf nodes remain at the same height, so the tree stays balanced. Additionally, the leaf nodes are linked in a linked list, which lets the B+ tree support sequential access as well as random access.

STRUCTURE OF B+ TREE

Every leaf node is at an equal distance from the root node. A B+ tree is of some order n, where n is fixed for every B+ tree.
[Image: B+ tree]
Internal nodes:
  • Internal (non-leaf) nodes contain at least ⌈n/2⌉ pointers, except the root node.
  • At most, an internal node contains n pointers.
Leaf nodes:
  • A leaf node contains at least ⌈n/2⌉ record pointers and ⌈n/2⌉ key values.
  • At most, a leaf node contains n record pointers and n key values.
  • Every leaf node contains one block pointer P that points to the next leaf node, forming a linked list.

B+ TREE INSERTION

  • A B+ tree is filled from the bottom up: every entry is inserted at a leaf node.
  • If a leaf node overflows:
    • Split the node into two parts.
    • Partition at i = ⌊(n+1)/2⌋.
    • The first i entries are stored in one node.
    • The rest of the entries (i+1 onwards) are moved to a new node.
    • The i-th key is duplicated in the parent of the leaf.
  • If a non-leaf node overflows:
    • Split the node into two parts.
    • Partition the node at i = ⌈(n+1)/2⌉.
    • Entries up to i are kept in one node.
    • The rest of the entries are moved to a new node.
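The leaf-split step can be sketched in Python, following the floor convention given above (the keys and record-pointer names are illustrative):

```python
def split_leaf(entries):
    """Split an overfull B+ tree leaf.

    `entries` is a sorted list of (key, record_pointer) pairs of
    length n + 1, where n is the order of the tree.
    """
    i = len(entries) // 2            # partition point i = floor((n+1)/2)
    left = entries[:i]               # first i entries stay in the old leaf
    right = entries[i:]              # entries i+1 onwards go to a new leaf
    separator = entries[i - 1][0]    # the i-th key is copied into the parent
    return left, right, separator

# A leaf of an order-3 tree overflows with 4 entries:
left, right, sep = split_leaf([(1, "p1"), (2, "p2"), (3, "p3"), (4, "p4")])
```

Note that the separator key is duplicated: it remains in a leaf and also appears in the parent, which is what distinguishes a B+ tree split from a plain B-tree split.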

B+ TREE DELETION

  • B+ tree entries are deleted at the leaf nodes.
  • The target entry is searched for and deleted.
    • If it also appears in an internal node, delete it there and replace it with the entry to its left.
  • After deletion, the node is tested for underflow:
    • If underflow occurs, redistribute entries from the node to its left.
    • If redistribution from the left is not possible, redistribute from the node to its right.
    • If redistribution from either side is not possible, merge the node with its left or right sibling.

Posted By Mirza Abdul Hannan at 3:12:00 pm

DBMS File Structure


DBMS File Structure

Related data and information are stored collectively in file formats. A file is a sequence of records stored in binary format. A disk drive is formatted into several blocks that are capable of storing records; file records are mapped onto those disk blocks.

File Organization

The method of mapping file records to disk blocks defines the file organization, i.e. how the file records are organized. The following are the types of file organization:
[Image: File Organization]
  • Heap File Organization: When a file is created using the heap file organization mechanism, the operating system allocates a storage area to that file without any further accounting details. File records can be placed anywhere in that area, and it is the responsibility of the software to manage them. Heap files do not support any ordering, sequencing, or indexing on their own.
  • Sequential File Organization: Every file record contains a data field (attribute) that uniquely identifies the record. In the sequential file organization mechanism, records are placed in the file in some sequential order based on that unique key field, or search key. In practice it may not be possible to store all records physically in sequence.
  • Hash File Organization: This mechanism uses a hash function computed on some field of the records. As we know, a file is a collection of records that has to be mapped onto the disk blocks allocated to it. This mapping is defined by the hash computation: the output of the hash determines the disk block where a record is placed.
  • Clustered File Organization: Clustered file organization is not considered good for large databases. In this mechanism, related records from one or more relations are kept in the same disk block; that is, the ordering of records is not based on a primary key or search key. This organization makes it easy to retrieve data for the particular join condition on which the data is clustered, but all other queries become more expensive.

File Operations

Operations on database files can be classified into two categories broadly.
  • Update Operations
  • Retrieval Operations
Update operations change data values by insertion, deletion, or update. Retrieval operations, on the other hand, do not alter the data but retrieve it after optional conditional filtering. In both types of operations, selection plays a significant role. Besides the creation and deletion of a file, several operations can be performed on files:
  • Open: A file can be opened in one of two modes: read mode or write mode. In read mode, the operating system does not allow anyone to alter the data; it is solely for reading, and files opened in read mode can be shared among several entities. In write mode, data modification is allowed; files opened in write mode can also be read, but cannot be shared.
  • Locate: Every file has a file pointer, which indicates the current position at which data is to be read or written. This pointer can be adjusted: using a find (seek) operation it can be moved forward or backward.
  • Read: By default, when a file is opened in read mode the file pointer points to the beginning of the file. There are options by which the user can tell the operating system where the file pointer should be located at the time of opening. The data immediately after the file pointer is read.
  • Write: The user can open a file in write mode, which enables editing its content, whether by deletion, insertion, or modification. The file pointer can be positioned at opening time or changed dynamically if the operating system allows it.
  • Close: From the operating system's point of view, this is the most important operation. When a request to close a file is made, the operating system removes all locks (if in shared mode), saves the data to secondary storage if it was altered, and releases all buffers and file handles associated with the file.
The organization of the data inside the file plays a major role here: seeking the file pointer to a desired record behaves differently depending on whether the records are arranged sequentially, clustered, and so on.
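These operations map naturally onto ordinary OS file calls. A sketch in Python, with an illustrative file name and fixed-length 8-byte records:

```python
# Open / locate (seek) / read / write / close, using ordinary file calls.
# "records.dat" and the record contents are illustrative.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "records.dat")
RECORD_SIZE = 8

# Open in write mode and store three fixed-length records.
with open(path, "wb") as f:
    for rec in (b"rec-0001", b"rec-0002", b"rec-0003"):
        f.write(rec)

# Open in read mode: no modification is allowed on this handle.
with open(path, "rb") as f:
    f.seek(2 * RECORD_SIZE)        # locate: move the file pointer to record 2
    third = f.read(RECORD_SIZE)    # read the data right after the pointer
# close happens when each `with` block ends, flushing buffers
# and releasing the file handle.
```

Fixed-length records make the locate step a simple multiplication; variable-length records would need an index or scan to find record boundaries.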

Posted By Mirza Abdul Hannan at 3:10:00 pm

DBMS Storage System


DBMS Storage System

Databases are stored as files, which contain records. At the physical level, the actual data is stored in electromagnetic format on some device capable of retaining it for a long time. These storage devices can be broadly categorized into three types:
[Image: Memory Types]
  • Primary Storage: Memory that is directly accessible by the CPU comes under this category: the CPU's internal memory (registers), fast memory (cache), and main memory (RAM), all placed on the motherboard or CPU chipset. This storage is typically small, very fast, and volatile: it needs a continuous power supply to maintain its state, so in case of power failure all data is lost.
  • Secondary Storage: The need to store data for longer periods and to retain it even when the power supply is interrupted gave rise to secondary storage. All memory devices that are not part of the CPU chipset or motherboard come under this category: magnetic disks, optical disks (DVD, CD, etc.), flash drives, and magnetic tapes, none of which are directly accessible by the CPU. Hard disk drives, which contain the operating system and are generally not removed from the computer, are considered secondary storage; the others are called tertiary storage.
  • Tertiary Storage: The third level in the memory hierarchy is called tertiary storage. It is used to store huge amounts of data, and because it is external to the computer system it is the slowest in speed. These devices are mostly used to back up an entire system; optical disks and magnetic tapes are widely used as tertiary storage.

Memory Hierarchy

A computer system has a well-defined hierarchy of memory. The CPU has built-in registers that hold the data being operated on, and the system has main memory, which is also directly accessible by the CPU. Because main memory access time and CPU speed differ greatly, cache memory is introduced to minimize the loss: it holds the most recently used data and data the CPU is likely to reference in the near future.
The memory with the fastest access is the costliest, which is the very reason for the memory hierarchy. Larger storage is slower, but it can hold huge amounts of data at far lower cost than CPU registers or cache memory.

Magnetic Disks

Hard disk drives are the most common secondary storage devices in present-day computer systems. They are called magnetic disks because they use magnetization to store information: they consist of metal platters coated with magnetizable material, stacked vertically on a spindle. A read/write head moves between the platters and magnetizes or de-magnetizes the spot under it; a magnetized spot is read as 0 (zero) or 1 (one).
Hard disks are formatted in a well-defined order to store data efficiently. Each platter has many concentric circles on it, called tracks, and every track is further divided into sectors. A sector on a hard disk typically stores 512 bytes of data.
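The geometry above gives a simple capacity calculation; the track and sector counts below are made-up figures, only the 512-byte sector size comes from the text:

```python
# Back-of-the-envelope disk capacity arithmetic (illustrative figures).
tracks_per_surface = 10_000
sectors_per_track = 500
bytes_per_sector = 512            # typical sector size

surface_bytes = tracks_per_surface * sectors_per_track * bytes_per_sector
print(surface_bytes // 2**30, "GiB")   # capacity of one platter surface
```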

RAID

Exponential growth in data volumes created the need for larger secondary storage, and RAID was introduced to meet it. RAID stands for Redundant Array of Independent Disks: a technology that connects multiple secondary storage devices and uses them as a single storage medium.
RAID consists of an array of disks connected together to achieve different goals. RAID levels define how the disk array is used:
  • RAID 0: At this level a striped array of disks is implemented. The data is broken down into blocks, and the blocks are distributed among all the disks; each disk receives a block to write or read in parallel. This enhances the speed and performance of the storage device. There is no parity or redundancy at level 0.
  • [Image: RAID 0]
  • RAID 1: This level uses mirroring: when data is sent to the RAID controller, it writes a copy of the data to every disk in the array. RAID level 1 provides 100% redundancy in case of a disk failure.
  • [Image: RAID 1]
  • RAID 2: This level records an error-correcting code (a Hamming code) for its data, striped across different disks. As at level 0, each data bit of a word is recorded on a separate disk, while the ECC codes of the data words are stored on a different set of disks. Because of its complex structure and high cost, RAID 2 is not commercially available.
  • [Image: RAID 2]
  • RAID 3: This level also stripes data across multiple disks in the array. The parity bit generated for each data word is stored on a dedicated parity disk. This allows the array to survive a single disk failure without losing data.
[Image: RAID 3]
  • RAID 4: At this level an entire block of data is written onto the data disks, and then the parity is generated and stored on a dedicated disk. The prime difference between levels 3 and 4 is that level 3 uses byte-level striping whereas level 4 uses block-level striping. Both levels 3 and 4 require at least 3 disks to implement.
[Image: RAID 4]
  • RAID 5: This level also writes whole data blocks onto different disks, but the parity generated for each block stripe is not stored on a dedicated disk; instead it is distributed among all the data disks.
  • [Image: RAID 5]
  • RAID 6: This level is an extension of level 5: two independent parities are generated and stored in a distributed fashion among the disks. The two parities provide additional fault tolerance. This level requires at least 4 disk drives.
  • [Image: RAID 6]
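The parity idea behind levels 3 through 6 can be sketched with XOR; the block contents here are illustrative:

```python
# RAID-style parity sketch: the parity block is the XOR of the data
# blocks, so any single lost block can be rebuilt from the survivors.
def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

d0 = b"\x01\x02\x03\x04"
d1 = b"\x10\x20\x30\x40"
d2 = b"\xaa\xbb\xcc\xdd"

# Parity written alongside the data blocks.
parity = xor_blocks(xor_blocks(d0, d1), d2)

# Simulate losing the disk holding d1 and rebuild it from the rest.
recovered = xor_blocks(xor_blocks(d0, d2), parity)
assert recovered == d1
```

RAID 6 extends this with a second, independently computed parity so that two simultaneous disk failures can be tolerated.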

Posted By Mirza Abdul Hannan at 3:09:00 pm

DBMS Joins


DBMS Joins

We understand the benefit of the Cartesian product of two relations, which gives us all the possible pairings of tuples. But a Cartesian product might not be feasible for huge relations, where the number of tuples runs into thousands and the attributes of both relations are considerably large.
A join is a combination of a Cartesian product followed by a selection. A join operation pairs two tuples from different relations if and only if the given join condition is satisfied.
The following sections briefly describe the join types:

Theta (θ) join

θ in a theta join is the join condition. A theta join combines tuples from different relations provided they satisfy the theta condition.
Notation:
R1 ⋈θ R2
R1 and R2 are relations with attributes (A1, A2, ..., An) and (B1, B2, ..., Bn) such that no attribute name is shared, that is, R1 ∩ R2 = ∅. Here θ is a condition in the form of a set of conditions C.
A theta join can use any comparison operator.
Student

SID | Name  | Std
101 | Alex  | 10
102 | Maria | 11

[Table: Student Relation]
Subjects

Class | Subject
10    | Math
10    | English
11    | Music
11    | Sports

[Table: Subjects Relation]
Student_Detail = STUDENT ⋈(Student.Std = Subjects.Class) SUBJECTS

Student_Detail

SID | Name  | Std | Class | Subject
101 | Alex  | 10  | 10    | Math
101 | Alex  | 10  | 10    | English
102 | Maria | 11  | 11    | Music
102 | Maria | 11  | 11    | Sports

[Table: Output of theta join]
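The theta join above (a Cartesian product followed by selection) can be sketched in Python using the Student and Subjects tables:

```python
# Theta join: pair every Student tuple with every Subjects tuple,
# then keep only the pairs satisfying Student.Std = Subjects.Class.
student = [
    {"SID": 101, "Name": "Alex",  "Std": 10},
    {"SID": 102, "Name": "Maria", "Std": 11},
]
subjects = [
    {"Class": 10, "Subject": "Math"},
    {"Class": 10, "Subject": "English"},
    {"Class": 11, "Subject": "Music"},
    {"Class": 11, "Subject": "Sports"},
]

student_detail = [
    {**s, **t}
    for s in student               # Cartesian product...
    for t in subjects
    if s["Std"] == t["Class"]      # ...followed by selection on θ
]
```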

Equi-Join

When a theta join uses only the equality comparison operator, it is said to be an equi-join. The example above corresponds to an equi-join.

Natural Join ( ⋈ )

A natural join does not use any comparison operator, and it does not concatenate the way a Cartesian product does. A natural join can only be performed if at least one common attribute exists between the relations; those attributes must have the same name and domain.
A natural join pairs tuples whose values on the matching attributes are the same in both relations.
Courses

CID  | Course      | Dept
CS01 | Database    | CS
ME01 | Mechanics   | ME
EE01 | Electronics | EE

[Table: Relation Courses]
HoD

Dept | Head
CS   | Alex
ME   | Maya
EE   | Mira

[Table: Relation HoD]
Courses ⋈ HoD

Dept | CID  | Course      | Head
CS   | CS01 | Database    | Alex
ME   | ME01 | Mechanics   | Maya
EE   | EE01 | Electronics | Mira

[Table: Relation Courses ⋈ HoD]
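A sketch of the natural join on the common attribute Dept, using the Courses and HoD relations above:

```python
# Natural join on the shared attribute Dept: the attribute appears once
# in the result, and only tuples with matching Dept values are paired.
courses = [
    {"CID": "CS01", "Course": "Database",    "Dept": "CS"},
    {"CID": "ME01", "Course": "Mechanics",   "Dept": "ME"},
    {"CID": "EE01", "Course": "Electronics", "Dept": "EE"},
]
hod = [
    {"Dept": "CS", "Head": "Alex"},
    {"Dept": "ME", "Head": "Maya"},
    {"Dept": "EE", "Head": "Mira"},
]

head_by_dept = {h["Dept"]: h["Head"] for h in hod}
joined = [
    {**c, "Head": head_by_dept[c["Dept"]]}
    for c in courses
    if c["Dept"] in head_by_dept     # pair only matching Dept values
]
```

Building a lookup table on the join attribute first is essentially a hash join; a nested loop over both relations would give the same result.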

Outer Joins

All the joins mentioned above (theta join, equi-join and natural join) are called inner joins. An inner join includes only tuples with matching attributes; the rest are discarded from the resulting relation. There are also methods by which all tuples of either relation are included in the resulting relation.
There are three kinds of outer joins:

Left outer join ( R ⟕ S )

All tuples of the left relation, R, are included in the resulting relation; for tuples in R without any matching tuple in S, the S-attributes of the resulting relation are made NULL.
Left

A   | B
100 | Database
101 | Mechanics
102 | Electronics

[Table: Left Relation]
Right

A   | B
100 | Alex
102 | Maya
104 | Mira

[Table: Right Relation]
Left ⟕ Right

A   | B           | C   | D
100 | Database    | 100 | Alex
101 | Mechanics   | --- | ---
102 | Electronics | 102 | Maya

[Table: Left outer join output]

Right outer join ( R ⟖ S )

All tuples of the right relation, S, are included in the resulting relation; for tuples in S without any matching tuple in R, the R-attributes of the resulting relation are made NULL.
Left ⟖ Right

A   | B           | C   | D
100 | Database    | 100 | Alex
102 | Electronics | 102 | Maya
--- | ---         | 104 | Mira

[Table: Right outer join output]

Full outer join ( R ⟗ S )

All tuples of both participating relations are included in the resulting relation; where a tuple has no matching tuple in the other relation, the respective unmatched attributes are made NULL.
Left ⟗ Right

A   | B           | C   | D
100 | Database    | 100 | Alex
101 | Mechanics   | --- | ---
102 | Electronics | 102 | Maya
--- | ---         | 104 | Mira

[Table: Full outer join output]
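The outer joins can be sketched in Python using the Left and Right relations above, with None standing in for NULL:

```python
# Outer-join sketch on attribute A, modeling each relation as a dict
# from the join key A to the B value.
left  = {100: "Database", 101: "Mechanics", 102: "Electronics"}
right = {100: "Alex", 102: "Maya", 104: "Mira"}

def left_outer(l, r):
    # Every tuple of l appears; unmatched r-attributes become None (NULL).
    return {a: (l[a], r.get(a)) for a in l}

def full_outer(l, r):
    # Every tuple of either relation appears; missing sides become None.
    return {a: (l.get(a), r.get(a)) for a in l.keys() | r.keys()}

lo = left_outer(left, right)    # 101 pairs with None
fo = full_outer(left, right)    # 101 and 104 each have one None side
```

A right outer join is simply `left_outer(right, left)` with the tuple components swapped, which is why only the left and full variants are written out here.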

Posted By Mirza Abdul Hannan at 3:08:00 pm