Big Data Lab

 

Objectives:

 

In today’s world, where data is often called the new oil, big data can be thought of as the system that refines that crude oil into fuel. Big Data refers to large volumes of structured, semi-structured, or unstructured data. The data pool is so voluminous that it becomes difficult for an organization to manage and process it with traditional databases and software techniques. Big data therefore implies not only the enormous amount of available data but also the entire process of gathering, storing, and analysing it. Today’s business enterprises are data-driven, and without data no enterprise can sustain a competitive advantage. Big Data is now so pervasive that it is easier to count the companies that are not deploying it.

 

The Big Data Lab was set up to educate students in all aspects of large and distributed information systems and to prepare them for highly skilled jobs in emerging, fast-growing IT fields such as cloud computing, health care informatics, finance, data integration, and data analytics.

 

Setting up a Big Data Lab involves a combination of hardware, software, and networking components. Here is a breakdown of the Lab's hardware:

Processor: Intel Core i7-9700 (3 GHz, 12 cores per processor), B660; model name PLEXTEK DESKTOP MQ-765; 20 systems in total

RAM: 16 GB DDR4, 2666 MHz, expandable up to 64 GB using spare DIMM slots

Hard Disk: 1 TB (1000 GB) HDD, 7200 RPM

SSD: 256 GB

Graphics Card: 2 GB NVIDIA® GeForce GT 710, plus integrated Intel HD Graphics

Peripherals: optical mouse, keyboard, 21.5" LED-backlit monitor (1920 × 1080 resolution)

Here's a breakdown of the software present in the Lab:

Hadoop Distribution:

Cloudera:

Cloudera provides distributions of various open-source big data technologies, including Apache Hadoop, Apache Spark, Apache Hive, Apache HBase, Apache Kafka, and others. These open-source components are freely available and can be used without any licensing fees. 

Apache Hadoop:  

Apache Hadoop is free and open-source software distributed under the Apache License 2.0.

Hadoop Ecosystem Tools: Various tools that complement Hadoop are installed, such as:

  • HDFS (Hadoop Distributed File System): For distributed storage.  

  • YARN (Yet Another Resource Negotiator): For cluster management and job scheduling.

  • MapReduce: For parallel processing of large datasets. 

  • Spark: For fast and general-purpose cluster computing (see the sketch after this list).

  • Hive: For data warehousing and SQL-like querying. 

  • HBase: For NoSQL database capabilities. 

  • Pig: For data flow scripting and analysis. 

  • Kafka: For real-time data streaming.    

  • Sqoop: For transferring data between Hadoop and relational databases. 

  • Flume: For ingesting log data into Hadoop.

  • Oozie: For workflow scheduling and coordination.   
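
To make these components concrete, here is a minimal PySpark sketch of the classic word count, the canonical MapReduce example, expressed through Spark's RDD API. It assumes PySpark is available on the lab machines; the HDFS input path is a placeholder for illustration, not an actual lab path.

    from pyspark.sql import SparkSession

    # Start a Spark session; "local[*]" runs on all local cores,
    # whereas a lab cluster would typically submit the job to YARN instead.
    spark = SparkSession.builder \
        .appName("WordCount") \
        .master("local[*]") \
        .getOrCreate()

    # Read a text file from HDFS. The path below is a placeholder.
    lines = spark.sparkContext.textFile("hdfs:///user/student/input.txt")

    # Classic MapReduce pattern: map each word to (word, 1),
    # then reduce by key to sum the counts.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Print the first ten (word, count) pairs.
    for word, count in counts.take(10):
        print(word, count)

    spark.stop()

Running this with spark-submit (or a local Python interpreter with PySpark installed) prints the first ten (word, count) pairs from the input file.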

Practical Project Experiments

1. Building chatbots

2. Credit card fraud detection

3. Fake news detection

4. Forest fire prediction

5. Classifying breast cancer

6. Driver drowsiness detection

7. Recommender systems

8. Sentiment analysis

9. Exploratory data analysis

10. Gender detection and age detection

11. Recognizing speech emotion

12. Customer segmentation (a minimal sketch follows this list)
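
As one illustration of how these projects map onto the lab's software stack, here is a minimal customer-segmentation sketch using k-means clustering from Spark MLlib. The income and spending-score figures are toy values invented for illustration; a real experiment would load customer records from HDFS or Hive.

    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.appName("CustomerSegmentation").getOrCreate()

    # Toy customer records (annual income, spending score), invented for
    # illustration; a real experiment would load records from HDFS or Hive.
    data = spark.createDataFrame(
        [(15.0, 39.0), (16.0, 81.0), (17.0, 6.0),
         (88.0, 77.0), (90.0, 13.0), (93.0, 90.0)],
        ["income", "score"])

    # MLlib expects a single vector column of features.
    features = VectorAssembler(inputCols=["income", "score"],
                               outputCol="features").transform(data)

    # Group the customers into three segments with k-means.
    model = KMeans(k=3, seed=1).fit(features)

    # Each row gains a "prediction" column holding its segment id.
    model.transform(features).select("income", "score", "prediction").show()

    spark.stop()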


Coordinator:

Dr Sharmistha Bhattacharjee

Scientist-D

sharmisthab[at]nielit[dot]gov[dot]in
