Abstract— In earlier systems, data generation was limited and online systems
were rarely relied on for storing data; most records were kept on paper. With
the rise of technology, a digital revolution has taken place. Industries
ranging from hospitals to small businesses to social media websites now work
on the principle of sending data to a database and on to servers, then
receiving a response message from the server that indicates whether the data
was processed successfully or unsuccessfully. The drawback is that handling
this enormous volume of data requires high-end servers and infrastructure. To
handle such data, the concept of Big Data has come into the limelight. Big Data
has developed rapidly and achieved success in almost every field. It helps to
keep data synchronized in the order in which actions or processing occur. The
implementation of Big Data has to be done through simulation, a concept
associated with Artificial Intelligence. It is thus evident that data storage
has migrated from manual processing to artificial intelligence. This paper aims
to study the simulation of Big Data at a large scale and how it can help to
store enormous amounts of data virtually, effectively, and efficiently.
Keywords—Data; Simulation; Storage
I. Introduction
The concept of Big Data is emerging and paving its way through the
technological era. It is useful because it serves many purposes related to data
storage. There are 7.6 billion people in the world (as of December 2017)
surfing over 1 billion websites (as of October 2014, confirmed by Netcraft).
From individuals to large organizations, there are “n” active users who work
with data. Every second, “n” items of data are created, modified, deleted, or
simply accessed. This is possible because of huge servers, each of which is
usually a collection of a large number of databases 1, 4.
to American Express, there is a virtual communication between a system and the
server in request-response mode. A request from the user is sent to the server,
the server processes the request against the database, and the database then
returns a response to the server, which forwards it to the user in a readable
format. Every time data or a request is sent, it is tagged with a unique value
generated at that moment, much like an OTP (One-Time Password). The data may
arrive in any format, so it is converted into structures (such as queues,
deques, stacks, and circular queues). The data is then permanently stored in
the database. The database can be understood as a data centre or a central
repository where all the generated data is compiled in its
II. Literature Review
Author 9 concluded that complex applications running on high-end systems
generate an enormous amount of data that must be managed and analysed to gain
insights. Data costs (performance, latency, energy) are increasingly dominant,
and traditional data management methods are becoming inadequate. He also noted
the many challenges that Big Data faces, such as programming, mapping and
scheduling, control and data flow, and automatic runtime management. To address
these challenges, hybrid data staging, in-situ workflow execution, and dynamic
code deployment can be used. Lee et al. 8 reported that, in their experiments,
ARLS (After action Reviewer for Large-scale Simulation data) improved data
processing time considerably compared to traditional tools. Researchers 10
suggested that Big Data technology can also be used to study complicated
problems relating to weapon systems acquisition, combat analysis, and military
training. They reviewed several large-scale military simulations producing Big
Data for a variety of
Lange et al. 2 introduced the VELaSSCo project. They noted that simulations
produce exponentially growing volumes of data, which can no longer be stored
with existing IT systems. VELaSSCo therefore aimed to develop new concepts for
integrated end-user visual analysis with advanced management and
post-processing algorithms for engineering applications, with special attention
to scalable, real-time, petabyte-level simulation. Yuan et al. 11 focused on
microscopic traffic simulation. They proposed a cross-simulation method that
applies mass data collected in normal situations to large-scale traffic
evacuations, providing better supporting information for emergency decisions.
III. Objectives of The Study
Large-scale simulation is becoming an increasingly important tool for
evaluating Big Data because of the increasing complexity of the Internet.
Parallel and distributed simulation is one of the most appropriate approaches
to the simulation scalability problem, but it requires expensive hardware and
has high overhead. How simulation can be applied effectively in large-scale
industries is therefore a matter for study.
The basic objectives of the study are as follows:
To create a deep understanding of Big Data at large scale.
To understand how simulation can work best for Big Data.
IV. Discussion on Big Data
Let us take a simple example. You go to a hotel. You order some food from the
waiter, say Chowmein and Manchurian. The waiter takes the order to the kitchen.
The kitchen receives the order and prepares the food. The kitchen gives the
food to the waiter, and your order is then served at your table. This is the
simple way in which request-response communication takes place.
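The request-response cycle described above can be sketched in a few lines of
Python. This is a minimal illustration rather than a real server: the class
names Request, Response, Database, and Server are hypothetical, and uuid4
merely stands in for the unique value that is attached to each request.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Request:
    payload: str
    # Unique value generated when the request is created (assumed: a UUID).
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)


@dataclass
class Response:
    request_id: str
    ok: bool
    message: str


class Database:
    """In-memory stand-in for the central repository."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def store(self, key: str, value: str) -> None:
        self._rows[key] = value

    def fetch(self, key: str) -> Optional[str]:
        return self._rows.get(key)


class Server:
    """Plays the waiter: passes requests to the database and reports back."""

    def __init__(self, database: Database) -> None:
        self.database = database

    def handle(self, request: Request) -> Response:
        if not request.payload:
            return Response(request.request_id, False, "rejected: empty request")
        self.database.store(request.request_id, request.payload)
        return Response(request.request_id, True, "processed successfully")


server = Server(Database())
order = Request("Chowmein and Manchurian")
reply = server.handle(order)
print(reply.ok, reply.message)
```

The response carries the same identifier as the request, which is what lets
the user match a reply to the order that produced it.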
Let’s understand how data is generated. Take the example of Facebook, a social
media channel. Worldwide, there were over 2.07 billion monthly active Facebook
users as of 1st November, 2017. There are 1.15 billion mobile daily users and
1.37 billion daily users who log onto Facebook. Every 60 seconds on Facebook,
510,000 comments are posted, 293,600 statuses are updated, and 136,000 photos
are uploaded (Source: The Social Skinny). Just imagine the quantity of data
generated per second, per minute, per hour, per day, per week, per month, and
per year. It is simply enormous. Among other social media channels, Twitter has
284 million active users, WhatsApp has 500 million active users, Instagram has
600 million active users, and so on 5.
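To make the scale concrete, the per-minute figures quoted above can be
extrapolated to a full day. The sketch below assumes, purely for illustration,
that the quoted rates hold constant around the clock; the names PER_MINUTE and
scale are hypothetical.

```python
# Per-minute activity figures quoted in the text (Source: The Social Skinny).
PER_MINUTE = {"comments": 510_000, "status_updates": 293_600, "photos": 136_000}
MINUTES_PER_DAY = 24 * 60


def scale(per_minute, minutes):
    """Extrapolate per-minute counts to a longer period at a constant rate."""
    return {name: count * minutes for name, count in per_minute.items()}


per_day = scale(PER_MINUTE, MINUTES_PER_DAY)
for name, count in per_day.items():
    print(f"{name}: {count:,} per day")
```

Under that assumption, the daily totals come to about 734 million comments,
423 million status updates, and 196 million photos.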
To handle this enormous data, something more developed and efficient than the
traditional methods of data storage was required. Earlier, the data generated
was confined to a few gigabytes (GB). Now, millions of terabytes (TB) of data
are generated, and the data has to be processed instantly. So, the concept of
Big Data came into play. Data scientists often use N-V (volume, velocity,
variety, value, veracity, etc.) dimensions to understand Big Data. Fig. 1 gives
a graphical overview of the important factors that affect Big Data system
architecture.
Fig. 1 Various Aspects for Data
Big Data is growing day by day, which makes it difficult to estimate its
definite size. Its size varies across fields and expands over time. Many
Internet companies generate terabytes of new data every day, and the databases
in use sometimes exceed a petabyte. Large volume is thus one of the basic
features of Big Data. According to the International Data Corporation (IDC),
the volume of digital data in the world reached zettabytes (1 zettabyte = 2^30
terabytes) in 2013; furthermore, it was projected to double every two years
until 2020.
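The unit conversion and the doubling claim can be checked with a short
calculation. The sketch below uses binary units (1 zettabyte = 2^30 terabytes)
and treats "doubling every two years" as smooth exponential growth, which is a
simplifying assumption; growth_factor is a hypothetical helper name.

```python
TERABYTES_PER_ZETTABYTE = 2 ** 30  # binary units: 1 ZB = 2^30 TB


def growth_factor(start_year, end_year, doubling_period_years=2.0):
    """Multiplicative growth between two years for a fixed doubling period."""
    return 2 ** ((end_year - start_year) / doubling_period_years)


factor_2013_to_2020 = growth_factor(2013, 2020)
print(f"1 ZB = {TERABYTES_PER_ZETTABYTE:,} TB")
print(f"Growth from 2013 to 2020: about {factor_2013_to_2020:.1f}x")
```

Seven years at one doubling every two years is 3.5 doublings, or roughly an
elevenfold increase over the 2013 volume.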
Data explosion also means that data is generated very quickly, so it needs to
be processed in a timely manner. For many e-commerce applications, the
processed information must be available within a few seconds; otherwise the
value of the data diminishes.
Data comes in a variety of types and formats. Traditional structured data saved
in a database has a regular format, such as date, time, amount, and string,
whereas unstructured data such as text, audio, video, and images are the main
formats of Big Data. Unstructured data can also be any custom web page, blog,
photo, product comment, questionnaire, application log, sensor data, etc. Such
unstructured data may express human-readable content but cannot be processed
directly by a machine. Usually, it is saved in a file system or a NoSQL (Not
Only SQL) database, which has a simple data model such as key-value pairs.
Variety may also mean that Big Data comes from various sources. For example,
the data for traffic analysis may come from fixed network cameras or from bus,
taxi, or subway sensors.
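The key-value model mentioned above is simple enough to sketch directly. The
following is an in-memory illustration only, not the API of any particular
NoSQL product; KeyValueStore and the key names are hypothetical. Values are
stored as opaque bytes, which is what lets structured and unstructured data
sit side by side.

```python
import json
from typing import Dict, Optional


class KeyValueStore:
    """Toy key-value store: the store never inspects the values it holds."""

    def __init__(self) -> None:
        self._items: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value

    def get(self, key: str) -> Optional[bytes]:
        return self._items.get(key)


store = KeyValueStore()
# Structured record (date, amount) serialized as JSON.
store.put("order:1001", json.dumps({"date": "2017-11-01", "amount": 250.0}).encode("utf-8"))
# Unstructured free text, such as a product comment.
store.put("comment:42", "Great product, fast delivery".encode("utf-8"))
# Raw binary payload, such as a sensor reading.
store.put("sensor:bus-7", bytes([0x01, 0x7F, 0x10]))

order = json.loads(store.get("order:1001").decode("utf-8"))
print(order["amount"])
```

Interpreting a value is left to the reader of the data, which is the trade-off
such a simple model makes in exchange for flexibility.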
Data can produce valuable insight for its owner. The insight is valuable
because it helps to predict the future, create new opportunities, and reduce
risk or cost. In this way, Big Data can change or improve people’s standard of
living.
Veracity means that only trustworthy data should be used; otherwise the
decision maker may draw incorrect conclusions that lead to wrong decisions. For
example, online customer reviews are important for ranking a commodity, and if
some people post fake comments or deceptive scores for profit, the result will
negatively influence the ranking and the customer’s choice.
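The effect of untrustworthy reviews on a ranking score can be shown with a
small, made-up example. The scores and the verified flag below are invented for
illustration, and treating "verified" as a proxy for trustworthiness is an
assumption, not a method taken from the study.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass(frozen=True)
class Review:
    score: int      # rating from 1 to 5
    verified: bool  # assumed marker of a trustworthy review


def average_score(reviews: List[Review], trusted_only: bool) -> Optional[float]:
    """Average rating, optionally ignoring reviews that are not verified."""
    scores = [r.score for r in reviews if r.verified or not trusted_only]
    return mean(scores) if scores else None


reviews = [
    Review(2, True), Review(3, True), Review(2, True),     # genuine customers
    Review(5, False), Review(5, False), Review(5, False),  # suspected fakes
]
naive = average_score(reviews, trusted_only=False)
trusted = average_score(reviews, trusted_only=True)
print(f"all reviews: {naive:.2f}, trusted only: {trusted:.2f}")
```

With the unverified reviews included, the product looks considerably better
than the trusted reviews alone suggest, which is how deceptive scores distort a
ranking.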
V. Simulation in Big Data
Big Data is a concept that goes “Beyond Data”. From simple criminal records to
GPS on mobile phones to Google Translate, each requires the use of Big Data. In
short, it has become part of daily life in every field. To process Big Data,
simulation, another concept derived from artificial intelligence, is required.
Simulation helps to recreate the data from the server or the database wherever
and whenever it is asked for. It is useful in decision making but is somewhat
difficult for a layperson to apply. Simulation models consist of very simple
agents whose interactions range from small numbers to large numbers. Simulation
provides the “right” information for making better decisions: what to predict
and how to influence outcomes. If we conduct a simulation of a “traffic and
infrastructure system with reference to business”, which is considered a
large-scale simulation, it comes down to three basic questions.
1. What is the best preparation for a major
2. How do changing travel patterns affect
3. Which incentives give young people greater
access to the housing market?
Simulation is a representation of the functioning of a system or process.
Through simulation, a model may be implemented with unlimited variations,
producing complex scenarios. These capabilities allow analysis and
understanding of how individual elements interact and affect the simulated
environment. An example of a simulation is a three-dimensional model of an
armored vehicle that moves across a model of terrain over time. The tool that
executes the simulation is a simulator. The various types of simulation are as
follows:
Live Simulations – Real people operate in the real world.
Virtual Simulations – Real people operate in synthetic worlds.
Constructive Simulations – Simulated entities operate in synthetic worlds.
Undefined Simulations – Simulated entities are subjected to real world environments.
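The four categories above differ along two axes: whether the participants are
real or simulated, and whether the environment is real or synthetic. A minimal
sketch of that classification follows; SimulationKind and classify are
hypothetical names chosen for illustration.

```python
from enum import Enum


class SimulationKind(Enum):
    LIVE = "live"
    VIRTUAL = "virtual"
    CONSTRUCTIVE = "constructive"
    UNDEFINED = "undefined"


def classify(real_people: bool, real_world: bool) -> SimulationKind:
    """Map the two axes (participants, environment) to a simulation category."""
    if real_people:
        return SimulationKind.LIVE if real_world else SimulationKind.VIRTUAL
    return SimulationKind.UNDEFINED if real_world else SimulationKind.CONSTRUCTIVE


for people in (True, False):
    for world in (True, False):
        print(people, world, classify(people, world).value)
```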
Simulation can be applied to experimentation, operational planning, training,
mission rehearsal, support to the conduct of operations, and life cycle
management. Fig. 2 displays the process for a simulation study analysis.
Fig. 2 Flow Chart of the Simulation Process
If we consider the data lifecycle, we can divide the simulation process into
three consecutive phases: data generation, data management, and data analysis.
Data generation deals with the kind of data that should be created and ways to
create valid data in the given amount of time. Data management is concerned
with collecting large amounts of data without disturbing the normal simulation
and with providing available storage and efficient processing capability. Data
analysis makes use of various analytic methods to extract value from the
simulation results 3.
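The three phases can be sketched as a small pipeline. Everything in the
example is illustrative: the "vehicles" field, the batch size, and the names
generate, RecordStore, and analyze are assumptions, and the random values
stand in for real simulation output.

```python
import random
from statistics import mean
from typing import Dict, Iterator, List


def generate(steps: int, seed: int = 0) -> Iterator[Dict[str, int]]:
    """Data generation: emit one record per simulated time step."""
    rng = random.Random(seed)
    for step in range(steps):
        yield {"step": step, "vehicles": rng.randint(0, 100)}


class RecordStore:
    """Data management: buffer records and persist them in batches,
    so that collection does not interrupt the running simulation."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.batches: List[List[Dict[str, int]]] = []
        self._buffer: List[Dict[str, int]] = []

    def collect(self, record: Dict[str, int]) -> None:
        self._buffer.append(record)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self.batches.append(self._buffer)
            self._buffer = []

    def records(self) -> List[Dict[str, int]]:
        return [record for batch in self.batches for record in batch]


def analyze(records: List[Dict[str, int]]) -> Dict[str, float]:
    """Data analysis: reduce the stored records to summary values."""
    return {"count": len(records), "mean_vehicles": mean(r["vehicles"] for r in records)}


store = RecordStore(batch_size=4)
for record in generate(steps=10):
    store.collect(record)
store.flush()  # persist any records left in the buffer
summary = analyze(store.records())
print(summary)
```

Batching in the management phase is one simple way to keep storage work off
the critical path of the simulation loop.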
A. Challenges in Big Data Simulation
The major problem with Big Data is that it is beyond the grasp of humans. Big
Data keeps getting bigger, and we are getting trapped in our own data. The rise
of ubiquitous and ever more numerous “extremes”, each communicating in its own
feedback loop with the cloud, is increasing data exponentially. The extremes,
which include wearables, handhelds, and mobile sensors, are of little use
without servers in the cloud to communicate with. So, one can imagine how much
of the Internet of Things depends on the back end, or “the cloud”. Without
human input, these machines in the cloud are like insensitive people, even when
they are in clusters of thousands and easy to reach 9.
A decade ago, Google developed a way, later cloned by Yahoo, to spread data out
across huge commodity clusters, which made it possible to mine big datasets
cost-effectively on an ad-hoc batch basis. That method has evolved into Hadoop.
On the conventional database front, there are ways to perform analytics using
non-relational and modified relational database technologies. Only a small
fraction of the population is capable of using these methods to make sense of
Big Data at all. Though there are many layers of understanding that humans have
to build on the data they’re generating, only the surface of each layer is
accessible to the population at large. A lot of work is still required, most of
it at the base. So, we can picture the dilemma here as a stack of challenges.
Among these challenges are the following:
means to identify what data is generated.
It gives efficient ways to find the specific data that can
help one to use it for the better.
Modeling and simulation: These are the intelligent ways which help in
modeling the problems Big Data can solve so that human inputs can result in
are the effective and efficient ways to understand the data so that it’s
relevant to specific individuals and groups.
These are the operative ways to analyze and visualize the results of the
streaming and processing: These let us explore efficient ways of taking human
inputs and acting on streams of Big Data to extract insights from them.
These challenges are just the surface of the problem. There are sub-challenges
under each challenge, and each requires its own level of understanding. Because
the larger problem keeps growing in scope, we are inefficient at targeting
resources to solve specific Big Data challenges. Each individual working on the
problem typically sees only a few pieces of it. And then there is the task of
understanding what humans want and need to begin with, or what the world needs
to sustain life at scale. After all, those are the more fundamental problems we
are all trying to deal with 6, 7.
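The cluster-based batch method mentioned earlier in this section, which
evolved into Hadoop, follows a map-and-reduce pattern: work is split into
chunks, each chunk is processed independently, and the partial results are
grouped and combined. The sketch below imitates that pattern in a single
Python process. It does not use Hadoop itself, and the function names and
sample chunks are hypothetical.

```python
from collections import defaultdict
from itertools import chain
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Map: turn one chunk of input into (key, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share a key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: combine the grouped values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


# Each chunk stands in for the share of data held by one cluster node.
chunks = ["like photo like", "comment like photo"]
mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each chunk is mapped independently, the map step is the part that a
real cluster can spread across many machines.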
VI. Research Methodology
In our research, we consider one of the largest social media websites,
Facebook. Worldwide, there were over 2.07 billion monthly active Facebook users
as of 1st November, 2017. There are 1.15 billion mobile daily users and 1.37
billion daily users who log onto Facebook. Every 60 seconds on Facebook,
510,000 comments are posted, 293,600 statuses are updated, and 136,000 photos
are uploaded (Source: The Social Skinny). We took a sample of 25 people in
Lucknow who are daily active users of Facebook. We considered two variables:
1. Number of Like actions in a day
2. Amount of data generated.
The study was conducted on consumers in the 25 to 35 age group in Lucknow City.
A total sample size of 50 was chosen. A well-designed questionnaire with
questions relevant to the objectives of the study was prepared for collecting
primary data from respondents. The questionnaire covered various aspects,
including Like actions and Photo uploading actions. A convenience random
sampling approach was adopted for covering the entire
VII. Experimental Result
From the data we collected through interview sessions with these 25 people, we
saw that, on average, a person performs 30 Like actions in a day. One Like
action on Facebook generates roughly 5 megabytes of data on the Facebook
server. (This is calculated from the fact that 2.7 billion Like actions
generate 105 terabytes of data each half hour.) Converting this average into a
total, the number of Like actions by the 25 people is 950.
If 25 people are generating 5 megabytes of data per day on a single website, it
is hard to imagine how much data is generated by 7.6 billion people in the
world surfing over 1 billion websites. So, to manage this data, the Big Data
concept has been applied. Simulating this Big Data at a large scale is worth
its cost.
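The arithmetic behind these estimates can be reproduced from the stated inputs.
The sketch below takes the inputs at face value (2.7 billion Like actions
producing 105 terabytes, 25 respondents, 30 Like actions each per day) and
assumes decimal terabytes (10^12 bytes); the variable names are chosen for
illustration.

```python
TERABYTE = 10 ** 12  # assumption: decimal terabytes

# Basis stated in the text: 2.7 billion Like actions generate 105 TB.
likes_in_basis = 2.7e9
bytes_in_basis = 105 * TERABYTE
bytes_per_like = bytes_in_basis / likes_in_basis

# Sample described in the text: 25 people, 30 Like actions per person per day.
sample_size = 25
likes_per_person_per_day = 30
total_likes = sample_size * likes_per_person_per_day
sample_bytes_per_day = total_likes * bytes_per_like

print(f"{bytes_per_like / 1e3:.1f} KB per Like action")
print(f"{total_likes} Like actions per day in the sample")
print(f"{sample_bytes_per_day / 1e6:.1f} MB per day in the sample")
```

Taken at face value, these inputs give roughly 39 kilobytes per Like action
and 750 Like actions for the sample, or about 29 megabytes per day. This
differs from the 5 megabytes and 950 quoted above, so the reported figures are
best read as rough estimates.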
Storing, querying, and maintaining Big Data is extremely expensive. For data to
be useful as Big Data, it should meet three requirements: high volume, high
velocity, and wide variety. Digital product companies such as Facebook, eBay,
PayPal, etc., must store and recall every record of data to deliver their
products. For them, Big Data is a necessity. Moreover, their data meets the
three criteria: millions of users generate data of different file types that
needs to be queried. A company that does not use Big Data to create its core
products may find it hard to justify the cost. Querying, storing, and
maintaining large sets of data is expensive and time-consuming. Unless it is
absolutely necessary and those three conditions are met, Big Data is not worth
the cost.
1 Zeigler, B. P., Sarjoughian, H. S., Duboz,
R., & Souli, J. C. (2013). Guide to modeling and simulation of systems of
systems. Springer London.
2 Lange, B., & Nguyen, T. (2014). Big Data
architecture for large-scale scientific computing. In world academy of
3 Bowman, C. N., & Miller, J. A. (2016).
Modeling traffic flow using simulation and Big Data analytics. In Winter
Simulation Conference (WSC), pp. 1206-1217, IEEE.
4 Riedel, E., Faloutsos, C., Gibson, G. A.,
& Nagle, D. (2001). Active disks for large-scale data processing. Computer,
5 Simos, G. (2015). How Much Data is Generated
Every Minute on Social Media. We Are Social Media, 19.
6 Kołodziej, J., González-Vélez, H., & Wang,
L. (2014). Advances in data-intensive modelling and simulation.
7 J. Manyika, M. Chui, B. Brown et al., Big
Data: The Next Frontier for Innovation, Competition, and Productivity.
8 Kangsun Lee, Joonho Park, A Hadoop-Based
Output Analyzer for Large-Scale Simulation Data.
M. (2014). Big data challenges in simulation-based science. In [email protected] HPDC (pp.
10 Song, X., Wu, Y., Ma, Y., Cui, Y., &
Gong, G. (2015). Military simulation big data: background, state of the art,
and challenges. Mathematical Problems in Engineering.
11 Yuan, S., Liu, Y., Wang, G., & Zhang,
H. (2014). A cross-simulation method for large-scale traffic evacuation with
big data. In International Conference on Web-Age Information Management, pp.