I. Abstract:
Ransomware is a type of malware that has become a significant threat to businesses and individuals, especially over the past few years. It encrypts the files on an infected system or network and demands a ransom, usually in bitcoins, to unlock them. Its damage costs are predicted to hit $11.5B by 2019. To protect users’ vital data from this attack, in this work we deployed newer technologies, intended to be more robust, efficient and accurate, that detect malicious activity on a system using different indicators. These include analysing the user’s data on data processing platforms such as Hadoop and R, and with machine learning techniques. They were tested with the aim of alerting the user before a significant amount of information is lost, which limits the data loss, and of reducing the number of erroneous results by providing the user with details that can be used to flag a file as either safe or unsafe.

II. Introduction:
• How Ransomware spreads:
Ransomware, as stated above, is a kind of malware that prevents users from accessing their files and demands a ransom in exchange for decrypting them. These malicious programs mostly spread by tricking users into clicking on pop-ups that appear to be safe. Once such a spurious pop-up is clicked, a ransomware program is installed on the system and looks for files with extensions such as JPG, XLS, PNG, PPT, DOC, etc. These are generally the important files on any computer system. The installed program forces the user to pay the perpetrators a sum that varies from case to case, generally in the form of cryptocurrency. The people responsible for spreading ransomware take care to keep their identity secret, and to do so they make sure that no one can trace the payment they receive. Attackers generally use Tor to hide their location. In addition, ransomware also spreads via conventional email.
More than 60 percent of ransomware spreads via email (specifically as a Microsoft Word document or a .ZIP file). According to Cisco Systems’ 2017 Annual Cybersecurity Report, 65 percent of email traffic is spam, and about 10 percent of the global spam observed in 2016 was classified as malicious.
• Financial damages due to ransomware:
Businesses as well as individuals need to be fully aware of the threat posed by ransomware and make cybersecurity a top priority. According to Kaspersky, at least 3 companies are hit by some type of ransomware every 2 minutes. Moreover, attacks on businesses increased three-fold in 2016. Ransomware attacks can disrupt important systems and destroy confidential data. According to some reports from Microsoft, ransomware accounted for $325 million in damage. Cybersecurity Ventures predicted the cost of damage to be $1 billion in 2016, and Cisco’s 2017 Annual Cybersecurity Report puts the annual growth of ransomware at 3.5 times.
Apart from the financial impact, there is temporary or permanent loss of sensitive or proprietary data, and regular operations are disrupted. At the organizational level, an attack can harm the organization’s reputation. Paying the ransom does not guarantee that the encrypted files will be decrypted, nor can it be assumed that the malware infection has been completely eradicated from the computer system.
Some information relevant to the work:
Ransomware variants can be loosely classified into the following three categories:
1. Ransomware attacks of this kind can be called denial-of-service attacks, since the legitimate user of the system is locked out of their files and of any other activity until a particular code is texted to an SMS provider that charges the user premium rates.
Sometimes the attack appears to come from a legal authority or from the provider of the user’s operating system. The victim can be asked to pay via online payment systems. Attacks of this kind do not generally damage the files on the system. Below is the image of one such ransomware that we developed.
2. Another type of ransomware may or may not lock access to the system but encrypts all the personal and vital files and folders of the victim. Since the malware uses complex encryption algorithms, it is difficult to decrypt the files without paying the attacker a hefty ransom to obtain the decryption key. Sometimes it may delete files as well.
3. This type of ransomware is believed to be the most dangerous because, in addition to the two kinds of damage above, it also infects the boot mechanism of the operating system. On switching on the system, the victim has to follow the instructions given in the ransom note.
When these types of malware enter a computer system, it is often difficult to detect them and respond in time, since many new variants are designed every day, each of which exhibits different behaviour. This makes it difficult to design a tool that can resist something that changes its characteristics rapidly and behaves differently every time. Moreover, it is difficult to differentiate them from safe software that sometimes behaves the way a ransomware infection would. In our work, the focus is on detecting the files that cause the first and second types of ransomware attack. Therefore, this work contributes towards:
1. Identifying four indicators: All these indicators were identified on the basis of how different ransomware behaves on a file system. Each indicator was designed to analyze a particular kind of conduct, such as finding destructive content in target files or source code, or analyzing the type of files.
Other indicators keep a check on data integrity, uncommon read/write behaviour and file deletions. Each of these indicators is explained in the next section.
2. Protecting from unseen malware attacks: Because machine learning offers more dynamic techniques, namely its classification and prediction models, it becomes easier to promptly detect malware that the system has not encountered before.
3. Minimizing the amount of data loss: When all these indicators work together, they can alert the user at an early stage to any harmful activity being carried out on the system and to what is causing it.
4. Safely differentiating between benign and harmful files: After the files are checked for harmful content or destructive actions on the user’s file system, which trigger the corresponding indicators, they can be further sorted into a ‘safe’ or ‘unsafe’ category by using a classification algorithm (hypothesis testing) and by giving the user control to review a file’s contents before it is classified.

III. Detection Mechanisms and implementation:
Analyzing files for malicious contents:
Hadoop, a Java-based programming framework, was used to analyse the contents of the files in a documents directory consisting of 150 files. The directory consisted of 70% XML files, 10% xsl files and 20% source code files generated by various application programs on the computer system. Hadoop was chosen as the platform for these operations for a number of reasons:
It pre-processes large data sets by removing unrelated words and excluding the less frequent ones, which results in faster data processing compared with traditional ways of accessing files and searching for patterns.
Since traditional databases and warehouses read data in 8k or 16k block sizes, they become inefficient when processing large data sets. Hadoop, on the other hand, has proved to work best with semi-structured and unstructured data sets.
Most of Hadoop’s algorithms follow a two-stage paradigm, and the one used in our work, MapReduce, makes processing easy even when the data set is very large.
In this approach, the MapReduce algorithm was deployed on the set of input files described above. A rigorous search was made for a string or for particular words believed to be malicious, which successfully detected the location of these words by giving the path of the file in which they were most frequently used. It shows various forms of occurrence of the same word as shown below. After running this algorithm several times on all the files for different words, a value is generated for each word that helps us identify the level above which this indicator should be triggered, stating that malicious files might be present in the directory and giving the location of such files.
Data Integrity:
Data integrity is a fundamental component of information security. Like a conventional filesystem, Hadoop HDFS offers filesystem consistency and integrity checks. The fsck command was used to find out whether the HDFS system has any corrupt blocks. Data integrity is breached when any unintended change is made to files containing critical information. Since all the ransomware described in category 2 above attacks data integrity, in that it changes the data using complex encryption algorithms, this indicator plays a major role in the detection mechanism. To keep a check on this kind of activity, the MD5 algorithm was implemented in C# for all the files in the directory, with a ‘salt’ value that is known only to the sender and the receiver. The output of this algorithm is a set of hash values. Hash values help in checking data integrity because of the following properties:
The length of the hash value does not depend on the size of the file.
The MD5 algorithm produces a hash value of 128 bits.
Even if two files differ by only a single bit, they translate into completely different hash values, so it is very unlikely that two different files produce identical hash values by accident.
The same hash value is generated every time the algorithm is run on the same file.
Given the hash of a message with a ‘salt’ value, it is computationally infeasible to recover the original contents of the message.
Therefore, carefully examining the output after every action performed on these files by an unknown program indicates that the data is intact if the same hash values are produced. Conversely, a large number of different hash values being produced at a fast rate is an indicator of a malicious program attempting to encrypt the data. A safer solution in Hadoop is to maintain duplicate copies on the Hadoop Distributed File System, i.e., data redundancy. But these blocks can get corrupted too. After getting all the information related to the affected files and blocks, we run commands in a loop to locate the server until all the files in the corrupted list are located. Checking the datanode logs then traces where the problem occurred.
A machine learning approach:
In this approach, a classification algorithm (decision tree) is used to analyze and differentiate between a given set of malicious and benign files. The dataset analyzed is the Detect Malicious Executable (Antivirus) Data Set from the UCI repository. The training file is created with more than 100 non-malicious examples and more than 250 malicious samples. As a sign convention, +1 stands for non-malicious samples and -1 for malicious samples. Based on a rigorous comparison and analysis (as mentioned in figure 1), the 500 most commonly occurring features are extracted. The testing file, on the other hand, consists of an unknown malicious executable and goes through a similar procedure (refer figure 2).
Using decision tree techniques on this dataset, we estimate the probability of a file being malicious. We believe this approach is the most rigorous and robust of the techniques used: it not only helps in classifying the existing files by attentive analysis of system behaviour but, because of its ability to learn without being explicitly programmed, it can also predict whether a new set of files is malicious or benign when exposed to an unknown set of input characteristics. This technique thus helped in reducing malware threats to a significant extent.
File activity monitoring:
When detecting ransomware, one wants to know which files are being encrypted and to be alerted whenever a privileged user accesses sensitive files. For this reason, it is important to monitor the event log or the file system log of the operating system. In this case, the System Log Viewer (syslog) on Ubuntu 16.04 was used to view file activity. Linux logs a large number of events to disk, where they are mostly stored in plain text in the /var/log directory. Most log entries go through the system logging daemon, syslogd, and are written to the system log. These log entries usually consist of a timestamp, user name, file name, operation (create, read, modify, rename, delete, etc.), and a result (success or failure). This information is therefore analyzed to determine a threshold value for acceptable operations on a file. Any activity that occurs more often than the expected or threshold value is flagged, and its details are presented to the user to review whether it is harmless or malicious.
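
The thresholding step described above can be sketched as follows. This is a minimal illustration in Python, not the implementation used in this work. The `Event` record layout, the `flag_excessive_activity` helper, the fixed 60-second counting window, the choice of watched operations and the threshold of 5 are all assumptions made for the example, and the sample events are made up. Real syslog lines would first need to be parsed into this form.

```python
from collections import Counter, namedtuple

# Hypothetical record layout mirroring the fields listed in the text
# (timestamp, user name, file name, operation, result).
Event = namedtuple("Event", "timestamp user filename operation result")


def flag_excessive_activity(events, threshold, window_seconds=60):
    """Count successful modify/rename/delete operations per user in fixed
    time windows and return the windows whose count exceeds `threshold`.

    Returns a dict mapping (user, window_start) to the operation count.
    """
    watched = {"modify", "rename", "delete"}  # assumed set of risky operations
    counts = Counter()
    for ev in events:
        if ev.operation in watched and ev.result == "success":
            # Align the timestamp to the start of its fixed window.
            window_start = ev.timestamp - (ev.timestamp % window_seconds)
            counts[(ev.user, window_start)] += 1
    return {key: n for key, n in counts.items() if n > threshold}


# Made-up events: timestamps are seconds from an arbitrary starting point.
sample = [
    Event(t, "alice", f"/home/alice/doc{t}.xls", "modify", "success")
    for t in range(0, 50, 5)
]
sample += [
    Event(70, "bob", "/home/bob/notes.doc", "read", "success"),
    Event(75, "bob", "/home/bob/notes.doc", "modify", "success"),
]

flagged = flag_excessive_activity(sample, threshold=5)
for (user, start), n in flagged.items():
    # Details are shown to the user, who decides whether the activity is harmless.
    print(f"review: user={user} window_start={start}s operations={n}")
```

Here the ten modifications by `alice` within one window exceed the assumed threshold and are reported for review, while the single modification by `bob` is not. In practice the threshold would be derived from the observed log history rather than fixed by hand.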