ABSTRACT
In the past few years there has been explosive growth in the number of pictures produced and shared every day. Technological advancements and the creation of the World Wide Web are among the factors that have contributed to this sudden rise in image data. Such a vast collection of images is hard to handle with conventional image processing methods. A new method is therefore needed that can handle this volume of images while still returning accurate results to the user. We present a novel strategy, ‘Content Based Image Retrieval Using Hadoop’, which handles large amounts of data using the principle of parallel processing.

KEYWORDS: Content Based Image Retrieval (CBIR), Feature Extraction, Image Retrieval, HDFS

1. INTRODUCTION
The last few years have seen an increasing interest in the potential of digital images because of the rapid growth of imaging on the World Wide Web. Locating a desired image in a large and varied collection can be a cumbersome task. Earlier, a method called Text Based Image Retrieval (TBIR) [1] was used for image retrieval. In this method, images are retrieved on the basis of the keywords and tags associated with them, and those tags have to be stored manually in the database with the respective image. Manually assigning keywords is not feasible for a huge number of images. To overcome this drawback, Content Based Image Retrieval (CBIR) was introduced. It is a technique for retrieving a specific image from a database based on the content of the image, where content means features such as shape, texture and colour.

2. BACKGROUND
2.1 Growth of Digital Imaging
The unparalleled growth in the number and availability of images in the twentieth century can be attributed to the use of images in all walks of life. Images are used in all sectors, especially journalism, medicine, education and entertainment [2]. Technological inventions are the major cause of the sudden increase in image data.
Images and pictures were used earlier too, but technological advancements allowed their use in digital form. In the early 1990s, the creation of the World Wide Web gave a sudden boost to the exploitation of digital images.

2.2 Need of Image Data Management
With the sudden increase in the number of digital images being produced daily, a mechanism was required for proper management of this vast collection of images. Digitisation did not by itself make image management an easy task. Some form of indexing and cataloguing was needed to make storage and retrieval speedy and accurate. One of the main problems of image data management was locating a desired image in a large collection [2]. Journalists, engineers and designers all needed some kind of access by image content.

2.3 What is CBIR?
Content Based Image Retrieval is a technique for retrieving a desired image from a vast collection of images based on the content of the image. Retrieval is based on features: colour, shape and texture. Before the introduction of CBIR [3], Text Based Image Retrieval was used. Each image in the database was manually associated with a keyword or tag, which made the process very time consuming. Moreover, Text Based Image Retrieval required the user to annotate each image with the text (metadata) considered relevant. Because different people perceive the same image differently, this led to mismatches in the retrieval process later. To overcome these limitations, Content Based Image Retrieval was proposed in the early 1990s. It aimed at extracting certain features from the image rather than relying on human annotation with metadata. Hence CBIR offered a more accurate and faster retrieval process.

3. CURRENT TECHNOLOGIES FOR IMAGE RETRIEVAL
The last decade saw a number of commercial image management systems being developed. These systems did not have CBIR facilities.
They were based on text keywords, which required a human indexer to associate each image with a keyword. Text Based Image Retrieval was later replaced by Content Based Image Retrieval, which is based on the content of images, including colour, shape and texture. CBIR is efficient for low-dimensional data but inefficient for high-dimensional data, where effective storage and speedy retrieval are needed and traditional data structures are not sufficient. Content Based Image Retrieval Using Hadoop is therefore expected to store content efficiently and make speedy, accurate retrieval of images possible. Three commercial CBIR systems are now available: IBM’s QBIC, Virage’s VIR Image Engine, and VisualSeek [4].

4. PROPOSED WORK
4.1 Image Collection Using Scraping
Scraping is the process of extracting a large amount of information from a website. Using this technique, relevant images will be extracted from the required website for further image processing.

4.2 Storage in HDFS
Uploading images is the first step towards image retrieval. The images extracted by scraping will be uploaded into Hadoop's storage [5], the Hadoop Distributed File System (HDFS), which serves as the image database. It involves two main stages, which are as follows:

4.3 Feature Extraction (Colour Histogram)
After an image is uploaded, its colour features [6] are calculated, i.e. the RGB values of each pixel of the image. For every image in the database, the colour histogram is calculated. Colour histograms represent the colour distribution in an image [7][8]. The colour histogram of each image is then stored in the database.

4.4 Searching of Images
The processes in the upload phase are the same for the search process [9]. The user provides a query image.
The query is also searched by the colour method: its colour histogram values are compared with the histogram values stored in the HDFS database. The system goes through all images in the database to find and retrieve those whose histograms most closely match that of the query.

4.5 Retrieval of Matched Images
When the user uploads a query image, similarity comparisons are made on the basis of the colour histograms, and images similar to the query image are extracted from HDFS. The histogram comparison formula [10] takes two inputs:
hist1: histogram of the query image
hist2: histogram of the image in the database
The database images whose histograms give the minimum distance to the histogram of the query image are displayed on the output panel.
• The input is the query image provided by the user.
• The output is the set of matched images from the database that have features similar to the query image.

5. SYSTEM ARCHITECTURE
FIG 1. ARCHITECTURE OF PROPOSED SYSTEM
The system stores an image in HDFS. The image features, i.e. the colour histogram, are extracted and stored alongside the image in HDFS. When the user uploads a query image, its features are extracted and matched against the feature vectors of the stored images in the database. The images with the least distance are retrieved and displayed to the user.

6. CONCLUSION
With the increase in the number of digital devices in use and the growth of the Internet, thousands of images are added to image databases daily. The need for efficient retrieval of an image from a large collection is shared by many professionals, including design engineers and journalists. There is always room for improvement in currently used systems. Images need to be stored and retrieved effectively and efficiently. Search time is also critical for any method that searches a large dataset.
So, by using ‘Content Based Image Retrieval Using Hadoop’, image retrieval is expected to become a speedy and efficient process.
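
As an illustration of the feature extraction, search and retrieval steps described in Sections 4.3 to 4.5, the following minimal Python sketch computes a quantised RGB colour histogram and ranks stored images by their distance to a query. It is only a sketch under stated assumptions: the text above does not reproduce the comparison formula, so Euclidean distance is assumed here; the function names colour_histogram, histogram_distance and retrieve are hypothetical; the bin count is an arbitrary choice; and an in-memory dictionary stands in for the histograms that the proposed system would keep in HDFS. Hadoop's parallel processing is not shown.

```python
import math


def colour_histogram(pixels, bins_per_channel=4):
    """Return a normalised RGB histogram for an iterable of (r, g, b) pixels.

    Each channel (0..255) is quantised into `bins_per_channel` bins, giving
    bins_per_channel ** 3 histogram cells. The bin count is an assumption.
    """
    hist = [0.0] * (bins_per_channel ** 3)
    count = 0
    for r, g, b in pixels:
        # Map each channel value to a bin index in 0..bins_per_channel - 1.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
        count += 1
    if count == 0:
        raise ValueError("image has no pixels")
    # Normalise so that images of different sizes are comparable.
    return [value / count for value in hist]


def histogram_distance(hist1, hist2):
    """Euclidean distance between two histograms (assumed metric).

    hist1 is the query histogram, hist2 a stored histogram.
    """
    if len(hist1) != len(hist2):
        raise ValueError("histograms must have the same number of bins")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(hist1, hist2)))


def retrieve(query_pixels, stored_histograms, top_k=3):
    """Return the names of the top_k stored images closest to the query.

    stored_histograms maps an image name to its precomputed histogram;
    here it is a plain dict standing in for the features kept in HDFS.
    """
    query_hist = colour_histogram(query_pixels)
    ranked = sorted(
        stored_histograms.items(),
        key=lambda item: histogram_distance(query_hist, item[1]),
    )
    return [name for name, _ in ranked[:top_k]]


# Tiny synthetic "images": flat lists of identical pixels.
red = [(250, 10, 10)] * 16
blue = [(10, 10, 250)] * 16
dark_red = [(200, 30, 20)] * 16

index = {
    "red.png": colour_histogram(red),
    "blue.png": colour_histogram(blue),
}
print(retrieve(dark_red, index, top_k=1))  # prints ['red.png']
```

With four bins per channel, the dark red query falls into the same histogram cell as the red image, so its distance to red.png is zero and that image is ranked first. In a full system the linear scan over all stored histograms is the step that would be distributed across the cluster.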