Outsource Web Data Mining Archives | Outsource Bigdata Blog

Web data extraction: Big Data – Hadoop way

A few years back, data mining was entirely manual, and web data mining took many days for almost all small and medium players in the market. Technology has since evolved, and in the era of big data, manual data mining is no longer the right method; the work is now mostly done with automation tools, custom scripts, or the Hadoop framework.

Now, let us discuss web data extraction. It is the process of collecting data from the World Wide Web using a web scraper, a crawler, manual mining, and so on. A web scraper or crawler is a tool for harvesting information available on the internet. In other words, web data extraction is the process of crawling websites and extracting data from their pages using a tool or a program. Web extraction is related to web indexing, which refers to various methods of indexing the contents of web pages using a bot or web crawler. A web crawler is an automated program, script, or tool that "crawls" web pages to collect information from websites.
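To make the crawling idea concrete, here is a minimal sketch of the core step a crawler performs: collecting the hyperlinks on a page so it knows which pages to visit next. It uses only Python's standard library, and the sample HTML string is a stand-in for a page that a real crawler would first download.

```python
from html.parser import HTMLParser

# Sample page standing in for a fetched web page (a real crawler would
# download it first, e.g. with urllib).
SAMPLE_PAGE = """
<html><body>
  <a href="/page1">First</a>
  <a href="/page2">Second</a>
</body></html>
"""

class LinkCollector(HTMLParser):
    """A toy crawler step: collect every hyperlink on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkCollector()
parser.feed(SAMPLE_PAGE)
print(parser.links)  # the URLs a crawler would queue up to visit next
```

A full crawler simply repeats this step: fetch each collected URL, parse it, and collect its links in turn.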

In the world of big data, data comes from multiple sources and in huge volumes, and one of those sources is the web itself. Web data extraction is one of the means of collecting data from this source. Companies that leverage big data technology use crawlers or custom programs to collect it. This data arrives in bulk, i.e. billions of records, or as a data dump, so it needs to be treated as big data and brought into the Hadoop ecosystem to derive quick insights from it.

There are multiple areas where companies can apply web data extraction. Some of them are:

  • In ecommerce, companies use web data extraction to monitor competitors' prices and improve their own product attributes. They also fetch data from different web sources to collect customer reviews and, using the Hadoop framework, run analyses on them, including sentiment analysis.
  • Media companies use web scraping to collect recent and popular topics of interest from different social media platforms and popular websites.
  • Business directories use web scraping to collect business profiles, addresses, phone numbers, locations, zip codes, etc.
  • In the healthcare sector, physicians scrape data from multiple websites to collect information on diseases, medicines, components, etc.

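The ecommerce case above is the most common one, so here is a small sketch of competitor price monitoring. The shop names, page snippets, and the `extract_price` helper are all hypothetical; in practice the HTML would be fetched from the competitors' sites first.

```python
import re

# Hypothetical competitor product pages, already fetched as HTML strings.
COMPETITOR_PAGES = {
    "shop-a": '<span class="price">$19.99</span>',
    "shop-b": '<span class="price">$17.49</span>',
}

def extract_price(html):
    """Pull the first value marked with class="price" out of a page."""
    match = re.search(r'class="price">\$([0-9.]+)<', html)
    return float(match.group(1)) if match else None

prices = {shop: extract_price(page) for shop, page in COMPETITOR_PAGES.items()}
cheapest = min(prices, key=prices.get)
print(prices, cheapest)
```

Run daily, a script like this gives a company a continuously updated view of where its pricing stands against the market.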
When companies decide to go for web data extraction today, they plan with big data in mind, because they know the data will come in bulk, i.e. millions of records, mostly in semi-structured or unstructured format. So it will need to be treated as big data, using the Hadoop framework and its tools, before it can support any decision making.

In this whole process, the first step is web data extraction. It can be done using the various scraping tools available in the market (both free and paid tools exist) or by creating a custom script with the help of an expert in a scripting language such as Python, Ruby, etc.

The second step is to find insight in the data. For this, we first need to process the data using the right tool, chosen based on the size of the data and the availability of expert resources. The Hadoop framework is the most popular and widely used platform for big data processing. For sentiment analysis of the data, if needed, we can use MapReduce, one of the core components of Hadoop.
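The MapReduce idea behind that second step can be sketched in plain Python: a map phase emits key/value pairs from each record, and a reduce phase aggregates them per key. The reviews and the tiny opinion-word lists below are toy assumptions; a real job would run on Hadoop over the scraped corpus with a much richer sentiment lexicon.

```python
from collections import defaultdict

# Toy reviews standing in for scraped customer feedback.
REVIEWS = [
    "great phone great battery",
    "bad screen great camera",
]

POSITIVE = {"great"}  # minimal stand-in sentiment lexicons
NEGATIVE = {"bad"}

def map_phase(review):
    """Map: emit a (sentiment, 1) pair for each opinion word."""
    for word in review.split():
        if word in POSITIVE:
            yield ("positive", 1)
        elif word in NEGATIVE:
            yield ("negative", 1)

def reduce_phase(pairs):
    """Reduce: sum the counts per sentiment key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

pairs = [pair for review in REVIEWS for pair in map_phase(review)]
sentiment_totals = reduce_phase(pairs)
print(sentiment_totals)
```

On a Hadoop cluster the same two functions would run in parallel across many machines, which is what makes the approach viable at billions of records.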

To summarize, for web data extraction we can choose from various automation tools or develop scripts in a programming language. Developing a script often minimizes effort, as it is reusable with minimal modification. Moreover, as the volume of extracted web data is huge, it is always advisable to go for the Hadoop framework for fast processing.

Web Data Mining: Explore Immense Automation Potential Using Python and R

Today, a massive amount of data is uploaded to the web, creating huge new and exciting business opportunities for small and medium-sized companies. However, collecting the required data is only one part of the story: mining that data and converting it into actionable insight is where the real business value lies. The overall goal of web data mining is to extract information from various web sources and transform it into an understandable structure for further processing. The task of data mining is to analyse a large quantity of data using automatic, semi-automatic, or manual methods.

In general, there are two main approaches to data mining. The first is the traditional way, manual data mining; the other we could call automated or semi-automated data mining. Manual mining is a long, time-consuming process in which records are handled one by one, demanding a great deal of time and effort. Automated web data mining, by contrast, converts most of the repetitive tasks into a simple logic-based script that can scrape all the desired web data. As the name suggests, things can be automated close to 100%: the whole mining process runs automatically using an algorithm. This is the approach companies prefer for web data mining.

To automate the web data mining process, we can follow different methods depending on the web structure, data format, and size; it may require a custom script, written in R, Python, etc., together with an API. We can define the process in any scripting language and run it to source the entire data set from websites. It will scrape all the data we instruct it to in the code.

We could leverage Python or R. Both are well known for scripting and data mining, and are especially used in big data projects. The beauty of Python is that it is a user-friendly language: a person from a non-technical background can learn and understand it quickly without much difficulty. Another benefit of using Python is the number of lines in a script. A web scraping script in Python will run to at most a few hundred lines, whereas in Java or another language it can reach thousands of lines. Due to these advantages, Python is preferred in data mining for exploring big data potential. One more benefit of using Python in data mining is its modules: many Python modules are available for data mining and can be easily used during scripting.
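Python's brevity claim is easy to illustrate: with only the standard library, extracting every headline from a page fits in a couple of lines. The stub page below is an assumption standing in for HTML that a real script would first download.

```python
import re

# A stub page; a real script would fetch this, e.g. with urllib.
PAGE = "<h2>Topic one</h2><p>...</p><h2>Topic two</h2>"

# A whole "scraper" in one line: every <h2> headline on the page.
headlines = re.findall(r"<h2>(.*?)</h2>", PAGE)
print(headlines)
```

The equivalent in Java would need a class, a main method, and explicit matcher loops, which is exactly the line-count difference the paragraph describes.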

For mining huge volumes of web data, it is always good to automate with a custom script so that substantial time and effort can be saved. Data mining acts as the phase in which we obtain data for processing and, later, analysis; it therefore falls under the big data collection phase, and big data tools can be readily applied to its output.

Broadly speaking, there are three steps in web data mining: identify the source or sources of data (the web pages); mine the desired data and save it into the data processing environment using the right tool; and finally, process the data for decision support. Data identification is the first step: here we identify the data on the web pages that we want to mine. In the second step, we check the data pattern, i.e. whether the data appears in the same manner on every page and whether the 'class' name of the data in the source code is the same. Only if the class name is the same on each page can we go for automation; otherwise, running an automated job is a really tough call. The third and last step is writing the code and running the job. Here we use a big data-friendly language like Python and write code that targets the class of the particular value available in the page source. The input can be anything, such as a URL or an ID, as long as it leads to the page holding the data.
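The three steps above can be sketched end to end in a few lines. The pages, the `title` class name, and both helper functions are hypothetical placeholders; a real job would download the identified pages and target whatever class name the site actually uses.

```python
import re

# Step 1: identified sources -- stub pages standing in for downloaded HTML.
PAGES = [
    '<div class="title">Page one heading</div>',
    '<div class="title">Page two heading</div>',
]

TARGET_CLASS = "title"  # the class name we expect to hold the data

# Step 2: confirm the data pattern -- the same class must appear on every
# page, otherwise a single automated extraction rule will not work.
def pattern_is_consistent(pages, cls):
    return all(f'class="{cls}"' in page for page in pages)

# Step 3: run the extraction job against the agreed class name.
def extract_by_class(page, cls):
    match = re.search(rf'class="{cls}">(.*?)<', page)
    return match.group(1) if match else None

if pattern_is_consistent(PAGES, TARGET_CLASS):
    results = [extract_by_class(page, TARGET_CLASS) for page in PAGES]
    print(results)
```

The consistency check in step 2 is what decides whether automation is viable at all, which is why it comes before any extraction code is written.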

Nowadays, everyone prefers to collect web data through automated mining, as it saves both cost and time. For automated web data mining we can use any programming language and various APIs. Python and R seem to be the most preferred languages for web data mining, and they are also considered preferred tools for big data projects.