Today, a massive amount of data is uploaded to the web, creating large and exciting new business opportunities for small and medium-sized companies. However, collecting the required data is only one part of the story. Mining these data and converting them into actionable information is where the real business value lies. The overall goal of the web data mining process is to extract information from various web sources and transform it into an understandable structure for further processing. The task of data mining is to mine, or analyse, a large quantity of data by automatic, semi-automatic or manual means.
In general, there are two main approaches to data mining. The first is the traditional one, manual data mining; the other we could call automated or semi-automated data mining. Manual mining is a long, time-consuming process in which data is mined one record, or one piece of information, at a time. It takes a lot of time and a lot of effort. In automated web data mining, by contrast, most of the repetitive tasks are converted into a simple logic-based script that scrapes all the web data required. As the name suggests, the work can be automated, say, close to 100%: the whole mining process runs automatically according to an algorithm. This is the approach companies prefer for web data mining.
To automate the web data mining process, we can follow different methods depending on the web structure, the data format and the data size. It may require a custom-made script (written in R, Python or a similar language) and an API. We can define the process in any scripting language and run it to source all the data from the websites. The script will scrape exactly the data we instruct it to in the code.
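As a minimal sketch of such a script, the Python below applies one piece of extraction logic to a list of pages. It uses only the standard library. The function names (fetch_html, extract_title, scrape), the User-Agent string and the choice of the page title as the value to collect are all assumptions made for illustration; a real job would target whatever fields the project needs.

```python
import re
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download one page and return its HTML as text."""
    # The User-Agent value is an arbitrary placeholder.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


def scrape(urls, fetch=fetch_html):
    """Apply the same extraction logic to every URL in turn."""
    return {url: extract_title(fetch(url)) for url in urls}
```

Passing the fetch function as a parameter keeps the extraction logic separate from the network access, so the same script can be tried on saved pages before it is pointed at a live site.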
We could leverage Python or R. Both are well known for scripting and data mining, especially in big data projects. The beauty of Python is that it is a user-friendly language: a person from a non-technical background can learn and understand it quickly and without much difficulty. Another benefit of Python is the number of lines a script needs. A web scraping script written in Python will typically run to a few hundred lines at most, whereas the same script in Java or another language can run to thousands. Because of these advantages, Python is often preferred in data mining for exploring the potential of big data. One more benefit of using Python for data mining is its modules: many Python modules for data mining are available and can easily be used in a script.
For mining large volumes of web data, it is always better to automate the work with a custom-made script, so that substantial time and effort are saved. Data mining acts as the phase in which we obtain the data for processing and, later, for analysis. It therefore belongs to the big data collection phase, and big data tools can easily be applied to its output.
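To illustrate the hand-off from collection to processing, here is a small sketch that turns scraped records into CSV text, a format most analysis tools can read. The function name records_to_csv and the url and price fields are hypothetical; they stand in for whatever the scraping job actually collects.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Serialise scraped records (a list of dicts) as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical records, as a scraping job might produce them.
records = [
    {"url": "http://example.test/a", "price": "10.50"},
    {"url": "http://example.test/b", "price": "8.00"},
]
csv_text = records_to_csv(records, fieldnames=["url", "price"])
```

In a real job the text would be written to a file or loaded straight into the processing environment rather than kept in memory.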
Broadly speaking, there are three steps in web data mining: identify the source or sources of data (the web pages); mine the desired data and save it into the data processing environment using the right tool; and finally, process the data for decision support. In practice, the mining itself proceeds as follows. Data identification is the first step: here we identify the data on the web page or pages that we want to mine. In the second step, we check the data pattern, that is, whether the data is presented in the same manner on every page and whether the 'class' name of the data in the page source is the same. Only if the class name of the data is the same on every page can we go for automation; otherwise, running an automation job becomes really difficult. The third and last step is writing the code and running the job. Here we use a tool such as Python. We write code that locates each value by the class it carries in the page source. The input can be anything, such as a URL or an ID, as long as it leads the script to the page that holds the data.
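The second and third steps can be sketched in Python with the standard library's html.parser module. The names ClassTextExtractor, extract_by_class and pages_share_pattern are made up for this example, and the consistency check is a deliberately simple assumption: a class counts as consistent if every page contains at least one element that carries it.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.values = []
        self._depth = 0    # greater than 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.values.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_by_class(html, class_name):
    """Step 3: return the text of each element with the given class."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return parser.values


def pages_share_pattern(pages, class_name):
    """Step 2: True only if every page has an element with the class."""
    return all(extract_by_class(html, class_name) for html in pages)
```

For example, extract_by_class('<span class="price">10.50</span>', "price") returns ["10.50"]. Running pages_share_pattern on a handful of sample pages first gives an early warning when a site uses different class names on different pages, which is the case where automation becomes hard.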
Nowadays, most people prefer to collect web data through automated mining, because it saves both cost and time. For automated web data mining we can use any programming language and a variety of APIs. Python and R seem to be the most preferred languages for web data mining, and they are also considered preferred tools for big data projects.