Data Collection and Preprocessing for Multi-Website Web Log Mining

Tutor: ZhangYuFang
School: Chongqing University
Course: Computer System Architecture
Keywords: Web Log Mining, Data Collection, Data Preprocessing, Data Cleaning, HTTP Request
CLC: TP311.13
Type: Master's thesis
Year: 2012

With the rapid development of the Internet and computing, and especially the global popularization of the Web, the number of users keeps growing and the amount of information available on the Web is increasing exponentially. These factors give rise to two problems. On the one hand, Web providers want to discover users' browsing patterns and interests by analyzing their browsing behavior, so that they can attract more users and gain commercial value by offering personalized services. On the other hand, users want to obtain interesting and useful knowledge quickly from a huge volume of information, which calls for relieving information overload and improving retrieval efficiency. Applying data mining techniques to Web log mining, and improving on the traditional methods, can address both problems.

Before Web log mining, a suitable target dataset must be created and preprocessed. In traditional Web log mining, data can be collected at the server side, at the client side, or at proxy servers. Each type of data collection differs not only in the location of the data source, but also in the kinds of data available, the segment of the population from which the data is collected, and the method of implementation. Data preprocessing uses data cleaning and user identification to extract, from the collected dataset, the data that accurately reveals users' browsing behavior, and converts it into a format that mining algorithms can process. Data collection and preprocessing are therefore basic and critical tasks in Web log mining.

Generally, data collection and preprocessing are carried out in a single-Website environment. This thesis, however, addresses Web log mining across multiple Websites, for which the traditional data collection and preprocessing methods are not suitable. To collect users' browsing behavior data, a network sniffer method based on capturing HTTP protocol data packets is proposed.
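The abstract does not describe how the sniffer is implemented. As a rough illustration of the step that follows packet capture, the sketch below parses the head of an HTTP/1.x request from a raw TCP payload and extracts the fields that later cleaning would need. The names `HttpRequest` and `parse_http_request` are hypothetical, and the capture itself (which normally needs elevated privileges and a capture library) is assumed to happen elsewhere.

```python
from typing import NamedTuple, Optional


class HttpRequest(NamedTuple):
    """Fields of one captured request (hypothetical record layout)."""
    method: str
    url: str
    referer: Optional[str]
    user_agent: Optional[str]


def parse_http_request(payload: bytes) -> Optional[HttpRequest]:
    """Parse the head of an HTTP/1.x request found in a TCP payload.

    Returns None when the payload does not start with a request line,
    for example when it is a response or non-HTTP traffic.
    """
    # Everything before the first blank line is the request head.
    head, _, _ = payload.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Request line: METHOD SP request-target SP HTTP-version
    parts = lines[0].split(" ")
    if len(parts) != 3 or not parts[2].startswith("HTTP/"):
        return None
    method, target, _version = parts

    # Header names are case-insensitive, so store them lower-cased.
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()

    # Rebuild an absolute URL from the Host header unless the target
    # is already absolute (as it is in requests sent to a proxy).
    if target.startswith("http://"):
        url = target
    else:
        url = "http://" + headers.get("host", "") + target

    return HttpRequest(method, url,
                       headers.get("referer"),
                       headers.get("user-agent"))
```

Keeping the Host header alongside the request target matters in a multi-Website setting, because the same sniffer sees requests addressed to many different sites and the path alone does not identify the page.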
Meanwhile, the data collected from multiple Websites is massive, diverse, heterogeneous and dynamic, which makes preprocessing different from existing methods and makes data cleaning in particular more difficult. To address this problem, a new data cleaning method is designed by analyzing HTTP requests; it is based on the referer relations between requests and the time intervals between them.

Finally, to validate the effectiveness of the data collection and data cleaning methods proposed in the thesis, an experimental system was designed and implemented in a local area network. The experimental data were evaluated using recall, precision and F-measure. The results show that the new data collection and cleaning methods for multiple Websites are effective and feasible.
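The abstract names the two signals used for cleaning (referer relations and request intervals) and the evaluation indicators, but not the exact rule. The sketch below is one possible reading, not the thesis's algorithm: a request is assumed to be automatically issued (an embedded image, script or advertisement) when its referer is the last page kept for the same user and it arrives within `max_gap` seconds of that page. The `Request` record, the `max_gap` threshold and both function names are hypothetical, and the F-measure is assumed to be the balanced F1.

```python
from typing import Dict, List, NamedTuple, Optional, Set, Tuple


class Request(NamedTuple):
    user: str               # e.g. client IP address (assumed user key)
    time: float             # capture time in seconds
    url: str
    referer: Optional[str]


def clean_requests(requests: List[Request],
                   max_gap: float = 2.0) -> List[Request]:
    """Keep only requests judged to be user-initiated page views.

    Assumed rule: drop a request when its referer is the last page
    kept for the same user AND it follows that page within max_gap
    seconds. Everything else is kept as a page view.
    """
    last_page: Dict[str, Request] = {}
    kept: List[Request] = []
    for req in sorted(requests, key=lambda r: r.time):
        page = last_page.get(req.user)
        embedded = (
            page is not None
            and req.referer == page.url
            and req.time - page.time <= max_gap
        )
        if not embedded:
            kept.append(req)
            last_page[req.user] = req
    return kept


def precision_recall_f(kept: Set[str],
                       truth: Set[str]) -> Tuple[float, float, float]:
    """Compare kept URLs with hand-labelled page views.

    precision = TP / |kept|, recall = TP / |truth|,
    F = 2 * precision * recall / (precision + recall).
    """
    tp = len(kept & truth)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Usage with made-up data: one page view, two embedded resources,
# then a link followed 15 seconds later.
log = [
    Request("10.0.0.5", 0.0, "http://a.example/index.html", None),
    Request("10.0.0.5", 0.3, "http://a.example/logo.png",
            "http://a.example/index.html"),
    Request("10.0.0.5", 0.4, "http://ads.example/banner.js",
            "http://a.example/index.html"),
    Request("10.0.0.5", 15.0, "http://a.example/news.html",
            "http://a.example/index.html"),
]
kept_urls = {r.url for r in clean_requests(log)}
labelled = {"http://a.example/index.html", "http://a.example/news.html"}
scores = precision_recall_f(kept_urls, labelled)
```

Using the referer rather than file extensions is what lets a rule of this kind work across heterogeneous sites: the advertisement script from a third-party host is dropped because of where it was requested from and when, not because of what its URL looks like.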