HOLISTIC APPROACH FOR EFFICIENT EXTRACTION OF WEB DATA.
The volume of valuable information available on the internet, in digital libraries, news sources, and company databases or intranets is growing tremendously. The World Wide Web has become a source of information that supports decision making across social, political, and economic spheres. Such information would be more valuable if it were available to end users and other application systems in the formats they require. This has created a need for tools that help users extract relevant information quickly and effectively. We explore an efficient mechanism for extracting web data through analysis of HTML tags and patterns. HTML constitutes a large percentage of web content. However, much of this content lacks strict structure and a proper schema. In addition, web content is updated frequently and is semantically heterogeneous compared with formats such as XML, which are more rigidly structured. We have produced a customized generic model that can be used to extract unstructured data from the web and load it into a database. The main contribution is an automated process for locating, extracting, and storing data from HTML web sources. Such data is then available to other application software for analysis and further processing.
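To make the idea of tag-based extraction concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the model described in this abstract. It uses only the standard library (`html.parser` and `sqlite3`), treats `<tr>`, `<td>`, and `<th>` tags as the pattern of interest, and tolerates missing closing tags, which are common in loosely structured HTML. The class and function names, the sample HTML, and the `extracted` table with its columns are all hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class TableRowExtractor(HTMLParser):
    """Collect cell text grouped by table row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _close_cell(self):
        # Finish the open cell, collapsing runs of whitespace.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        # Finish the open row; empty rows are dropped.
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def extract_rows(html):
    """Return the table rows found in an HTML string."""
    parser = TableRowExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def store_rows(conn, rows):
    """Store each cell as (row_no, col_no, value); the schema is an assumption."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted "
        "(row_no INTEGER, col_no INTEGER, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO extracted VALUES (?, ?, ?)",
        [(r, c, v) for r, row in enumerate(rows) for c, v in enumerate(row)],
    )
    conn.commit()


# Hypothetical input; the last row deliberately omits its closing </td> tags.
SAMPLE_HTML = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td> 9.99 </td></tr>
  <tr><td>Gadget<td>19.50</tr>
</table>
"""

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    store_rows(connection, extract_rows(SAMPLE_HTML))
    for record in connection.execute("SELECT * FROM extracted"):
        print(record)
```

Storing one record per cell keeps the sketch independent of any particular page layout, since the number of columns need not be known in advance. A real extractor would also need to locate the relevant pages and map cells to meaningful fields, which this sketch does not attempt.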