Recognising Web Page Features

Client : Confidential

price-scraper

The Problem

Our client uses web spiders to crawl the Internet and retrieve products and their prices. This generally involved developing software to crawl through the pages of a given web site, recognise which pages are product pages (and which are not), and then identify the product name and price to add to their search engine. Each web spider was built on an open source framework but had to be customised by software developers to extract data from one particular web site, because site structure can vary significantly. For a large number of web sites this proved costly, and more software engineers were required as the number of sites increased. In addition, whenever a web site changed its structure, the code for its spider had to be modified again.

The issue was that each web site containing products was different from the others. For example, product names and prices can appear in different positions on a page. Sometimes there are also multiple prices on a page, for items such as delivery or to show a discount. The client had determined that each web site would require custom programming because of these differences.

We were able to suggest a machine learning approach to the problem.

The Solution

Our solution to this client problem was to take away the reliance on software engineers to build scripts for every web site by automating the process of recognising products and their prices so that scripts for crawling web sites could be built automatically.

This was done by first training a classification machine learning algorithm to recognise the difference between product pages and non-product pages based on a number of example pages. The algorithm achieved a high degree of accuracy in determining which pages on a given e-commerce site were product pages and which were not.
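The first stage can be illustrated with a minimal sketch. The case study does not name the algorithm or the features that were used, so the example below assumes a bag-of-words Naive Bayes classifier with add-one smoothing, and the training pages, labels and class name are all hypothetical. It is intended only to show the general shape of learning "product page" versus "other page" from labelled examples.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lower-case words plus currency symbols; digits are ignored on purpose.
    return re.findall(r"[a-z]+|[£$€]", text.lower())


class NaiveBayesPageClassifier:
    """Hypothetical multinomial Naive Bayes over page text (illustration only)."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training pages
        self.vocab = set()

    def fit(self, pages, labels):
        for text, label in zip(pages, labels):
            tokens = tokenize(text)
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus log likelihood with add-one (Laplace) smoothing.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up example pages, reduced to their visible text.
train_pages = [
    "Acme Kettle £24.99 add to basket in stock free delivery",
    "Blue Widget $9.50 add to cart in stock reviews",
    "About us our company history and team",
    "Contact us email phone address opening hours",
]
train_labels = ["product", "product", "other", "other"]

page_classifier = NaiveBayesPageClassifier().fit(train_pages, train_labels)
print(page_classifier.predict("Red Teapot £12.00 add to basket"))
print(page_classifier.predict("Contact our team by email or phone"))
```

A production system would need far more training pages and would probably use page structure as well as text, but the principle of learning from labelled example pages is the same.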

This machine learning system then feeds product pages to a second classification system, which had been trained to recognise product names and product prices on the page. Product names and prices are then extracted for storage in a web search system.
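The second stage can be sketched in a similar way. The example below assumes that each product page has already been split into (tag, text) fragments and uses a nearest-centroid classifier over four hand-picked features. The features, the labelled fragments and the function names are assumptions made for illustration and are not taken from the client system. The labelled data deliberately includes delivery prices marked as "other", reflecting the multiple-price problem described earlier.

```python
import math
import re


def features(tag, text):
    # Hypothetical features for one text fragment of a product page.
    text = text.strip()
    digits = sum(ch.isdigit() for ch in text)
    return [
        1.0 if re.search(r"[£$€]", text) else 0.0,  # contains a currency symbol
        digits / max(len(text), 1),                 # share of characters that are digits
        1.0 if tag in ("h1", "h2") else 0.0,        # appears in a heading
        min(len(text.split()), 10) / 10,            # word count, capped and scaled
    ]


def train_centroids(labelled_nodes):
    # Average the feature vectors of each label to obtain one centroid per label.
    sums, counts = {}, {}
    for tag, text, label in labelled_nodes:
        vec = features(tag, text)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(centroids, tag, text):
    # Assign the label whose centroid is closest in Euclidean distance.
    vec = features(tag, text)
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


def extract(centroids, nodes):
    # Keep the first fragment classified as a name and the first classified as a price.
    result = {}
    for tag, text in nodes:
        label = classify(centroids, tag, text)
        if label in ("name", "price"):
            result.setdefault(label, text)
    return result


# Made-up labelled fragments; delivery prices are labelled "other".
labelled = [
    ("h1", "Acme Stainless Steel Kettle", "name"),
    ("span", "£24.99", "price"),
    ("h1", "Blue Widget Deluxe", "name"),
    ("span", "$9.50", "price"),
    ("p", "Free delivery on orders over £50", "other"),
    ("p", "Delivery from £3.95", "other"),
    ("a", "Add to basket", "other"),
    ("p", "Customer reviews and ratings", "other"),
]
centroids = train_centroids(labelled)

new_page = [
    ("h1", "Red Ceramic Teapot"),
    ("span", "£12.00"),
    ("p", "Free delivery on orders over £40"),
]
extracted = extract(centroids, new_page)
print(extracted)
```

Because the fragments are classified from general features rather than from their position in one site's markup, the same trained model can in principle be applied to a site it has not seen before, which is what removes the need for a hand-written spider per site.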

The system cuts the time needed to incorporate products from a new web site by more than 90% and allows a far larger number of sites to be crawled, at a lower cost, than the previous hand-coded solution.