reading-notes


Web scraping:

Web scraping is a technique for automatically accessing and extracting large amounts of information from a website, which can save a huge amount of time and effort.

Important notes about web scraping:

1 - Read through the website’s Terms and Conditions to understand how you can legally use the data. Most sites prohibit you from using the data for commercial purposes.

2 - Make sure you are not downloading data at too rapid a rate, because this may overload the website; you may also be blocked from the site.
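
To illustrate the second point, here is a minimal pacing sketch in Python. It assumes the third-party requests library, and the URLs are placeholders:

```python
import time

import requests

# Placeholder URLs; substitute pages you are allowed to scrape.
urls = [
    "https://example.com/page/1",
    "https://example.com/page/2",
]

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(2)  # pause between requests so the server is not overloaded
```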

Web scraping:

  • Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.
  • Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser.
  • While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler.

Scraping a web page involves fetching it and extracting data from it. Fetching is the downloading of a page (which a browser does when a user views a page), so web crawling is a main component of web scraping: it fetches pages for later processing. Once a page is fetched, extraction can take place: the content of the page may be parsed, searched, and reformatted, its data copied into a spreadsheet, and so on.
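
As a concrete illustration of fetch-then-extract, here is a minimal sketch using only Python's standard library; https://example.com is a placeholder, and the extraction step just grabs the page title:

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the page's <title> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


# Fetching: download the raw HTML of the page.
html = urlopen("https://example.com").read().decode("utf-8", errors="replace")

# Extraction: parse the HTML and pull out the piece we want.
parser = TitleParser()
parser.feed(html)
print(parser.title)
```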

Web scraping is used for contact scraping, and as a component of applications used for web indexing, web mining and data mining, online price-change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashups, and web data integration.

  • Techniques:

    1- Human copy-and-paste:

    The simplest form of web scraping is manually copying and pasting data from a web page into a text file or spreadsheet.

    2 - Text pattern matching:

    A simple yet powerful approach to extracting information from web pages is based on the UNIX grep command or the regular-expression facilities of programming languages (a sketch follows after this list).

    3 - HTTP programming:

    Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming (a sketch follows after this list).

    4 - HTML parsing:

    Data can be extracted by parsing a page's HTML into a tree and selecting the elements of interest; pages generated from a common template encode records of the same kind in similar markup (a sketch follows after this list).

    5 - DOM parsing:

    By embedding a full-fledged web browser, such as an Internet Explorer or Mozilla browser control, programs can retrieve the dynamic content generated by client-side scripts (a sketch follows after this list).

    6 - Vertical aggregation:

    Some companies build vertical-specific harvesting platforms that create and monitor many bots for a given vertical, with little or no direct human involvement.

    7 - Semantic annotation recognizing:

    Pages may embed metadata or semantic markup that can be used to locate specific data snippets. If the annotations are embedded in the pages, this technique can be viewed as a special case of DOM parsing. In another case, the annotations, organized into a semantic layer,[3] are stored and managed separately from the web pages, so scrapers can retrieve the data schema and instructions from this layer before scraping the pages.

    8 - Computer vision web-page analysis:

    There are efforts using machine learning and computer vision to identify and extract information from web pages by interpreting the pages visually, as a human being might.
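
A sketch of text pattern matching (technique 2), using Python's built-in re module; the HTML fragment and the price pattern are invented for illustration:

```python
import re

# Sample fragment; in practice this would be fetched page source.
html = """
<ul>
  <li>Widget A - $19.99</li>
  <li>Widget B - $24.50</li>
</ul>
"""

# Invented pattern matching "name - $price" list items.
pattern = re.compile(r"<li>(.+?) - \$([\d.]+)</li>")

for name, price in pattern.findall(html):
    print(name, float(price))
```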
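
A sketch of HTTP programming (technique 3): sending a raw GET request over a TCP socket with Python's standard library. example.com is a placeholder host, and the sketch ignores details such as redirects and chunked encoding:

```python
import socket

HOST = "example.com"  # placeholder host

# A minimal HTTP/1.1 GET request written by hand.
request = (
    "GET / HTTP/1.1\r\n"
    f"Host: {HOST}\r\n"
    "Connection: close\r\n"
    "\r\n"
)

with socket.create_connection((HOST, 80)) as sock:
    sock.sendall(request.encode("ascii"))
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:  # server closed the connection
            break
        chunks.append(data)

response = b"".join(chunks).decode("utf-8", errors="replace")
print(response.split("\r\n\r\n", 1)[0])  # status line and headers only
```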
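
A sketch of HTML parsing (technique 4), assuming the third-party BeautifulSoup library (beautifulsoup4); the HTML fragment mimics two records generated from one template:

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Two records generated from one template share the same markup.
html = """
<div class="listing"><h2>Cosy flat</h2><span class="price">$900</span></div>
<div class="listing"><h2>Studio</h2><span class="price">$650</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# One selector pulls every record because the template repeats the structure.
for listing in soup.select("div.listing"):
    title = listing.h2.get_text()
    price = listing.select_one("span.price").get_text()
    print(title, price)
```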
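
A sketch of DOM parsing (technique 5) with an embedded browser. Selenium driving a locally installed Firefox is assumed here as a modern stand-in for the Internet Explorer or Mozilla browser controls mentioned above, and the URL is a placeholder:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a real browser so client-side JavaScript runs before we read the DOM.
driver = webdriver.Firefox()
try:
    driver.get("https://example.com")  # placeholder URL
    # After the page's scripts have run, query the rendered DOM.
    for heading in driver.find_elements(By.TAG_NAME, "h1"):
        print(heading.text)
finally:
    driver.quit()
```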

Resources:

Done by Omar-zoubi