Scrapy - technology that simplifies web scraping

When programming, we often rely on existing APIs to supply the data our application needs. For example, an app that shows the current weather has to get that data from somewhere, and usually a publicly available API does the job. But what if no API offers the data we are interested in? That is when page scraping is worth considering. In this article I will introduce a tool that helps us scrape pages.

What is page scraping?

Page scraping is nothing more than extracting content from a page and saving it, for example for use in your own application. It is used by sites such as Ceneo and Google, and by portals that collect job listings from other portals. Keep in mind that, depending on what we later do with the scraped data, this can sometimes be illegal.

What is Scrapy?

Scrapy is a Python framework and one of the most popular tools for scraping websites. It provides the tools you need to efficiently extract data from pages, process it, and store it in your preferred structure and format. Scrapy is easy to use, supports asynchronous requests, and can automatically adjust its crawling speed with the "AutoThrottle" mechanism.
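The automatic throttling mentioned above is controlled through project settings. The fragment below is a minimal sketch of how it might be switched on in a project's settings.py; the setting names are Scrapy's, while the numeric values are only illustrative and should be tuned for the site being crawled.

```python
# settings.py (sketch): enable Scrapy's AutoThrottle extension.
# The values below are illustrative assumptions, not recommendations.

# Turn the extension on.
AUTOTHROTTLE_ENABLED = True

# Initial download delay, in seconds.
AUTOTHROTTLE_START_DELAY = 5.0

# Upper bound for the delay when the server responds slowly, in seconds.
AUTOTHROTTLE_MAX_DELAY = 60.0

# Average number of requests Scrapy should send in parallel to each remote site.
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```

The same keys can also be set for a single spider through its custom_settings dictionary instead of for the whole project.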

Scrapy Spider

The most important part of Scrapy is the Spider class. Scrapy uses spiders to collect information from a website: a spider defines how a site is crawled and how data is extracted from its pages.

Here is an example of a Spider class that extracts quotes from a page:

import scrapy


class QuotesSpider(scrapy.Spider):
    # Name that identifies this spider
    name = 'quotes'
    # The crawl starts from these URLs
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        # Follow the "next" link, if there is one, and parse that page the same way
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

We save this code in a file named "quotes_spider.py" and start our scraping bot with the command:

scrapy runspider quotes_spider.py -o quotes.jl

When our bot finishes its work, we should get a file named "quotes.jl" containing the scraped quotes in JSON Lines format, that is, one JSON object per line.

{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
{"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d"}
...
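To use the scraped data in an application, the output has to be read back line by line, because each line is a separate JSON document. The sketch below shows one way this might be done with the standard library. The helper name read_json_lines is hypothetical, and an in-memory sample stands in for the real file so that the example runs on its own.

```python
import io
import json


def read_json_lines(lines):
    """Yield one dict per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for open("quotes.jl", encoding="utf-8"); the two records
# are copied from the sample output above.
sample_file = io.StringIO(
    r'{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}' "\n"
    r'{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}' "\n"
)

quotes = list(read_json_lines(sample_file))
authors = [quote["author"] for quote in quotes]
print(authors)
```

With the real file, the same helper can be fed an open file object, since iterating over a text file yields its lines.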
