Scrapy
Scrapy is an open-source framework written in Python for extracting data from websites. It is designed for web scraping, which is the automated retrieval of data from web pages.
When programming, we often rely on available APIs to provide the data our application needs. For example, to build an app that shows the current weather, we need to get that data from somewhere, and most often we use one of the APIs available on the market. But what if we can't find an API for the data we are interested in? That's when page scraping is worth considering. In this article I introduce a tool that helps us scrape pages.
What is page scraping?
Page scraping is nothing more than extracting content from a page and saving that data, for example for use in your own application. Page scraping is used by sites such as Ceneo and Google, and by portals that collect job listings from other portals. Keep in mind that what we later do with such data can sometimes be illegal.
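To make the idea concrete before turning to Scrapy, here is a minimal sketch of "extracting content from a page" using only Python's standard library. The HTML snippet and the QuoteTextParser class are made up for illustration; a real page would be downloaded first, and this is not how Scrapy itself works internally.

```python
from html.parser import HTMLParser


class QuoteTextParser(HTMLParser):
    """Collects the text inside every <span class="text"> element."""

    def __init__(self):
        super().__init__()
        self._in_quote = False
        self.quotes = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "span" and ("class", "text") in attrs:
            self._in_quote = True
            self.quotes.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_quote = False

    def handle_data(self, data):
        if self._in_quote:
            self.quotes[-1] += data


# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<div class="quote"><span class="text">First quote.</span></div>
<div class="quote"><span class="text">Second quote.</span></div>
"""

parser = QuoteTextParser()
parser.feed(SAMPLE_HTML)
parser.close()
quotes = parser.quotes
print(quotes)
```

Writing this kind of parser by hand gets tedious quickly, which is the gap a dedicated scraping framework fills.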
What is Scrapy?
Scrapy is a Python framework and one of the most popular and powerful tools for scraping websites. It provides the tools you need to efficiently extract data from pages, process it, and store it in your preferred structure and format. Scrapy is easy to use, supports asynchronous requests, and can automatically adjust its crawling speed with the AutoThrottle extension.
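As a rough sketch, AutoThrottle is switched on through project settings. The setting names below are Scrapy's own; the values are only illustrative starting points and should be tuned for the site being crawled.

```python
# settings.py of a Scrapy project (illustrative values, not recommendations)
AUTOTHROTTLE_ENABLED = True            # turn the AutoThrottle extension on
AUTOTHROTTLE_START_DELAY = 5           # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # upper bound for the delay when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average number of parallel requests per remote site
```

The same keys can also be placed in a spider's custom_settings dictionary when only one spider should be throttled.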
Scrapy Spider
The most important parts of Scrapy are the Spider classes. Scrapy uses them to collect information from a website: a Spider defines which pages to visit and how to extract data from them.
An example of a Spider class that extracts quotes from a page:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">.
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        # Follow the "next" pagination link, if there is one.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Save this code in a file named "quotes_spider.py" and start the scraping bot with the command:
scrapy runspider quotes_spider.py -o quotes.jl
When the bot finishes its work, we should get a file "quotes.jl" containing the scraped quotes in JSON Lines format, that is, one JSON object per line:
{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
{"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d"}
...
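Because every line of the output is a standalone JSON object, the file can be read back with the standard json module. The sketch below uses a hypothetical helper, parse_json_lines, and feeds it one line copied from the output above instead of opening the real file.

```python
import json


def parse_json_lines(lines):
    """Return one dict per non-empty line; each line is a standalone JSON object."""
    return [json.loads(line) for line in lines if line.strip()]


# One line taken from the sample output above (raw string keeps the \u escapes as-is).
sample_lines = [
    r'{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}',
    "",  # blank lines are skipped
]

records = parse_json_lines(sample_lines)
for record in records:
    print(record["author"], "said", record["text"])
```

To process the real output, pass an open file object instead of the sample list, for example: with open("quotes.jl", encoding="utf-8") as f: records = parse_json_lines(f).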