Scrapy
Scrapy is an open-source framework written in Python for extracting data from websites. It is designed for web scraping, which is the automated retrieval of data from web pages.
When programming, we often rely on existing APIs to provide the data our application needs. For example, when building an app that shows the current weather, we have to get that data from somewhere, and most of the time we use one of the APIs available on the market. But what if we can't find an API for the data we are interested in? That is when page scraping is worth considering. This article introduces a tool that helps us scrape pages.
What is page scraping?
Page scraping is nothing more than extracting some content from a page and saving it, for example for use in your own application. Page scraping is used by sites such as Ceneo and Google, and by portals that collect job listings from other portals. Keep in mind that, depending on what we later do with the collected data, using it can sometimes be illegal.
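To make the idea concrete, here is a minimal sketch that uses only Python's standard library (not Scrapy). It pulls the text of every `<h2>` element out of a small HTML snippet. The sample HTML, the `HeadingParser` class and the `extract_headings` function are made up for illustration; a real scraper would first download the page over HTTP.

```python
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Collects the text found inside every <h2> element."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        # Only keep text that appears between <h2> and </h2>.
        if self._in_h2:
            self.headings[-1] += data


# Hypothetical page content standing in for a downloaded job-listing page.
SAMPLE_HTML = """
<html><body>
  <h2>Python Developer</h2><p>Remote</p>
  <h2>Data Engineer</h2><p>Warsaw</p>
</body></html>
"""


def extract_headings(html):
    """Return the stripped text of all <h2> elements in the given HTML."""
    parser = HeadingParser()
    parser.feed(html)
    parser.close()
    return [heading.strip() for heading in parser.headings]


print(extract_headings(SAMPLE_HTML))
# prints ['Python Developer', 'Data Engineer']
```

Writing a parser like this by hand for every page quickly becomes tedious, which is the kind of work a scraping framework is meant to take over.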
What is Scrapy?
Scrapy is a Python framework and one of the most popular and powerful tools for scraping websites. It provides the tools you need to efficiently extract data from pages, process it, and store it in your preferred structure and format. Scrapy is easy to use, supports asynchronous requests, and can automatically adjust its crawling speed with the AutoThrottle extension.
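As a rough illustration, automatic throttling is switched on through project settings. The fragment below is a sketch of what could go into a project's `settings.py`. The setting names are Scrapy's AutoThrottle settings, but the values are only example choices, not recommendations from this article.

```python
# settings.py fragment: enable Scrapy's AutoThrottle extension.
# The values below are illustrative assumptions; tune them for the target site.

AUTOTHROTTLE_ENABLED = True            # turn the extension on (it is off by default)
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound for the delay when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average number of parallel requests per remote site
AUTOTHROTTLE_DEBUG = False             # set to True to log throttling stats for each response
```

The same keys can also be placed in a spider's `custom_settings` dictionary when only one spider should be throttled.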
Scrapy Spider
The most important part of Scrapy is the Spider class. Scrapy uses spiders to collect information from a website: each one defines which pages to visit and how to extract data from them.
Here is an example of a Spider class that extracts quotes from a page:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # Pages the spider requests first.
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Each quote on the page lives in its own <div class="quote">.
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        # Follow the "Next" link, if there is one, and parse that page the same way.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Save this code in a file named "quotes_spider.py" and start the scraping bot with the command:
scrapy runspider quotes_spider.py -o quotes.jl
When the bot finishes its work, we should get a file "quotes.jl" containing the quotes in JSON Lines format, that is, one JSON object per line:
{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
{"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d"}
...
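Because every line of "quotes.jl" is a separate JSON object, the file can be read back with Python's standard `json` module. The sketch below parses two of the records shown above from a string so that it runs without the file. The `parse_json_lines` helper is a name made up for this example.

```python
import json

# Two records copied from the sample output above; a raw string keeps the
# \u201c and \u201d escapes intact so that json.loads decodes them.
SAMPLE_JL = r'''{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
'''


def parse_json_lines(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


quotes = parse_json_lines(SAMPLE_JL)
for quote in quotes:
    print(quote["author"], "-", quote["text"])
```

To read the real file, the same helper could be fed the file's contents, for example `parse_json_lines(open("quotes.jl", encoding="utf-8").read())`.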