Scrapy
Scrapy is an open-source framework written in Python for extracting and processing data from websites. It is designed for web scraping, the automated retrieval of data from web pages.
When programming, we often rely on existing APIs to provide the data our application needs. For example, when building an app that shows the current weather, we have to get that data from somewhere, and most often we use one of the APIs available on the market. But what if there is no API for the data we are interested in? That is when page scraping is worth considering. In this article I will introduce a tool that helps us scrape pages.
What is page scraping?
Page scraping is nothing more than extracting content from a page and saving it, for example for use in your application. It is used by sites such as Ceneo and Google, and by portals that aggregate job listings from other portals. Keep in mind that, depending on what we later do with the data, using it can sometimes be illegal.
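To make the idea concrete, here is a minimal sketch of the extraction step using only Python's standard library rather than Scrapy. The HTML fragment, the `PriceParser` class, the `extract_prices` helper, and the `price` CSS class are all made up for illustration; a real page would first have to be downloaded and would have its own markup.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


def extract_prices(html):
    """Return the price strings found in an HTML document."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


# Hypothetical page fragment standing in for a downloaded page.
SAMPLE_HTML = """
<ul>
  <li>Laptop <span class="price">3499.00</span></li>
  <li>Mouse <span class="price">79.90</span></li>
</ul>
"""

print(extract_prices(SAMPLE_HTML))  # ['3499.00', '79.90']
```

Hand-written parsers like this become unwieldy quickly, which is the gap that a framework such as Scrapy is meant to fill.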
What is Scrapy?
Scrapy is a Python framework and one of the most popular and powerful tools for scraping websites. It provides the tools you need to efficiently extract data from pages, process it, and store it in your preferred structure and format. Scrapy is easy to use, supports asynchronous requests, and can automatically adjust crawling speed with its AutoThrottle extension.
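AutoThrottle is not enabled by default. Below is a sketch of how it might be switched on in a project's `settings.py`. The setting names are Scrapy's own, while the numeric values are illustrative assumptions, not recommendations for any particular site.

```python
# settings.py (fragment)

# Turn on the AutoThrottle extension (it is disabled by default).
AUTOTHROTTLE_ENABLED = True
# Initial download delay, in seconds.
AUTOTHROTTLE_START_DELAY = 5
# Upper bound for the delay, in seconds, when the server responds slowly.
AUTOTHROTTLE_MAX_DELAY = 60
# Average number of requests to send in parallel to each remote site.
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```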
Scrapy Spider
The most important parts of Scrapy are the Spider classes. Scrapy uses them to collect information from a website: they define which pages to crawl and how data should be extracted from them.
Here is an example of a Spider class that extracts quotes from a page:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'https://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote">.
        for quote in response.css('div.quote'):
            yield {
                'author': quote.xpath('span/small/text()').get(),
                'text': quote.css('span.text::text').get(),
            }

        # Follow the "next" link, if there is one, and parse that page too.
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
We save this code in a file named "quotes_spider.py" and start our scraping bot with the command:
scrapy runspider quotes_spider.py -o quotes.jl
When the bot finishes its work, we should get a file "quotes.jl" containing the scraped quotes in JSON Lines format, with one JSON object per line:
{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}
{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}
{"author": "Garrison Keillor", "text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d"}
...
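Because each line of the output is a separate JSON object, the file can be read back line by line. Here is a minimal sketch using only the standard library. The helper name `parse_json_lines` is hypothetical, and the sample lines are copied from the output above so that the snippet runs without the file.

```python
import json


def parse_json_lines(lines):
    """Turn an iterable of JSON Lines strings into a list of dicts."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


# Two lines copied from the output above. Raw strings keep the \u escapes
# intact so that json.loads is the one to decode them.
sample = [
    r'{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"}',
    r'{"author": "Steve Martin", "text": "\u201cA day without sunshine is like, you know, night.\u201d"}',
]

for record in parse_json_lines(sample):
    print(record["author"])

# With the real file, the open file object can be passed in directly:
# with open("quotes.jl", encoding="utf-8") as f:
#     quotes = parse_json_lines(f)
```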