Python Web Crawler Environment Setup
2024-01-01

Environment Setup

This post covers installing Python 3, request libraries, parsing libraries, databases, storage libraries, web libraries, app scraping libraries, and web crawler frameworks.

  • Python 3

    • On Windows 11, Python 3 can be installed directly from the Microsoft Store
    • On Debian/Ubuntu-based Linux, run sudo apt-get install python3
  • Request libraries

    • requests

      pip3 install requests

    • selenium

      pip install selenium

    • ChromeDriver

      1. Check your Chrome version under About Chrome
      2. Download the matching ChromeDriver version from the ChromeDriver download page
      3. Add the directory containing ChromeDriver to your PATH environment variable
    • phantomJS

      Recent Selenium versions no longer support PhantomJS; use headless Chrome through ChromeDriver instead

      Verification:

      from selenium import webdriver
      from selenium.webdriver.chrome.options import Options

      # Run Chrome without a visible window
      chrome_options = Options()
      chrome_options.add_argument('--headless')
      chrome_options.add_argument('--disable-gpu')
      driver = webdriver.Chrome(options=chrome_options)
      driver.get("https://dreaife.icu/")
      print(driver.current_url)
      driver.quit()  # close the browser and stop the ChromeDriver process
    • aiohttp

      pip install aiohttp

      Optionally, install aiodns to speed up DNS resolution:

      pip install aiodns
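If the Selenium verification above fails, one quick thing to rule out is the ChromeDriver PATH step. The following is a minimal sketch that uses only the standard library; the helper name find_driver is hypothetical and not part of Selenium.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path:
        print(f"ChromeDriver found at {path}")
    else:
        print("ChromeDriver is not on PATH; revisit the PATH step above")
```

A None result only means the executable is not on PATH; it does not check that the driver version matches the installed Chrome.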

  • Parsing libraries

    • lxml

      pip install lxml

    • beautifulsoup4

      pip install beautifulsoup4

    • pyquery

      pip install pyquery

    • tesserocr

      • Install Tesseract

        Windows

      • Install tesserocr

        On Windows, download a prebuilt wheel and install it with pip install <name>.whl

      • Verification


        import tesserocr
        from PIL import Image
        image = Image.open('G:/codeS/backOnGithub/Jupyter/spider/image.png')
        print(tesserocr.image_to_text(image))

        Note: If you see an error such as File "tesserocr.pyx", line 2580, in tesserocr._tesserocr.image_to_text followed by RuntimeError: Failed to init API, possibly an invalid tessdata path, copy the tessdata folder into the directory named in the error message.
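When chasing the tessdata error described in the note, it can help to confirm which directory on your machine actually holds the language files. This is a rough sketch using only the standard library; the helper name find_tessdata and the search order (the TESSDATA_PREFIX environment variable, then a tessdata folder in the current directory) are assumptions for illustration, not something tesserocr prescribes.

```python
import os
from pathlib import Path


def find_tessdata(candidates):
    """Return the first candidate directory that holds *.traineddata files."""
    for candidate in candidates:
        if not candidate:
            continue  # skip unset environment variables and empty strings
        directory = Path(candidate)
        if directory.is_dir() and any(directory.glob("*.traineddata")):
            return directory
    return None


if __name__ == "__main__":
    # Assumed search order: TESSDATA_PREFIX first, then ./tessdata
    found = find_tessdata([os.environ.get("TESSDATA_PREFIX"), "tessdata"])
    print(found if found else "No tessdata directory with language files found")
```

If this prints a directory, that is the folder to copy to the location named in the error message.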

  • Databases

    • MySQL
    • MongoDB
    • Redis
  • Storage libraries

    • PyMySQL

      pip install pymysql

    • PyMongo

      pip install pymongo

    • redis-py

      pip install redis

    • RedisDump

      Install Ruby first, then run:

      gem install redis-dump
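Before trying the storage libraries, you may want to confirm that the database servers are actually listening. The sketch below only attempts a TCP connection with the standard library; it assumes the servers run on the local machine with their default ports (3306 for MySQL, 27017 for MongoDB, 6379 for Redis), and the helper name port_open is hypothetical.

```python
import socket

# Default ports of the three database servers listed above (assumes defaults).
DEFAULT_PORTS = {"MySQL": 3306, "MongoDB": 27017, "Redis": 6379}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in DEFAULT_PORTS.items():
        state = "reachable" if port_open("127.0.0.1", port) else "not reachable"
        print(f"{name} (port {port}): {state}")
```

A reachable port shows only that something is listening there; authentication and the actual client connection still have to be tested with PyMySQL, PyMongo, or redis-py.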

  • Web libraries

    • Flask

      pip install flask

    • Tornado

      pip install tornado

  • App scraping libraries

    • Charles

    • mitmproxy

      pip install mitmproxy

    • Appium

  • Web crawling frameworks

    • pyspider

      pip install pyspider

      If pyspider fails to run on Windows 11, see the post "Running pyspider on Windows 11 with Docker"

    • scrapy

    • scrapy-splash

    • scrapy-redis
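Once everything is installed, a short script can report which of the Python packages from this post are present. This is a minimal sketch using importlib.metadata from the standard library (Python 3.8+); the helper name installed_versions is hypothetical, and the names for scrapy, scrapy-splash, and scrapy-redis are assumed to match their pip distribution names, since the post gives no install command for them.

```python
from importlib import metadata

# Distribution names as passed to pip; the last three are assumed names.
PACKAGES = [
    "requests", "selenium", "aiohttp",
    "lxml", "beautifulsoup4", "pyquery", "tesserocr",
    "pymysql", "pymongo", "redis",
    "flask", "tornado", "mitmproxy",
    "pyspider", "scrapy", "scrapy-splash", "scrapy-redis",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    result = {}
    for name in packages:
        try:
            result[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version if version else 'not installed'}")
```

Non-Python tools such as ChromeDriver, Tesseract, the database servers, RedisDump, Charles, and Appium are outside what this check can see.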

Python Web Crawler Environment Setup
https://dreaife.tokyo/en/posts/python-env-setup/
Author
dreaife
Published at
2024-01-01
License
CC BY-NC-SA 4.0
