Python Web Crawler Environment Setup

2024-01-01

Environment Setup

Python 3/Request libraries/Parsing libraries/Databases/Storage libraries/Web libraries/App scraping libraries/Web crawling frameworks

Python 3
- On Windows 11, install it directly from the Microsoft Store
- On Linux (Debian/Ubuntu), run apt-get install python3
Request libraries
- requests
  
  pip3 install requests
- selenium
  
  pip install selenium
- ChromeDriver
  1. View the Chrome version in About Chrome
  2. Download the corresponding version from ChromeDriver
  3. Add ChromeDriver to your environment variables
- ~~phantomJS~~
  
  Recent Selenium versions no longer support PhantomJS; use headless Chrome through ChromeDriver instead
  
  Verification:
```
# Launch headless Chrome through ChromeDriver and load a page
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')  # run without opening a window
chrome_options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://dreaife.icu/")
print(driver.current_url)  # printing the URL confirms the page loaded
driver.quit()  # close the browser and stop ChromeDriver
```
- aiohttp
  
  pip install aiohttp
  
  Optionally, for faster DNS resolution: pip install aiodns
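
Step 3 of the ChromeDriver setup above (adding it to your environment variables) is easy to get wrong without noticing. Here is a minimal sketch, using only the standard library, that checks whether an executable named chromedriver is visible on PATH. The helper name `find_driver` is made up for this example.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path:
        print(f"chromedriver found at {path}")
    else:
        print("chromedriver is not on PATH; recheck the environment variable step")
```

Note that Selenium 4.6 and later bundle Selenium Manager, which can fetch a matching driver automatically, so a missing PATH entry may not be fatal on those versions.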
Parsing libraries
- lxml
  
  pip install lxml
- beautifulsoup4
  
  pip install beautifulsoup4
- pyquery
  
  pip install pyquery
- tesserocr
  - Install Tesseract
    
    Windows
  - Install tesserocr
    
    On Windows, install the prebuilt wheel with pip install <name>.whl
  - Verification
    import tesserocr
    from PIL import Image

    image = Image.open('G:/codeS/backOnGithub/Jupyter/spider/image.png')
    print(tesserocr.image_to_text(image))

    Note: If you get "RuntimeError: Failed to init API, possibly an invalid tessdata path" (raised from File "tesserocr.pyx", line 2580, in tesserocr._tesserocr.image_to_text), copy the tessdata folder into the path shown in the error message
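
When the tessdata error above appears, it can help to confirm that the folder really contains language files before rerunning tesserocr. The sketch below uses only the standard library and assumes the usual Tesseract layout, where each language is a `<code>.traineddata` file. The helper name `list_traineddata` and the example path are hypothetical.

```python
from pathlib import Path


def list_traineddata(tessdata_dir):
    """Return sorted language codes found as *.traineddata files in tessdata_dir."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []  # folder missing: nothing Tesseract could load from here
    return sorted(p.stem for p in folder.glob("*.traineddata"))


if __name__ == "__main__":
    # Hypothetical path: replace with the folder named in your error message
    languages = list_traineddata("path/from/error/message/tessdata")
    print(languages or "no .traineddata files found")
```

An empty result suggests the tessdata folder is missing or was copied to the wrong place.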
Databases
- MySQL
- MongoDB
- Redis
Storage libraries
- PyMySQL
  
  pip install pymysql
- PyMongo
  
  pip install pymongo
- redis-py
  
  pip install redis
- RedisDump
  
  Install Ruby
  
  gem install redis-dump
Web libraries
- Flask
  
  pip install flask
- Tornado
  
  pip install tornado
App scraping libraries
- Charles
- mitmproxy
  
  pip install mitmproxy
- Appium
Web crawling frameworks
- pyspider
  
  pip install pyspider
  
  If it will not run on Windows 11, see the post "Running pyspider on Windows 11 with Docker"
- scrapy
- scrapy-splash
- scrapy-redis
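
After working through the list, a quick way to see what is still missing is to ask Python whether each module can be found. This is a rough sketch that uses only the standard library. The mapping from pip names to import names is written out by hand, and `missing_packages` is a name invented for this example.

```python
from importlib.util import find_spec

# pip package name -> import name (they differ for a few packages)
PACKAGES = {
    "requests": "requests",
    "selenium": "selenium",
    "aiohttp": "aiohttp",
    "lxml": "lxml",
    "beautifulsoup4": "bs4",
    "pyquery": "pyquery",
    "tesserocr": "tesserocr",
    "pymysql": "pymysql",
    "pymongo": "pymongo",
    "redis": "redis",
    "flask": "flask",
    "tornado": "tornado",
    "mitmproxy": "mitmproxy",
    "pyspider": "pyspider",
    "scrapy": "scrapy",
    "scrapy-splash": "scrapy_splash",
    "scrapy-redis": "scrapy_redis",
}


def missing_packages(packages=PACKAGES):
    """Return the pip names whose import name cannot be located."""
    return [pip_name for pip_name, module in packages.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed packages are importable")
```

This only checks that the modules can be located. It does not start the databases or confirm that tools such as Charles or Appium are installed.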


Python Web Crawler Environment Setup

https://dreaife.tokyo/en/posts/python-env-setup/

Author

dreaife

Published at

2024-01-01

License

CC BY-NC-SA 4.0

Some information may be outdated
