
How to set up a proxy in ParseHub

ParseHub is an automated tool for collecting information from various web resources, ideal for subsequent analysis. Key features and functions of this web scraping service include:

  • ParseHub's web scraping service is intuitive, allowing you to select page elements for extraction with simple clicks. You can customize settings to specify the type of data you need, such as text, images, or links.
  • The tool supports simultaneous data processing from multiple pages or sites. It also allows for the creation of action sequences to scrape data from different website sections.
  • The ParseHub API facilitates the export of extracted data into various formats like CSV, Excel, JSON, and Google Sheets, streamlining analysis and processing.

ParseHub is favored by marketers, analysts, and companies that regularly gather website data, not only for its extensive functionality but also for its proxy integration capability. Using proxy servers for data parsing improves scraper performance in several ways:

  • Proxies conceal your real IP address during data collection, enhancing security.
  • They help avoid site restrictions, especially when a site detects high activity from a single IP address.
  • They distribute requests across multiple IP addresses, reducing the number of requests sent from any one address.
  • They can mimic requests from various geographic locations, which is essential for accessing region-specific data.
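
The idea of spreading requests across several IP addresses can be sketched in Python as a simple round-robin rotation. This is only an illustration, not part of ParseHub itself: the addresses in PROXY_POOL are hypothetical documentation-range placeholders, and the helper names proxy_rotation and assign_proxies are made up for this example.

```python
from itertools import cycle

# Hypothetical proxy endpoints (documentation-range addresses); replace with your own.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def proxy_rotation(pool):
    """Yield requests-style proxy mappings, cycling through the pool forever."""
    for proxy_url in cycle(pool):
        # The same proxy endpoint is used for both plain HTTP and HTTPS traffic.
        yield {"http": proxy_url, "https": proxy_url}


def assign_proxies(urls, pool):
    """Pair each target URL with the next proxy mapping in round-robin order."""
    rotation = proxy_rotation(pool)
    return [(url, next(rotation)) for url in urls]


if __name__ == "__main__":
    pages = [f"https://example.com/page/{n}" for n in range(1, 6)]
    for url, proxies in assign_proxies(pages, PROXY_POOL):
        # With the requests library installed, each pair could be used as:
        #   requests.get(url, proxies=proxies, timeout=10)
        print(url, "->", proxies["http"])
```

With three proxies and five pages, the fourth page is sent through the first proxy again, so no single address carries all of the traffic.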

Optimizing web scraping with proxies boosts data analysis productivity and efficiency. For guidance on configuring a proxy in ParseHub, refer to the tutorial below.

Video tutorial for proxy configuration in ParseHub

Setting up a proxy in ParseHub on Windows and macOS

To configure a proxy in the utility, which includes a built-in browser, follow these steps (they apply to both Windows and macOS, as the interfaces are identical):

  1. Open the program and click the “New Project” button in the main window to create a new project for proxy configuration.

    image001.png

  2. In the next window, enter the URL of the site you wish to parse in the left pane, then click “Start project on this URL”.

    image003.png

  3. Activate the “Browse” slider to begin parsing information from the website.

    image005.png

  4. To configure the proxy, access the ParseHub browser menu by clicking the three horizontal lines in the top right corner.

    image007.png

  5. In the dropdown menu, select the “Options” gear button.

    image009.png

  6. In the “Options” menu, navigate to the “Advanced” section on the left, go to the “Network” tab, and in the “Connection” settings, click the “Settings” button.

    image011.png

  7. In the following window, select “Manual Proxy Configuration” to enable the fields for entering the IP address, port, and authentication data.

    image013.png

  8. For an HTTP proxy, enter the data in the format “IP:username:password”. If you wish to use the same IP and port for SOCKS, FTP, and SSL protocols, select “Use this proxy server for all protocols”.

    image014.png

  9. Below, there is an option to specify exceptions for the proxy. You can list IP addresses or website URLs for which the proxy will be bypassed. After finalizing your settings, click OK.

    image015.png

With these settings, site parsing for this project will occur through a proxy server, providing anonymous access to data and helping avoid blocks due to frequent requests from the same IP address.

Connecting a proxy in ParseHub on Linux

Configuring a proxy in ParseHub on a Linux device can be done in two ways: through a configuration file or using an API. We'll start with the simpler method of creating a configuration file.

  1. Create a proxy configuration file named proxy.json using any text editor. This file should include the proxy's name, server address, port, username, and password. Use the following template as the file's contents:

    {
      "proxies": [
        {
          "name": "YourProxyName",
          "server": "ProxyServerAddress",
          "port": ProxyServerPort,
          "username": "ProxyUsername",
          "password": "ProxyPassword"
        }
      ]
    }

  2. Replace each placeholder with the details of the proxy server.
    • YourProxyName - choose a convenient name for your proxy.
    • ProxyServerAddress - enter your server's address.
    • ProxyServerPort - specify the port number (a bare number, without quotes).
    • ProxyUsername - your authentication username.
    • ProxyPassword - your authentication password.

    image023.png

  3. Save the file on your PC. To launch ParseHub with these proxy settings, run the command parsehub “proxy/path/to/your/proxy.json” in the terminal, as shown in the provided screenshot.

    image024.png
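
If you would rather generate the configuration file than type it by hand, the sketch below writes a proxy.json with the same fields as the template above. It assumes nothing beyond that template: the function name write_proxy_config is hypothetical, and all values in the usage section are placeholders.

```python
import json
from pathlib import Path


def write_proxy_config(path, name, server, port, username, password):
    """Write a proxy.json file with the fields used in the template above."""
    port = int(port)  # the template expects the port as a JSON number, not a string
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    config = {
        "proxies": [
            {
                "name": name,
                "server": server,
                "port": port,
                "username": username,
                "password": password,
            }
        ]
    }
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


if __name__ == "__main__":
    # All values below are placeholders; substitute your own proxy details.
    write_proxy_config(
        "proxy.json", "YourProxyName", "203.0.113.10", 8080,
        "ProxyUsername", "ProxyPassword",
    )
    print(Path("proxy.json").read_text(encoding="utf-8"))
```

Generating the file with json.dumps guarantees valid JSON, which avoids the most common hand-editing mistakes such as a trailing comma or a quoted port number.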

For the second method, integrating ParseHub with Python, follow these steps:

  1. Install the requests library by entering “pip install requests” in your terminal.
  2. Set up access to the ParseHub API. Replace the placeholders in the following template with your details:
    • parsehub_api_url - the URL of the ParseHub API endpoint you are calling.
    • proxy - enter your proxy's technical details (e.g., 156.25.7.9:9090).
    • headers - the request headers carrying your API key.

    image026.png

  3. Initiate a request to ParseHub using these proxy settings. For instance, the provided screenshot shows a code snippet for sending a GET request and processing the response.

    image027.png

  4. For private proxies, use the following code structure:

    # Requires the requests library (see step 1: pip install requests)
    import requests

    # Placeholder values: replace them with your proxy's details
    proxy_ip = 'IP address'
    proxy_port = 'port number'
    proxy_username = 'username'
    proxy_password = 'password'

    # Most HTTP proxies are addressed with the http:// scheme even when they
    # carry HTTPS traffic; use https:// here only if the proxy itself accepts TLS
    proxy_url = f'http://{proxy_username}:{proxy_password}@{proxy_ip}:{proxy_port}'

    # A session applies the same proxy settings to every request it sends
    session = requests.Session()
    session.proxies = {
        'http': proxy_url,
        'https': proxy_url,
    }

    url = 'https://example.com'
    response = session.get(url)
    print(response.text)

    Refer to the example code and screenshot for guidance:

    Безымянный.jpg
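
One practical detail with private proxies: if the username or password contains characters such as “@” or “:”, they must be percent-encoded before being placed in the proxy URL, otherwise the URL is parsed incorrectly. A minimal sketch, assuming an HTTP proxy; the helper names build_proxy_url and build_proxies are hypothetical:

```python
from urllib.parse import quote


def build_proxy_url(ip, port, username, password, scheme="http"):
    """Return a proxy URL with the credentials percent-encoded."""
    user = quote(username, safe="")  # encode every reserved character
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{ip}:{port}"


def build_proxies(ip, port, username, password):
    """Return a mapping in the form the requests library expects for proxies."""
    proxy_url = build_proxy_url(ip, port, username, password)
    return {"http": proxy_url, "https": proxy_url}


if __name__ == "__main__":
    # Placeholder address and credentials for illustration only.
    print(build_proxies("203.0.113.10", 8080, "user", "p@ss:word"))
```

The returned mapping can be assigned to session.proxies or passed as the proxies argument of a single request.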

Once your proxy settings and ParseHub API key are correctly configured, the Python setup is complete. Using private proxies enhances security and anonymity for web scraping, and enables access to sites blocked by your ISP.
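
Since the screenshots of the API request are not reproduced here, the following is a rough sketch of what such a request might look like. It rests on assumptions that should be checked against ParseHub's API documentation: the endpoint path and the api_key and format query parameters are written as recalled from the public API reference, where the key is passed as a query parameter rather than in a header, and the function names and token values are hypothetical.

```python
# Assumed base URL of the ParseHub REST API; verify against the official docs.
API_BASE = "https://www.parsehub.com/api/v2"


def build_data_request(project_token, api_key, fmt="json"):
    """Return the URL and query parameters for a project's latest finished run."""
    url = f"{API_BASE}/projects/{project_token}/last_ready_run/data"
    params = {"api_key": api_key, "format": fmt}
    return url, params


def fetch_last_run(project_token, api_key, proxies, timeout=30):
    """Download the latest run's data through the given proxies."""
    import requests  # third-party library: pip install requests

    url, params = build_data_request(project_token, api_key)
    response = requests.get(url, params=params, proxies=proxies, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors such as 401 or 407
    return response.json()


if __name__ == "__main__":
    # Placeholder token and key; the request itself is not sent here.
    url, params = build_data_request("YOUR_PROJECT_TOKEN", "YOUR_API_KEY")
    print(url)
    print(params)
```

To actually send the request, call fetch_last_run with your project token, API key, and a proxies mapping such as {"http": "http://user:pass@156.25.7.9:9090", "https": "http://user:pass@156.25.7.9:9090"}.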