Using Selenium with Python on Quest

Selenium is a tool for automating web applications, and it is commonly used for web scraping. It works with several different web browsers (Chrome, Firefox, …) and programming languages (Java, Python, C#, …). To run a Selenium script that automates a web browser, browser-specific drivers and libraries are required. Manually managing these components can be cumbersome, which led to the development of Selenium Manager, a browser driver management tool that is included with all recent Selenium releases. Here, we demonstrate how to install and use Selenium (with Selenium Manager) in a virtual environment to run simple Python web scraping scripts on Quest.

Note: While there are Chrome installations available on Quest, we recommend that users running Selenium through Python use the browser and driver versions that Selenium Manager automatically downloads and caches.
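Because Selenium Manager downloads drivers the first time a script runs, it can be useful to see what has been cached. The following is a minimal sketch, not part of Selenium itself: the helper name list_cached_files is hypothetical, and the cache location ~/.cache/selenium is an assumption about Selenium Manager's default that may differ on your setup.

```python
from pathlib import Path

# Assumption: default Selenium Manager cache location; adjust if yours differs.
DEFAULT_CACHE = Path.home() / ".cache" / "selenium"


def list_cached_files(cache_dir=DEFAULT_CACHE):
    """Return sorted relative paths of all files under cache_dir.

    Returns an empty list if the directory does not exist yet
    (for example, before the first Selenium run).
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    return sorted(
        str(path.relative_to(cache_dir))
        for path in cache_dir.rglob("*")
        if path.is_file()
    )


# Print whatever is currently cached (prints nothing if the cache is empty).
for name in list_cached_files():
    print(name)
```

If nothing is printed, the cache is empty or lives somewhere else; running a Selenium script once should populate it.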

 

Creating and activating a virtual environment on Quest

First, load the mamba module on Quest:

[<net_id>@quser32 py_selenium_ex]$ module load mamba/23.1.0

Next, create a virtual environment with python, selenium, and whatever other packages you may need. The --prefix argument creates the virtual environment in a specified location, rather than the default (/home/<net_id>/.conda/envs/).

[<net_id>@quser32 py_selenium_ex]$ mamba create --prefix ./my_selenium_env -c conda-forge python=3.11 selenium matplotlib ipykernel pandas --yes

Once the virtual environment has been created, activate it with the conda activate command. You may need to first run the command eval "$(conda shell.bash hook)" depending on whether or not you have initialized your shell to use conda.

[<net_id>@quser31 py_selenium_ex]$ eval "$(conda shell.bash hook)"

(base) [<net_id>@quser31 py_selenium_ex]$ conda activate ./my_selenium_env/

(~/examples/py_selenium_ex/my_selenium_env) [<net_id>@quser31 py_selenium_ex]$ 

You have now created and activated an Anaconda virtual environment including python and selenium on Quest. While this environment is activated, the specified python version and packages will be available. To deactivate the environment (and return to the base environment), run conda deactivate. For more information about Anaconda virtual environments on Quest, see this page.
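To confirm that the activated environment is the one Python is actually using, a short check like the following can be run with python inside the environment. This is only a sketch; describe_environment is a hypothetical helper name, and it relies solely on the standard library.

```python
import sys
from importlib import metadata


def describe_environment(packages=("selenium",)):
    """Report the interpreter prefix, Python version, and package versions.

    A package that is not installed is reported as None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "prefix": sys.prefix,
        "python": sys.version.split()[0],
        "packages": versions,
    }


info = describe_environment()
print("Environment prefix:", info["prefix"])
print("Python version:", info["python"])
for package, version in info["packages"].items():
    print(package, "->", version if version else "not installed")
```

With the environment from this example activated, the prefix should end in my_selenium_env and selenium should report a version rather than "not installed".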

 

A simple example Python script using Selenium and Chrome

To run a Python script that uses Selenium to automate Chrome, it is important to set the Chrome options shown in the following example script. Without these options, launching the web driver on Quest may fail.

selenium_chrome_ex.py

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

#### optionally include to print some debugging information ####
import logging
logging.basicConfig(level=logging.DEBUG)
################################################################

## include these chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-first-run")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--disable-metrics")
chrome_options.add_argument("--disable-translate")
chrome_options.add_argument("--bwsi")

driver = webdriver.Chrome(options=chrome_options)

################################################################

## use Selenium to do something simple
driver.get("https://www.google.com")
print(driver.title)
driver.quit()

print('Done.')

Here, the added options specify the following:

  • --headless : runs Chrome without opening a browser window (opening a window will not work on a remote machine like a Quest compute node without graphics forwarding)
  • --no-first-run : skips Chrome’s first-run tasks, which could cause automation to fail
  • --no-sandbox : runs the process in a ‘non-sandboxed’ (less restricted) environment
  • --disable-dev-shm-usage : prevents storage of temporary Chrome files in a shared memory location (that users may not have access to on Quest)
  • --disable-metrics : prevents Chrome from collecting metrics about these processes
  • --disable-translate : disables Chrome’s translate feature, which may interfere with some automation processes
  • --bwsi : ‘browse without sign-in’ starts a guest session (the user/process will not be prompted to log into Chrome)
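The flags listed above can be kept in one place and reused across scripts. The following is a minimal sketch under stated assumptions: combined_flags and build_chrome_options are hypothetical helper names (they are not part of Selenium), and selenium is imported only when a driver's options are actually built.

```python
# Flags copied from the example script, in the same order.
QUEST_CHROME_FLAGS = (
    "--headless",
    "--no-first-run",
    "--no-sandbox",
    "--disable-dev-shm-usage",
    "--disable-metrics",
    "--disable-translate",
    "--bwsi",
)


def combined_flags(extra_flags=()):
    """Return the Quest flags followed by extra_flags, dropping duplicates.

    Order is preserved, so the Quest flags always come first.
    """
    return list(dict.fromkeys((*QUEST_CHROME_FLAGS, *extra_flags)))


def build_chrome_options(extra_flags=()):
    """Build a selenium Options object carrying the combined flags."""
    # Imported here so the flag list can be inspected without selenium installed.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for flag in combined_flags(extra_flags):
        options.add_argument(flag)
    return options
```

A script could then call, for example, webdriver.Chrome(options=build_chrome_options(["--window-size=1920,1080"])), where the window size flag is only an illustration of adding an extra option.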

 

After including these Chrome options, this script simply loads Google and prints the title of the page. To automate your own Chrome processes, see Selenium’s documentation or other resources.
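In the example script, driver.quit() is not reached if driver.get() raises an error, which can leave a Chrome process running on the node. One way to guard against that is sketched below. The names managed_driver, fetch_title, and headless_chrome are hypothetical and chosen for illustration; the flags are the ones from the example script.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs even if the body raised, so the browser process is cleaned up.
        driver.quit()


def fetch_title(url, factory):
    """Load url with a driver from factory and return the page title."""
    with managed_driver(factory) as driver:
        driver.get(url)
        return driver.title


def headless_chrome():
    """Driver factory using the Chrome options from the example script."""
    # Imported lazily so the helpers above do not require selenium on import.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for flag in (
        "--headless",
        "--no-first-run",
        "--no-sandbox",
        "--disable-dev-shm-usage",
        "--disable-metrics",
        "--disable-translate",
        "--bwsi",
    ):
        options.add_argument(flag)
    return webdriver.Chrome(options=options)
```

Calling print(fetch_title("https://www.google.com", headless_chrome)) would then reproduce the behavior of the example script. Passing the factory as an argument also makes it easy to swap in a different browser setup.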

 

Running your Selenium script on Quest as a batch job

To run the above script (selenium_chrome_ex.py) as a batch job on Quest, use the following submission script.

#!/bin/bash

#SBATCH --account=<allocation_id> ## your allocation 
#SBATCH --partition=<partition> ## e.g. short, normal, long, buyin 
#SBATCH --nodes=1 ## change this if parallelizing over multiple nodes
#SBATCH --ntasks-per-node=1 ## change this if parallelizing over multiple cpus
#SBATCH --mem=8GB ## change this if necessary
#SBATCH --time=00:25:00 ## change this if necessary
#SBATCH --output=./selen_ex_out.out ## where standard output and error are written
#SBATCH --job-name=selen_ex_job ## job name for your reference 

module purge

module load mamba/23.1.0

eval "$(conda shell.bash hook)"

conda activate ./my_selenium_env/

python ./selenium_chrome_ex.py

conda deactivate

For more information about submitting jobs on Quest, see this page. To develop and run Python scripts with Selenium in a Jupyter notebook, create an IPython kernel from your virtual environment following these instructions.

 

 

 

 
