Python Best Practices – Part 2

May 2, 2014 — Jeff Buturff

This is a continuation of our Python Best Practices series.  For part one of this series, click here.

Multi-Processing

In many cases, I want to apply the same processing to several different items.  In this example, my client had six different geodatabases that needed to be extracted.  Rather than extract them one at a time, we used Python’s multiprocessing module to spawn one extractor process for each database.

The Python “multiprocessing” module is straightforward to use.  First, you need to import the “multiprocessing” module.

There are also two variables you’ll need to declare.  One is proc_list, a plain Python list that holds the processes we spawn, one entry per process.  The other is result_queue, a “multiprocessing.Queue()” that carries the results of each spawned process back to the main process.

 

import multiprocessing

# Get ready for our multiprocessing
proc_list = []
result_queue = multiprocessing.Queue()
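For context, the snippets in this post assume a ConfigParser object named config that has already been loaded from the script’s config file.  Here is a minimal sketch of that setup; the paths below are hypothetical placeholders, not the actual project paths.

import ConfigParser
import logging

# Minimal setup sketch (assumed, not from the original script):
# the snippets below expect "config" to be a ConfigParser loaded from config_file.
config_file = "c:/scripts/fgdb_extract/extract.cfg"   # hypothetical path
workspace = "c:/fgdb"                                  # hypothetical path

config = ConfigParser.ConfigParser()
config.read(config_file)

logging.basicConfig(level=logging.INFO)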

 

The next section of code below is where we spawn our processes.  In this example, my config file has one section for each database that I will extract (hence, one section for each process).

 

# For each database in our config file, spawn a process to extract it.
for database in config.sections():
logging.info("Starting process for database :: " + database )
    proc = multiprocessing.Process(name=database, target=ProcessDatabase, args=(config_file, workspace, database, result_queue,))
    proc_list.append(proc)
    proc.start()

 

The key to the code above is the “target” parameter in the multiprocessing.Process call.  The target (in this example, “ProcessDatabase”) is a function defined in your script:

 

def ProcessDatabase(config_file, workspace, database, result_queue):
    proc_name = multiprocessing.current_process().name
    try:
        print "Process " + proc_name + " :: starting"

        STATUS = 'SUCCESS'

        # Read our configuration file
        config = ConfigParser.ConfigParser()
        config.read(config_file)

        # Read our relates configuration file
        relates = ConfigParser.ConfigParser()
        relates_config = config.get("DEFAULT", "relate_config")
        relates.read(relates_config)

        # Start a logfile for this database
        logfile = PrepareLogging(database)
        logging.info("Processing database :: " + database)

        # Get the source connection and target file geodatabase for this section
        connection = config.get(database, "source_sde")
        fGDB = config.get(database, "file_gdb_pth")

        # Do your processing here
        STATUS = CopyDatasets(connection, fGDB, STATUS)
        STATUS = CopyTables(connection, fGDB, STATUS)

        logging.info(proc_name + " process completed. STATUS = " + STATUS)
        result_queue.put([database, STATUS])

    except Exception as e:
        # Make sure the main process still gets a result if anything goes wrong
        logging.error(proc_name + " failed :: " + str(e))
        result_queue.put([database, 'FAILED'])

 

One item of note here:  with multiprocessing, your best option is to create a new logfile for each process.  Otherwise, trying to serialize all of the writes to a single logfile, and then decipher everything in it afterwards, would be a nightmare.  So one of the first things my “ProcessDatabase” function does is call PrepareLogging to create a new, process-specific logfile.
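PrepareLogging isn’t shown in this post, so here is a rough sketch of what such a helper could look like.  This is my own sketch, not the exact function from the project: it simply attaches a database-specific FileHandler to the root logger and hands back the logfile path (the log_dir location is a hypothetical placeholder).

import logging
import os

def PrepareLogging(database):
    # Hypothetical sketch: one logfile per process, named after the database
    log_dir = "c:/logs/fgdb_extract"          # hypothetical location
    logfile = os.path.join(log_dir, database + ".log")

    handler = logging.FileHandler(logfile, mode='w')
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)

    return logfile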

The config file to support the code above looks like:

 

[DEFAULT]
 . . . 
 . . . 

[Database1]
source_sde: c:/connections/database1.sde
file_gdb_pth: c:/fgdb/database1.gdb


[Database2]
source_sde: c:/connections/database2.sde
file_gdb_pth: c:/fgdb/database2.gdb

 

Going back to my “proc.start” calls, all I have to do now is join each child process back to the main process.  This forces the main process to wait until every child process has finished.  Use proc_list to join them:

 

# Wait on our processes to all finish
for proc in proc_list:
    proc.join()

 

Your main script will now wait at the “proc.join” statement until all child processes return.  Once they have, the next lines of code loop through the result_queue, which holds one result per process:

 

# Loop thru our result_queue and see how each process did.
results = [result_queue.get() for proc in proc_list]
STATUS = 'SUCCESS'
for result in results:
    logging.info('Result :: ' + result[0] + ' :: ' + result[1])
    if result[1] != 'SUCCESS':
        STATUS = 'FAILED'

# All processes succeeded.  Continue.
if STATUS == 'SUCCESS':
    logging.info('All processes finished successfully.  Processing complete.')

 

One final note on multiprocessing.  ArcGIS 10.x uses Python 2.6 or above, which includes the multiprocessing module in the standard library.  If you’re still on ArcGIS 9.x, you are using Python 2.5, which does NOT include multiprocessing.  You will need to download the backport of multiprocessing for Python 2.5, which is available from here.

Remote Processing

In some cases, you may want to spawn a process on a remote server.  There are several possible options for doing this.  I found the most reliable to be a simple Popen call to Microsoft’s PsExec (part of the Sysinternals suite).

 

from subprocess import Popen, PIPE

def StartRemote(ps_exec):
    logging.debug("Entering StartRemote")
    # ps_exec is the full PsExec command line (see the config entry below)
    proc = Popen(ps_exec, stdin=None, stdout=PIPE)
    # Read (and log) the remote command's output; this blocks until it finishes
    logging.debug(proc.stdout.read())

ps_exec = config.get("DEFAULT", "ps_exec")

# Run it.
StartRemote(ps_exec)
logging.info("Remote process completed.")

 

The config file entry for the “ps_exec” variable looks like this:

 

[DEFAULT]
ps_exec: %(scripts_dir)s/fgdb_extract/psexec.exe \\%(remote_server)s %(scripts_dir)s/fgdb_server/fgdb_server.cmd

Sending Email

If you are scheduling your Python script to run nightly, you’ll probably want to send yourself the logfile when it’s finished.  That way, each morning you can start your day by reviewing the nightly processing logfiles.  And who doesn’t love starting their day with a good logfile??

Python includes the smtplib module for sending email via SMTP.  Here is a basic function that sends the logfile as an email.  All my variables are stored in my config file.

 

import base64
import datetime
import smtplib
import string

def SendLogfile(file, config, STATUS):

    HOST = config.get("DEFAULT", "smtp_host")
    PORT = config.getint("DEFAULT", "smtp_port")
    FROM = config.get("DEFAULT", "smtp_from")
    PWD = base64.b64decode(config.get("DEFAULT", "smtp_pwd"))
    # smtp_send_to is stored as a Python list literal of addresses
    TO = eval(config.get("DEFAULT", "smtp_send_to"))

    # Read the logfile contents; this becomes the body of the email
    f = open(file)
    msgbody = f.read()
    f.close()

    SUBJECT = "FGDB Extract -- " + STATUS + " -- " + datetime.date.today().strftime("%B %d, %Y")

    BODY = msgbody

    body = string.join((
        "From: %s" % FROM,
        "To: %s" % ', '.join(TO),
        "Subject: %s" % SUBJECT,
        "",
        BODY), "\r\n")

    server = smtplib.SMTP_SSL(HOST, PORT)
    server.ehlo()
    server.login(FROM, PWD)
    server.sendmail(FROM, TO, body)
    server.close()
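A typical call at the end of the main script might look something like this; the exact logfile variable and wiring are assumptions on my part, based on the snippets above:

# Mail the main logfile, along with the overall STATUS from the multiprocessing section
SendLogfile(logfile, config, STATUS)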

Encrypting Passwords

Nobody should put plain-text passwords in a config file.  So I recommend at least using Python’s “base64” module to encode the password.  To encode a password (in this example, my password is “password”):

 

C:\>\Python27\ArcGIS10.2\python.exe
Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import base64
>>> base64.b64encode("password")
'cGFzc3dvcmQ='
>>>

 

So my encoded password becomes ‘cGFzc3dvcmQ=’.  I can (sort of) safely put that string in my config file.  Then, in the Python script, decode it with:

 

PWD = base64.b64decode(config.get("DEFAULT", "smtp_pwd"))

 

Security note:  while encoding the password makes it slightly more difficult to read, any halfway savvy Python user can look at the code and figure out how to decode it.  Therefore, you should still set proper permissions on your config and Python files so the password doesn’t fall into the wrong hands!
