
Python, while being a versatile and user-friendly programming language, has one well-known limitation when it comes to concurrency. The Global Interpreter Lock (GIL) in CPython (the reference implementation of Python) prevents threads from fully utilizing multiple CPU cores. However, Python provides an alternative that sidesteps the GIL: the multiprocessing module. This module allows developers to achieve true parallelism by running separate processes on different cores of the CPU.
In this blog, we’ll dive deep into Python’s multiprocessing module, explore how it works, and provide real-world examples of how to write efficient, parallelized code to maximize your program's performance.
What is the multiprocessing Module?
The multiprocessing module in Python allows you to create multiple processes that run in parallel, fully utilizing the CPU cores. Unlike threading, where threads share the same memory space and are subject to the GIL, each process in multiprocessing has its own independent memory space. This means you can run CPU-bound tasks in parallel and achieve better performance, especially for compute-heavy operations.
By using this module, you can split a task into smaller subtasks, distribute them across different CPU cores, and wait for them to complete. This is a typical pattern in tasks like image processing, data analysis, web scraping, machine learning, and many other fields.
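A quick way to see how many cores you have available to distribute work across; a minimal sketch (the printed number depends on your machine):
import multiprocessing

if __name__ == "__main__":
    # Worker counts are commonly sized to the number of cores the OS reports.
    print("CPU cores available:", multiprocessing.cpu_count())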
Understanding the Basics of multiprocessing
The core idea of multiprocessing is to run tasks concurrently by creating separate processes, each with its own memory and Python interpreter. Let’s first understand how to create and manage processes using this module.
Creating a Process
The multiprocessing module provides a Process class that allows you to spawn new processes. Each process runs independently and can execute a target function.
Example 1: Simple Process Creation
import multiprocessing
import time

def worker_function(name):
    print(f"Worker {name} started.")
    time.sleep(2)
    print(f"Worker {name} finished.")

if __name__ == "__main__":
    # Creating two processes
    process1 = multiprocessing.Process(target=worker_function, args=("A",))
    process2 = multiprocessing.Process(target=worker_function, args=("B",))

    # Starting the processes
    process1.start()
    process2.start()

    # Joining the processes (wait until they finish)
    process1.join()
    process2.join()

    print("Both workers finished.")
Explanation:
We created two Process objects, each running worker_function.
The start() method begins execution of the processes.
The join() method ensures that the main program waits for the processes to complete before continuing.
Output (the two workers run concurrently, so the exact ordering may vary between runs):
Worker A started.
Worker B started.
Worker A finished.
Worker B finished.
Both workers finished.
How the multiprocessing Module Handles Processes
The multiprocessing module creates child processes using a platform-dependent start method: on Linux the traditional default is fork, while Windows and (since Python 3.8) macOS default to spawn, which launches a fresh interpreter for each child. Either way, each process runs its own copy of the Python interpreter, which ensures true parallelism. This is why processes are not subject to the GIL and can fully utilize multiple CPU cores.
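You can inspect the default or opt into a specific start method; a minimal sketch (what it prints depends on your platform and Python version):
import multiprocessing

def report():
    print("Hello from a child process")

if __name__ == "__main__":
    # Show which start method this platform uses by default.
    print("Default start method:", multiprocessing.get_start_method())

    # A context lets you pick a start method without changing the global default.
    ctx = multiprocessing.get_context("spawn")
    p = ctx.Process(target=report)
    p.start()
    p.join()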
Sharing Data Between Processes
When working with multiprocessing, you often need to share data between processes. Since each process runs in its own memory space, they cannot access each other’s variables directly. Fortunately, multiprocessing
provides mechanisms like Queues, Pipes, and Shared Memory to facilitate communication between processes.
Using a Queue for Data Sharing
A Queue is a process-safe way to exchange data between different processes. It operates like a standard Python queue but is designed to be shared between processes.
Example 2: Producer and Consumer with a Queue
import multiprocessing

def producer(queue):
    for i in range(5):
        print(f"Producer adding {i} to queue")
        queue.put(i)

def consumer(queue):
    while True:
        item = queue.get()
        if item is None:  # End signal
            break
        print(f"Consumer received {item}")

if __name__ == "__main__":
    queue = multiprocessing.Queue()

    # Creating producer and consumer processes
    producer_process = multiprocessing.Process(target=producer, args=(queue,))
    consumer_process = multiprocessing.Process(target=consumer, args=(queue,))

    producer_process.start()
    consumer_process.start()

    producer_process.join()
    queue.put(None)  # Sending end signal to consumer
    consumer_process.join()
Explanation:
We create a queue using multiprocessing.Queue() to pass data between processes.
The producer function puts items into the queue, while the consumer function takes items out of the queue.
After the producer finishes, we send a None to signal the consumer process to stop.
Output (producer and consumer lines may interleave):
Producer adding 0 to queue
Producer adding 1 to queue
Producer adding 2 to queue
Producer adding 3 to queue
Producer adding 4 to queue
Consumer received 0
Consumer received 1
Consumer received 2
Consumer received 3
Consumer received 4
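A Pipe, mentioned above, is a lighter-weight alternative to a Queue with exactly two endpoints; a minimal sketch:
import multiprocessing

def child(conn):
    # Receive a message from the parent, send a reply, and close this end.
    message = conn.recv()
    conn.send(f"Child got: {message}")
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=child, args=(child_conn,))
    p.start()
    parent_conn.send("hello")
    print(parent_conn.recv())  # Child got: hello
    p.join()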
Shared Memory with Value and Array
In some situations, you may want to share mutable data structures like arrays or values between processes. The multiprocessing module provides Value and Array for this purpose.
Example 3: Incrementing a Shared Counter
import multiprocessing

def increment(shared_value):
    for _ in range(100000):
        # += is a read-modify-write, so guard it with the value's lock
        # to avoid lost updates between processes.
        with shared_value.get_lock():
            shared_value.value += 1

if __name__ == "__main__":
    shared_value = multiprocessing.Value('i', 0)  # Shared integer value
    processes = []

    for _ in range(4):
        p = multiprocessing.Process(target=increment, args=(shared_value,))
        processes.append(p)
        p.start()

    for p in processes:
        p.join()

    print(f"Final value: {shared_value.value}")
Explanation:
multiprocessing.Value creates a shared integer that can be modified by multiple processes.
We spawn multiple processes to increment the value. Because += is not an atomic operation, each increment acquires the value's get_lock(), so no updates are lost and the final count is exactly 400000.
Output:
Final value: 400000
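Array works the same way for fixed-type sequences; a minimal sketch in which each worker writes to its own slot (so no extra locking is needed):
import multiprocessing

def fill(shared_array, index, value):
    # Each worker writes only to its own slot.
    shared_array[index] = value

if __name__ == "__main__":
    shared_array = multiprocessing.Array('i', 4)  # four shared integers, initialized to 0
    processes = [
        multiprocessing.Process(target=fill, args=(shared_array, i, i * 10))
        for i in range(4)
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(list(shared_array))  # [0, 10, 20, 30]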
Using Pool for Parallel Processing
A more efficient way to run multiple processes is by using Pool, which is part of the multiprocessing module. Pool creates a pool of worker processes and distributes the tasks across the available workers.
Example 4: Using Pool to Parallelize Tasks
import multiprocessing

def square_number(n):
    return n * n

if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 5]

    with multiprocessing.Pool(processes=2) as pool:
        results = pool.map(square_number, numbers)

    print("Squared numbers:", results)
Explanation:
The
Pool
object allows you to parallelize the execution of a function across multiple inputs.We used
pool.map()
to apply thesquare_number
function to all items in the listnumbers
.
Output:
Squared numbers: [1, 4, 9, 16, 25]
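Pool offers more than map(). When the function takes several arguments, starmap() unpacks each tuple of inputs for you; a minimal sketch (power is just an illustrative worker):
import multiprocessing

def power(base, exponent):
    return base ** exponent

if __name__ == "__main__":
    pairs = [(2, 3), (3, 2), (4, 2)]

    with multiprocessing.Pool(processes=2) as pool:
        # Each tuple is unpacked into power(base, exponent).
        results = pool.starmap(power, pairs)

    print(results)  # [8, 9, 16]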
Best Practices for Writing Efficient Code Using multiprocessing
1. Avoid Excessive Memory Usage
Since processes in multiprocessing have independent memory spaces, avoid creating very large objects that need to be copied across processes. Instead, use shared memory or pass data through a Queue or Pipe for better memory efficiency.
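On Python 3.8 and newer, the shared_memory submodule lets processes view the same buffer without copying it; a minimal sketch, assuming a small byte payload:
from multiprocessing import Process, shared_memory

def reader(name):
    # Attach to the existing block by name and read its bytes without copying.
    shm = shared_memory.SharedMemory(name=name)
    print(bytes(shm.buf[:5]))
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=5)
    shm.buf[:5] = b"hello"

    p = Process(target=reader, args=(shm.name,))
    p.start()
    p.join()

    shm.close()
    shm.unlink()  # release the block once everyone is done with it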
2. Minimize the Use of Global Variables
Global variables are not shared across processes, and with the spawn start method a child may not even see values the parent set after import. It's better to pass the data each process needs explicitly as arguments.
3. Use the Pool Class for Task Parallelism
When performing parallel tasks with a large number of inputs, Pool is a more efficient solution than manually creating processes. It handles the distribution of tasks and manages the worker processes internally.
4. Be Aware of Synchronization
When multiple processes access shared resources, use synchronization mechanisms like Lock or Semaphore to prevent race conditions.
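A minimal sketch of guarding a shared resource with a Lock (the log file name is just an illustration):
import multiprocessing

def log_line(lock, text):
    # Only one process at a time may append to the shared file.
    with lock:
        with open("shared.log", "a") as f:
            f.write(text + "\n")

if __name__ == "__main__":
    lock = multiprocessing.Lock()
    processes = [
        multiprocessing.Process(target=log_line, args=(lock, f"message {i}"))
        for i in range(4)
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()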
Conclusion
In this blog, we explored Python’s multiprocessing module, which allows us to take advantage of multiple CPU cores and write truly parallel programs. By using processes instead of threads, it sidesteps the GIL and can significantly speed up CPU-bound tasks.
We covered:
How to create and manage processes using the Process class.
Sharing data between processes using Queue and Value.
Efficient task parallelism using Pool.
Best practices for writing efficient, parallelized Python code.
Understanding and leveraging the multiprocessing module is essential for writing efficient and scalable applications, especially when working with CPU-heavy tasks like image processing, machine learning, and large-scale data analysis.
Happy coding!