CPU threading and TorchScript inference — PyTorch 2.3 documentation (2024)

PyTorch allows using multiple CPU threads during TorchScript model inference. The following figure shows different levels of parallelism one would find in a typical application:

One or more inference threads execute a model’s forward pass on the given inputs. Each inference thread invokes a JIT interpreter that executes the ops of a model inline, one by one. A model can utilize a fork TorchScript primitive to launch an asynchronous task. Forking several operations at once results in a task that is executed in parallel. The fork operator returns a Future object which can be used to synchronize on later, for example:

@torch.jit.script
def compute_z(x):
    return torch.mm(x, self.w_z)

@torch.jit.script
def forward(x):
    # launch compute_z asynchronously:
    fut = torch.jit._fork(compute_z, x)
    # execute the next operation in parallel to compute_z:
    y = torch.mm(x, self.w_y)
    # wait for the result of compute_z:
    z = torch.jit._wait(fut)
    return y + z
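As written, compute_z and forward reference self.w_z and self.w_y, so the snippet assumes it lives inside a module. A self-contained, runnable variant might look like the following sketch; the ParallelMM module name, the weight shapes, and the use of the public torch.jit.fork/torch.jit.wait aliases are illustrative assumptions rather than part of the original example:

import torch
import torch.nn as nn

@torch.jit.script
def compute_z(x, w_z):
    return torch.mm(x, w_z)

class ParallelMM(nn.Module):
    # Hypothetical module, used only to illustrate fork/wait.
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.w_y = nn.Parameter(torch.randn(dim, dim))
        self.w_z = nn.Parameter(torch.randn(dim, dim))

    def forward(self, x):
        # launch compute_z asynchronously on the inter-op thread pool:
        fut = torch.jit.fork(compute_z, x, self.w_z)
        # execute the next operation in parallel to compute_z:
        y = torch.mm(x, self.w_y)
        # wait for the result of compute_z:
        z = torch.jit.wait(fut)
        return y + z

scripted = torch.jit.script(ParallelMM())
out = scripted(torch.randn(1024, 1024))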

PyTorch uses a single thread pool for the inter-op parallelism; this thread pool is shared by all inference tasks that are forked within the application process.

In addition to the inter-op parallelism, PyTorch can also utilize multiple threads within the ops (intra-op parallelism). This can be useful in many cases, including element-wise ops on large tensors, convolutions, GEMMs, embedding lookups and others.
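For instance, a single large GEMM already runs across the intra-op thread pool without any extra code; a minimal illustration (the tensor sizes are arbitrary):

import torch

print(torch.get_num_threads())   # size of the intra-op thread pool
x = torch.randn(4096, 4096)
y = torch.randn(4096, 4096)
z = torch.mm(x, y)               # this single op is split across the intra-op threads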

Build options

PyTorch uses an internal ATen library to implement ops. In addition to that, PyTorch can also be built with support for external libraries, such as MKL and MKL-DNN, to speed up computations on CPU.

ATen, MKL and MKL-DNN support intra-op parallelism and depend on the following parallelization libraries to implement it:

  • OpenMP - a standard (and a library, usually shipped with a compiler), widely used in external libraries;

  • TBB - a newer parallelization library optimized for task-based parallelism and concurrent environments.

OpenMP historically has been used by a large number of libraries. It is known for its relative ease of use and its support for loop-based parallelism and other primitives.

TBB is used to a lesser extent in external libraries, but, at the same time, is optimized for concurrent environments. PyTorch’s TBB backend guarantees that there’s a separate, single, per-process intra-op thread pool used by all of the ops running in the application.

Depending on the use case, one might find one or the other parallelization library a better choice in their application.

PyTorch allows selecting the parallelization backend used by ATen and other libraries at build time with the following build options:

Library    Build Option          Values               Notes
ATen       ATEN_THREADING        OMP (default), TBB
MKL        MKL_THREADING         (same)               To enable MKL use BLAS=MKL
MKL-DNN    MKLDNN_CPU_RUNTIME    (same)               To enable MKL-DNN use USE_MKLDNN=1

It is recommended not to mix OpenMP and TBB within one build.

Any of the TBB values above require the USE_TBB=1 build setting (default: OFF). A separate setting USE_OPENMP=1 (default: ON) is required for OpenMP parallelism.

Runtime API

The following API is used to control thread settings:

Inter-op parallelism

  • at::set_num_interop_threads, at::get_num_interop_threads (C++)

  • set_num_interop_threads, get_num_interop_threads (Python, torch module)

  Default number of threads: number of CPU cores.

Intra-op parallelism

  • at::set_num_threads, at::get_num_threads (C++)

  • set_num_threads, get_num_threads (Python, torch module)

  • Environment variables: OMP_NUM_THREADS and MKL_NUM_THREADS

For the intra-op parallelism settings, at::set_num_threads and torch.set_num_threads always take precedence over the environment variables, and the MKL_NUM_THREADS variable takes precedence over OMP_NUM_THREADS.
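A minimal sketch of the Python side of these settings (the specific thread counts are arbitrary; note that set_num_interop_threads must be called before any inter-op parallel work has started in the process):

import torch

# Intra-op pool: threads used inside individual ops (GEMMs, element-wise ops, ...).
torch.set_num_threads(4)
print(torch.get_num_threads())          # -> 4

# Inter-op pool: threads used to run tasks forked with torch.jit.fork.
torch.set_num_interop_threads(2)
print(torch.get_num_interop_threads())  # -> 2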

Tuning the number of threads

The following simple script shows how a runtime of matrix multiplication changes with the number of threads:

import torch
import timeit

runtimes = []
threads = [1] + [t for t in range(2, 49, 2)]
for t in threads:
    torch.set_num_threads(t)
    r = timeit.timeit(setup="import torch; x = torch.randn(1024, 1024); y = torch.randn(1024, 1024)", stmt="torch.mm(x, y)", number=100)
    runtimes.append(r)
# ... plotting (threads, runtimes) ...
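The plotting step elided in the last comment could be done, for example, with matplotlib (an assumption; any plotting library works), continuing from the threads and runtimes lists above:

import matplotlib.pyplot as plt

plt.plot(threads, runtimes, marker="o")
plt.xlabel("number of intra-op threads")
plt.ylabel("runtime of 100 torch.mm calls (seconds)")
plt.show()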

Running the script on a system with 24 physical CPU cores (Xeon E5-2680, MKL and OpenMP based build) results in the following runtimes:

The following considerations should be taken into account when tuning the number of intra- and inter-op threads:

  • When choosing the number of threads one needs to avoid oversubscription (using too many threads leads to performance degradation). For example, in an application that uses a large application thread pool or heavily relies on inter-op parallelism, one might find that disabling intra-op parallelism is a possible option (i.e. by calling set_num_threads(1));

  • In a typical application one might encounter a trade-off between latency (time spent on processing an inference request) and throughput (amount of work done per unit of time). Tuning the number of threads can be a useful tool to adjust this trade-off in one way or another. For example, in latency-critical applications one might want to increase the number of intra-op threads to process each request as fast as possible. At the same time, parallel implementations of ops may add extra overhead that increases the amount of work done per single request and thus reduces the overall throughput.

Warning

OpenMP does not guarantee that a single per-process intra-op thread pool is going to be used in the application. On the contrary, two different application or inter-op threads may use different OpenMP thread pools for intra-op work. This might result in a large number of threads used by the application. Extra care in tuning the number of threads is needed to avoid oversubscription in multi-threaded applications in the OpenMP case.

Note

Pre-built PyTorch releases are compiled with OpenMP support.

Note

The parallel_info utility prints information about thread settings and can be used for debugging. Similar output can also be obtained in Python with a torch.__config__.parallel_info() call.


FAQs

Is PyTorch multithreaded on CPU? ›

PyTorch uses a single thread pool for the inter-op parallelism; this thread pool is shared by all inference tasks that are forked within the application process. In addition to the inter-op parallelism, PyTorch can also utilize multiple threads within the ops (intra-op parallelism).

What is the difference between TorchScript and PyTorch? ›

TorchScript is an intermediate representation of a PyTorch model (a subclass of nn.Module) that can then be run in a high-performance environment like C++. It's a high-performance subset of Python that is meant to be consumed by the PyTorch JIT Compiler, which performs run-time optimization on your model's computation.

What does Torch trace do? ›

torch.trace returns the sum of the elements of the diagonal of the input 2-D matrix.
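Note that this answer describes torch.trace, the linear-algebra op, which is distinct from torch.jit.trace (which records the ops executed by a model to build a TorchScript graph). A quick example of the former:

import torch

x = torch.eye(3)
print(torch.trace(x))   # tensor(3.) -- the sum of the diagonal elements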

What is Torch jit ignore? ›

torch.jit.ignore(drop=False, **kwargs): this decorator indicates to the compiler that a function or method should be ignored and left as a Python function. This allows you to leave code in your model that is not yet TorchScript compatible.
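A small usage sketch (the module and method names are made up for illustration):

import torch
import torch.nn as nn

class MyModule(nn.Module):
    @torch.jit.ignore
    def debug_print(self, x):
        # arbitrary Python that TorchScript cannot compile is fine here
        print("mean:", x.mean().item())

    def forward(self, x):
        self.debug_print(x)   # left as a call back into Python by the compiler
        return x + 1

m = torch.jit.script(MyModule())
m(torch.randn(3))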

What is the disadvantage of multithreading for a CPU? ›

The task of managing concurrency among threads is difficult and has the potential to introduce new problems into an application. Testing a multithreaded application is more difficult than testing a single-threaded application because defects are often timing-related and more difficult to reproduce.

How many threads can run on 1 vCPU? ›

A thread is a set of instructions that allows a CPU core to be split into multiple virtual (logical) cores to increase performance. One CPU core usually has two threads.

What is the benefit of TorchScript? ›

TorchScript gives us a representation in which we can do compiler optimizations on the code to provide more efficient execution. TorchScript allows us to interface with many backend/device runtimes that require a broader view of the program than individual operators.

Is TorchScript faster than Onnx? ›

ONNX influence on performance

We typically saw an increase of over 50% in the model speed performance when compared to raw PyTorch with ONNX being much faster than TorchScript!

Is PyTorch still relevant? ›

Real-World Applications: PyTorch is prominent in academia and research-focused industries, while TensorFlow is widely used in industry for large-scale applications. Future Prospects: Both frameworks are evolving, with PyTorch focusing on usability and TensorFlow on scalability and optimization.

What are the limitations of TorchScript? ›

It can reuse existing eager model code and can handle almost any program with exclusive torch tensors/operations. Its main drawback is that it omits all control flow, data structures, and python constructs. It can also create unfaithful representations without any warnings.

What is the difference between Torch FX and TorchScript? ›

torch.fx is different from TorchScript in that it is a platform for Python-to-Python transformations of PyTorch code. TorchScript, on the other hand, is more targeted at moving PyTorch programs outside of Python for deployment purposes.

Does Torch use GPU? ›

GPU acceleration in PyTorch is a crucial feature that allows users to leverage the computational power of Graphics Processing Units (GPUs) to accelerate the training and inference of deep learning models. PyTorch provides a seamless way to utilize GPUs through its torch.cuda module.

What is the difference between detach and data in Torch? ›

.detach() detaches a tensor from the computation graph, so the returned tensor does not require a gradient, while .data only accesses the underlying tensor data of a Variable. They serve different purposes and are used in different cases.
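A tiny illustration of the difference (the values are arbitrary):

import torch

x = torch.ones(3, requires_grad=True)
y = x.detach()   # shares storage with x; requires_grad=False, and in-place edits are still checked by autograd
z = x.data       # shares storage with x; completely untracked, so in-place edits can silently corrupt gradients

print(y.requires_grad, z.requires_grad)   # False False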

Does Torch empty allocate memory? ›

The torch.empty() call allocates memory for the tensor, but does not initialize it with any values, so what you're seeing is whatever was in memory at the time of allocation.

What is the difference between torch tensor and torch zeros? ›

torch.Tensor does not initialize the memory, so you get whatever was there before. torch.zeros actually zeroes out the memory, so you get a tensor full of 0s.
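A quick demonstration of both points above (the contents of the uninitialized tensor are whatever happened to be in that memory):

import torch

a = torch.empty(2, 3)   # memory is allocated but not initialized: values are arbitrary
b = torch.zeros(2, 3)   # memory is allocated and zero-filled
print(a)
print(b)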

Does PyTorch run on CPU or GPU? ›

WML CE includes GPU-enabled and CPU-only variants of PyTorch, and some companion packages. The GPU-enabled variant pulls in CUDA and other NVIDIA components during install. It has a larger installation size and includes support for advanced features that require a GPU, such as DDL, LMS, and NVIDIA's Apex.

Does PyTorch use all CPU cores? ›

Key Takeaways: Pytorch Use All CPU Cores

PyTorch can efficiently utilize all CPU cores for faster computation. With PyTorch, you can use the entire CPU power to train and evaluate your deep learning models. By default, PyTorch's intra-op thread pool uses as many threads as there are CPU cores, and you can change this setting (for example with torch.set_num_threads).

Is PyTorch faster than TensorFlow on CPU? ›

In general, TensorFlow and PyTorch implementations show equal accuracy. However, the training time of TensorFlow is substantially higher, but the memory usage was lower. PyTorch allows quicker prototyping than TensorFlow. However, TensorFlow may be a better option if custom features are needed in the neural network.

Is PyTorch faster than NumPy on CPU? ›

For small matrix operations on CPU (e.g., 3x3 matrices), NumPy seems to be approximately 15x faster than PyTorch. This is relevant for operations in the data loader or (geometric) transformations, where I could achieve great speedups by using NumPy arrays instead of tensors.
