In web scraping and data collection tasks, ensuring your scripts run efficiently and don’t hang indefinitely is crucial. The requests library in Python simplifies HTTP requests, but it doesn’t provide a straightforward way to cap a request’s total execution time. This tutorial covers how to implement timeouts effectively with the requests library to keep your web scraping code robust.
Understanding Timeouts
The timeout parameter in requests.get() serves a specific purpose: it limits the time spent waiting for a response from the server, rather than capping the entire request’s execution. It is applied separately to the connection establishment and data reading phases:
- Connect Timeout: The maximum time allowed to establish a connection with the server.
- Read Timeout: The maximum time the client will wait between bytes sent by the server once the connection is established; it does not bound the total time needed to receive the full response.
You can specify a single number that applies to both phases, or control them individually using a tuple: (connect_timeout, read_timeout).
Using requests.get with Timeouts
Here’s a basic example of how to set a timeout when making an HTTP GET request:
import requests

try:
    # 10 seconds for the connect phase and 10 seconds for the read phase.
    response = requests.get("http://example.com", timeout=10)
except requests.exceptions.Timeout:
    print("The request timed out")
In this code, the timeout parameter is set to 10 seconds, which is shorthand for timeout=(10, 10). If establishing the connection takes longer than 10 seconds, or the server goes more than 10 seconds without sending any data, a requests.exceptions.Timeout exception is raised.
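If you need to know which phase timed out, requests also exposes the more specific subclasses requests.exceptions.ConnectTimeout and requests.exceptions.ReadTimeout (both derive from requests.exceptions.Timeout). A minimal sketch along the same lines as the example above, again using example.com as a placeholder URL:
import requests

try:
    # The single value applies to both the connect and the read phase.
    response = requests.get("http://example.com", timeout=10)
except requests.exceptions.ConnectTimeout:
    print("Timed out while connecting to the server")
except requests.exceptions.ReadTimeout:
    print("Timed out while waiting for the server to send data")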
Advanced Usage with Tuple Timeouts
If you need more granular control over connect and read timeouts:
import requests

try:
    # 5-second connect timeout, 10-second read timeout.
    response = requests.get("http://example.com", timeout=(5, 10))
except requests.exceptions.Timeout:
    print("The request timed out")
Here, timeout is set as the tuple (5, 10): the connection must be established within 5 seconds, and once connected, the server may go at most 10 seconds between sending bytes. Even so, this does not cap how long a large or slow response takes overall.
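Because the read timeout is measured between bytes rather than over the whole body, a big download can legitimately run much longer than 10 seconds in total. The sketch below illustrates this with a streamed download; the URL is only a placeholder, and the chunk handling is deliberately minimal:
import requests

try:
    # stream=True defers the body download; the 10-second read timeout then
    # applies to each wait for more data, not to the download as a whole.
    response = requests.get("http://example.com/large-file", stream=True, timeout=(5, 10))
    chunks = []
    for chunk in response.iter_content(chunk_size=8192):
        chunks.append(chunk)  # Replace with real per-chunk processing.
    body = b"".join(chunks)
except requests.exceptions.RequestException as e:
    # Covers Timeout as well as errors raised while iterating the streamed body.
    print(f"The request failed: {e}")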
Ensuring Total Execution Time with sys.settrace
For cases where you need to enforce a total execution time limit (including all processing after receiving data), Python’s sys.settrace can be used. This approach is more complex, but it ensures your request handling doesn’t hang indefinitely due to prolonged processing:
import sys
import time

import requests

TOTAL_TIMEOUT = 10  # Total allowed execution time in seconds.

def trace_function(frame, event, arg):
    # Invoked for Python-level events in frames entered after sys.settrace();
    # raising here aborts whatever Python code is currently running.
    if time.time() - start > TOTAL_TIMEOUT:
        raise Exception("Total timeout exceeded!")
    return trace_function

start = time.time()
sys.settrace(trace_function)
try:
    response = requests.get("http://localhost:8080", timeout=(3, 6))
except Exception as e:
    print(e)
finally:
    sys.settrace(None)  # Reset to remove the tracing function.
# Handle the response here.
This setup uses sys.settrace to raise an exception if any part of your request processing exceeds a predefined total execution time. Note that the trace function only fires for Python-level operations; time spent blocked on the network is still governed by the timeout parameter, so keep both in place.
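If you use this pattern in several places, it can be wrapped in a small context manager. The sketch below packages the same sys.settrace idea shown above; the total_timeout and TotalTimeoutError names are introduced here purely for illustration:
import sys
import time
from contextlib import contextmanager

import requests

class TotalTimeoutError(Exception):
    """Raised when the wall-clock budget for a block is exhausted."""

@contextmanager
def total_timeout(seconds):
    start = time.time()

    def trace_function(frame, event, arg):
        # Only Python-level work is interrupted; blocking socket reads are
        # still bounded by the requests timeout parameter.
        if time.time() - start > seconds:
            raise TotalTimeoutError(f"Exceeded {seconds} seconds of total execution time")
        return trace_function

    sys.settrace(trace_function)
    try:
        yield
    finally:
        sys.settrace(None)  # Always remove the tracing function.

try:
    with total_timeout(10):
        response = requests.get("http://example.com", timeout=(3, 6))
        # Post-processing of the response goes here.
except TotalTimeoutError as e:
    print(e)
except requests.exceptions.Timeout:
    print("The request timed out")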
Best Practices
- Graceful Error Handling: Always handle exceptions such as requests.exceptions.Timeout so failures are managed gracefully (see the retry sketch after this list).
- Testing and Tuning: Test your timeout values under real network conditions, since they can vary widely.
- Combining Techniques: Use request-specific timeouts together with sys.settrace for comprehensive control.
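As a concrete example of graceful error handling, a small retry wrapper around requests.get is often enough. The fetch_with_retries helper below is only an illustrative sketch; the retry count and timeout values are arbitrary choices to tune for your environment:
import requests

def fetch_with_retries(url, retries=3, timeout=(5, 10)):
    # Retry a GET request a few times before giving up on timeouts.
    for attempt in range(1, retries + 1):
        try:
            return requests.get(url, timeout=timeout)
        except requests.exceptions.Timeout:
            print(f"Attempt {attempt} of {retries} timed out")
    return None

response = fetch_with_retries("http://example.com")
if response is None:
    print("All attempts timed out")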
By implementing these techniques, you ensure your scripts remain responsive and efficient, avoiding pitfalls associated with uncontrolled execution times during web scraping tasks.