Over the past couple days, I’ve been trying to figure out why long-running (due to large file upload & synchronous processing time) HTTP requests to our REST API are throwing “Faraday::Error::ConnectionFailed: end of file reached” after one minute.
The client is: Ruby 1.9.3-p194, Faraday 0.8.5 using the default Net::HTTP adapter, and Hyperclient 0.3.0.
The server is: Rails 3.2.12 on Ruby 1.9.3-p194, via Apache with Passenger.
The exception looks like:
Caught Faraday::Error::ConnectionFailed: end of file reached /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/protocol.rb:141:in `read_nonblock' /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/protocol.rb:141:in `rbuf_fill' /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/protocol.rb:122:in `readuntil' /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/protocol.rb:132:in `readline' /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/http.rb:2562:in `read_status_line' /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/http.rb:2551:in `read_new' /usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/http.rb:1319:in `block in transport_request'
The 60 second timing was rather mysterious for some time because:
- The timeout option of Net::HTTP in the client had no effect, and even if it did, I would expect to see a Timeout::Error, not an EOFError.
- Extending the Apache Timeout setting had no effect, even though Apache was sometimes logging the following error:
[ pid=6209 thr=140145517742048 file=ext/apache2/Hooks.cpp:808 time=2013-02-25 20:03:36.394 ]: Either the visitor clicked on the 'Stop' button in the web browser, or the visitor's connection has stalled and couldn't receive the data that Apache is sending to it. As a result, you will probably see a 'Broken Pipe' error in this log file. Please ignore it, this is normal. You might also want to increase Apache's TimeOut configuration option if you experience this problem often.
We were also seeing errors in Apache’s logs like:
[ pid=13250 thr=139854544902112 file=ext/apache2/Hooks.cpp:810 time=2013-02-22 20:19:49.359 ]: No data received from the back end application (process 27551) within 300000 msec. Either the backend application is frozen, or your TimeOut value of 300 seconds is too low. Please check whether your application is frozen, or increase the value of the TimeOut configuration directive.
That means Apache wasn’t hearing anything back from Passenger/Rails for 300 seconds. Sometimes, we’d also see several Ruby (Rails) processes running for long periods of time and using a large share of the CPU. Normally, Passenger only keeps one Ruby process around, but in this case Ruby processes were remaining and even being orphaned out of the passenger-status listing. I’m still not sure why those processes were getting orphaned (was their work really taking longer than 300 seconds?).
I was eventually able to get rid of the EOF problem by bypassing the Amazon Load Balancer (aka Elastic Load Balancer or ELB) in front of our EC2 instances, which finally allowed me to pinpoint the source of the problem. Browsing the internet eventually brought me to this thread: Elastic Load Balancer: HTTP Connections time out after 60 seconds. The potential solution mentioned in the thread is that Amazon can increase the timeout on your load balancer.
While I’m happy to have a possible solution, I’m not sure why the failure occurs exactly one minute after the request initiates. I would expect that sending the body of the request would take a significant amount of time (>30 seconds), during which the connection would have traffic and therefore the ELB timeout would not start ticking. However, the software always threw the EOF error after exactly one minute.
In any case, I’m hopeful that asking for an extension to the load balancer timeout will resolve the issue. I’ll try to remember to update this post once I verify the solution.