Ruby HTTP request gets “end of file reached” after 60 seconds

Over the past couple of days, I’ve been trying to figure out why long-running HTTP requests to our REST API (long-running because of a large file upload plus synchronous processing time) fail with “Faraday::Error::ConnectionFailed: end of file reached” after one minute.

The client is: Ruby 1.9.3-p194, Faraday 0.8.5 using the default Net::HTTP adapter, and Hyperclient 0.3.0.

The server is: Rails 3.2.12 on Ruby 1.9.3-p194, via Apache with Passenger.

The exception looks like:

Caught Faraday::Error::ConnectionFailed: end of file reached
/usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/protocol.rb:141:in `read_nonblock'
/usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/protocol.rb:141:in `rbuf_fill'
/usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/protocol.rb:122:in `readuntil'
/usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/protocol.rb:132:in `readline'
/usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/http.rb:2562:in `read_status_line'
/usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/http.rb:2551:in `read_new'
/usr/local/rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/net/http.rb:1319:in `block in transport_request'
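The EOFError in read_status_line means the other end closed the TCP connection while Net::HTTP was waiting for a status line. The same class of failure can be reproduced with a minimal, self-contained sketch (everything below is illustrative, not our actual client code): a local server that accepts connections and drops them without responding, which is roughly what a proxy timing out an idle backend connection looks like from the client’s side.

```ruby
require 'net/http'
require 'socket'

# A throwaway server that accepts each connection and closes it without
# ever writing a response.
server = TCPServer.new('127.0.0.1', 0)
port = server.addr[1]
Thread.new { loop { server.accept.close } }

err = nil
begin
  Net::HTTP.start('127.0.0.1', port) { |http| http.get('/') }
rescue EOFError, Errno::ECONNRESET, Errno::EPIPE => e
  # The client never hit its own timeout; the peer simply hung up first.
  err = e
  puts "Request failed: #{e.class}"
end
```

Note that the failure surfaces as an EOFError (which Faraday wraps in ConnectionFailed), not a Timeout::Error, because no client-side timer ever expired.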

The 60-second timing was mysterious for quite a while because:

  • The timeout option of Net::HTTP in the client had no effect, and even if it had, I would expect to see a Timeout::Error, not an EOFError.
  • Extending Apache’s Timeout setting had no effect, even though Apache was sometimes logging the following error:
[ pid=6209 thr=140145517742048 file=ext/apache2/Hooks.cpp:808 time=2013-02-25 20:03:36.394 ]:
Either the visitor clicked on the 'Stop' button in the web browser, or the visitor's
connection has stalled and couldn't receive the data that Apache is sending to it. As a
result, you will probably see a 'Broken Pipe' error in this log file. Please ignore it, this
is normal. You might also want to increase Apache's TimeOut configuration option if you
experience this problem often.
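For reference, the client-side timeouts I was adjusting ultimately land on two Net::HTTP attributes, since that is Faraday’s default adapter (if I recall the 0.8 API correctly, Faraday exposes them as the :timeout and :open_timeout request options). A stdlib sketch, with a placeholder host and values:

```ruby
require 'net/http'

http = Net::HTTP.new('api.example.com', 443)
http.use_ssl      = true
http.open_timeout = 10   # seconds allowed to establish the TCP connection
http.read_timeout = 300  # seconds to wait for data once connected
```

Crucially, read_timeout only bounds how long *we* wait; it cannot stop the peer (or a proxy in between) from closing the socket first, which is why raising it had no effect on the EOF error.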

We were also seeing errors in Apache’s logs like:

[ pid=13250 thr=139854544902112 file=ext/apache2/Hooks.cpp:810 time=2013-02-22 20:19:49.359 ]:
No data received from the back end application (process 27551) within 300000 msec. Either
the backend application is frozen, or your TimeOut value of 300 seconds is too low. Please
check whether your application is frozen, or increase the value of the TimeOut configuration
directive.

That means Apache wasn’t hearing anything back from Passenger/Rails for 300 seconds. Sometimes we’d also see several Ruby (Rails) processes running for long periods and consuming a large share of the CPU. Normally, Passenger only keeps one Ruby process around, but in this case Ruby processes were lingering and even disappearing from the passenger-status listing while still running. I’m still not sure why those processes were being orphaned (was their work really taking longer than 300 seconds?).

I was eventually able to get rid of the EOF problem by bypassing the Amazon load balancer (Elastic Load Balancer, or ELB) in front of our EC2 instances, which finally let me pinpoint the source of the problem. Searching the web eventually brought me to this thread: “Elastic Load Balancer: HTTP Connections time out after 60 seconds.” The potential solution mentioned in the thread is that Amazon can increase the timeout on your load balancer.

While I’m happy to have a possible solution, I’m not sure why the failure occurs exactly one minute after the request starts. I would expect sending the request body to take a significant amount of time (more than 30 seconds), during which the connection would be carrying traffic, so the ELB’s idle timeout shouldn’t be ticking. Yet the client always threw the EOF error after exactly one minute.

In any case, I’m hopeful that asking for an extension to the load balancer timeout will resolve the issue. I’ll try to remember to update this post once I verify the solution.

2 thoughts on “Ruby HTTP request gets ‘end of file reached’ after 60 seconds”

  1. To be honest, I’m not certain, but I think it helped. The application I was working on had a lot of problems with response times due to issues with its internal architecture.

    If you are having a similar problem, one thing you can try is running Nginx as a reverse proxy (proxy_pass) on each instance that runs your application. Requests come in to the ELB, then go to Nginx on one of your instances, and then to your app. If you are using Docker, that’s how it works by default. What this allows you to do is turn on various forms of logging in the Nginx on your instances. For example, you may find out that the ELB (the client, from Nginx’s perspective) is terminating the connection prematurely (due to timeout). The logging provided by Nginx gives you some more visibility into the behavior of the ELB separately from your app.

    I was just recently debugging an issue with an application that had high max latency values on the AWS ElasticBeanstalk graph. As it turns out, those latency values are calculated from when the first byte of the request arrives to when the request finishes or is terminated. Since our app serves traffic to the general public, there are a lot of slow & unreliable clients (mobile phones on cellular networks, for example) that cause the request to take long enough that the ELB times it out. So, in this case, our app is not causing the issue. I wouldn’t have been able to diagnose the issue as easily if I didn’t have Nginx logging available.
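    The setup the commenter describes might look roughly like this Nginx fragment (a sketch; the port, log path, and log format are placeholders, not configuration from the post):

    ```nginx
    http {
        # $request_time makes premature disconnects visible: a cluster of
        # requests ending at exactly the ELB idle timeout is a strong signal.
        log_format timing '$remote_addr "$request" $status $request_time';

        server {
            listen 80;
            access_log /var/log/nginx/timing.log timing;

            location / {
                # Forward requests from the ELB to the local app
                proxy_pass http://127.0.0.1:3000;  # placeholder app port
                proxy_set_header Host $host;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            }
        }
    }
    ```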
