Frequent 500 errors from API (and on searches, dashboards...)

Hey all

So we’ve been building out our Graylog infrastructure, and are getting to the point where we want to present it to clients. Naturally, we need to turn on TLS, so I spent some time doing that today, following the guide here: http://docs.graylog.org/en/2.4/pages/configuration/https.html

Once I had it working, though, I started getting errors on my dashboard widgets. About 40% of them don’t load initially, showing ‘N/A’ with the red triangle exclamation icon; hovering over it shows: “Error loading widget value: cannot GET… (500)”. If I wait a minute or so, they eventually load. Searches exhibit similar behaviour, often failing to load initially with a 500 error.

I tried disabling TLS, but the issue persisted. I tried updating to the latest graylog version, but the error persisted. My elasticsearch cluster is healthy and green. Messages are being ingested at around 1000-2000 per second, and results seem up-to-date when the searches don’t return 500.

Example of performing a search via API (partly redacted):

# curl -vv "http://ro-user:********:9000/api/search/universal/absolute?query=*****************&from=2018-05-24%2000%3A00%3A00&to=2018-05-24%2000%3A05%3A00&fields=source&filter=streams%3A59df7d23da1a031aaea70e66&limit=1&decorate=false"
*   Trying *****...
* TCP_NODELAY set
* Connected to ******* port 9000 (#0)
* Server auth using Basic with user 'ro-user'
> GET /api/search/universal/absolute?query=*************&from=2018-05-24%2000%3A00%3A00&to=2018-05-24%2000%3A05%3A00&fields=source&filter=streams%3A59df7d23da1a031aaea70e66&limit=1&decorate=false HTTP/1.1
> Host: ***********:9000
> Authorization: Basic **********
> User-Agent: curl/7.58.0
> Accept: */*
> 
< HTTP/1.1 500 Internal Server Error
< X-Graylog-Node-ID: 4a6e00a9-5b27-4241-b1ba-cbad1f430f18
< X-Runtime-Microseconds: 306860
< Content-Type: application/json
< Date: Fri, 08 Jun 2018 21:08:42 GMT
< Connection: close
< Content-Length: 57
< 
* Closing connection 0
{"message":"Unable to perform search query","details":[]}

Aaaaaand… I figured it out.

When I was configuring TLS, I added my third ES node to the ‘elasticsearch_hosts’ config line, but I specified port 9300 instead of 9200. Port 9300 is open, but it is the transport port and does not speak HTTP. So whenever Graylog tried to connect to that port and make an HTTP request, it would error out, while connecting to the other nodes worked fine, which is why about 40% of the dashboard widgets wouldn’t load right away.
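
In case it saves someone else the same headache, here is a rough sketch of a sanity check for that config line. It only parses the value of ‘elasticsearch_hosts’ (a comma-separated list of URIs) and flags entries that point at 9300 or that lack an http/https scheme; it does not contact the nodes. The function name and the example hostnames are made up for illustration, and it assumes your cluster uses the default Elasticsearch ports.

```python
from urllib.parse import urlsplit

# Elasticsearch defaults: 9200 serves the HTTP/REST API,
# 9300 serves the binary transport protocol (not HTTP).
ES_TRANSPORT_PORT = 9300


def suspicious_es_hosts(elasticsearch_hosts):
    """Return (uri, reason) pairs for entries that look misconfigured.

    `elasticsearch_hosts` is the raw comma-separated value from the
    Graylog server config. This is a hypothetical helper, not part of Graylog.
    """
    problems = []
    for raw in elasticsearch_hosts.split(","):
        uri = raw.strip()
        if not uri:
            continue  # ignore empty entries such as a trailing comma
        parts = urlsplit(uri)
        if parts.scheme not in ("http", "https"):
            problems.append((uri, "scheme is not http or https"))
            continue
        try:
            port = parts.port  # raises ValueError on a malformed port
        except ValueError:
            problems.append((uri, "port is not a valid number"))
            continue
        if port == ES_TRANSPORT_PORT:
            problems.append((uri, "port 9300 is the transport port, not HTTP"))
    return problems


if __name__ == "__main__":
    # Hypothetical hostnames; the third entry repeats my mistake.
    hosts = (
        "http://es1.example.org:9200,"
        "http://es2.example.org:9200,"
        "http://es3.example.org:9300"
    )
    for uri, reason in suspicious_es_hosts(hosts):
        print(f"{uri}: {reason}")
```

A check like this would only catch the default-port mix-up; if your nodes listen on custom ports, you would need to actually send an HTTP request to each one to be sure.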

Move along, everyone, nothing to see here but someone with very fat fingers…
