Ingesting historical data into InfluxDB
Context
A friend and I use InfluxDB (v2.7.1) to store time-series data for a joint side project. Every week, we pick a couple of time series that interest us and ingest all of their data points, spanning roughly 50-100 years into the past, into InfluxDB.
Recently, we've noticed that our InfluxDB Docker container crashes frequently on writes. Since the time-series database is a central dependency for our project, this is unacceptable, so we set out to root-cause the issue.
Some details:
- InfluxDB version: 2.7.1 (also saw issue in 2.4)
- Minikube version: 1.27.1 (we run InfluxDB inside Minikube through Kubernetes, via the official Alpine Docker image)
- Specs: 10 GB memory, 12 CPUs, no limits in the Docker spec, so InfluxDB has access to all of it
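For context, a Minikube setup matching these specs would look roughly like this (a sketch; our actual start command may have used different flags):
$ minikube start --memory=10g --cpus=12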
Problem overview
InfluxDB pod logs showed the following:
$ kubectl logs influxdb-stateful-set-0 --previous > /tmp/logs.out
$ cat /tmp/logs.out
...
runtime/cgo: pthread_create failed: Resource temporarily unavailable
SIGABRT: abort
PC=0x7f1baa892f13 m=2 sigcode=18446744073709551610
goroutine 0 [idle]:
runtime: unknown pc 0x7f1baa892f13
stack: frame={sp:0x7f1b815f43f8, fp:0x0} stack=[0x7f1b80df4d58,0x7f1b815f4958)
0x00007f1b815f42f8: 0x0000000000000037 0x0000000000000000
0x00007f1b815f4308: 0x0000000000000000 0x0000000000000000
0x00007f1b815f4318: 0x0000000000000001 0x00007f1bacfc
...
The Go dump spans about 50 thousand more lines and I have omitted it here.
We see that a pthread_create failed.
I then looked inside the Minikube VM to see what went wrong.
First, I ssh-ed into the VM via minikube ssh, then ran sudo dmesg, and the following line stood out:
cgroup: fork rejected by pids controller in /system.slice/docker-fcd16bcd491dc3e88fabe9cc504114a47218dbeaddcb7958db9d13c929823f2d.scope/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod78140808_1a7c_4b38_b083_6b5a82381b44.slice/docker-86a3884408f0075dc171fd40444ba21f164045a8970eb8a56ab0168aa93ecab5.scope
So our InfluxDB process was hitting the pids cgroup limit that Docker had set on its task count.
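As a sanity check, you can ask systemd for the configured task limit on the pod's slice (the slice name comes from the dmesg line above; depending on where the limit is actually attached, you may need to query the container scope instead):
$ systemctl show -p TasksMax kubepods-besteffort-pod78140808_1a7c_4b38_b083_6b5a82381b44.slice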
While reproducing the issue, I ran watch -n 0.01 systemctl status kubepods-besteffort-pod78140808_1a7c_4b38_b083_6b5a82381b44.slice to see how high the Task count was climbing, but it only hit a couple hundred before InfluxDB died. The limit was around 2000 tasks.
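An alternative to polling systemctl status is systemd-cgtop, which shows live per-cgroup task counts sorted by task count (assuming the tool is available inside the Minikube VM):
$ sudo systemd-cgtop -t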
Then, instead of checking the thread count from the Minikube side, I ssh-ed into the InfluxDB container and ran the following script while reproducing the issue:
$ cat script.sh
#!/bin/sh
# Dump the container's cgroup (v2) pids accounting:
# current task count, the configured limit, and how many forks were rejected.
cat /sys/fs/cgroup/pids.current
cat /sys/fs/cgroup/pids.max
cat /sys/fs/cgroup/pids.events
echo "---------------"
$ chmod +x ./script.sh
$ watch -n 0.01 ./script.sh
Indeed, I saw pids.current climb past 2000 (right around pids.max) before InfluxDB died.
So I chalked up the systemctl Task count anomaly to systemctl perhaps lagging in its accounting, and accepted that InfluxDB was, for some reason, spinning up 2000+ threads when we sent it write requests.
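To confirm the tasks belonged to the influxd process itself rather than to child processes, you could also watch its thread count directly; a minimal sketch, assuming influxd runs as PID 1 in the container:
$ watch -n 0.01 'grep Threads /proc/1/status'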
I then decided to look at InfluxDB's logs from just before the crash to see why it might be spinning up so many threads. What stood out was how dense the log was: there were thousands of log lines within a single second. That didn't seem right. A lot of the lines originated from the storage engine and seemed related to opening shard files.
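A crude way to see how shard-heavy the captured log was is to count matching lines (a rough filter; the exact message text varies by version):
$ grep -ci shard /tmp/logs.out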
I looked at the number of shards in our bucket via ls /var/lib/influxdb2/data/$BUCKET_ID/autogen/ and the count stood out. I dove deeper into docs [1] and [2] in the references and also stumbled upon doc [3].
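To put an exact number on it, counting the entries in the data directory above works:
$ ls /var/lib/influxdb2/data/$BUCKET_ID/autogen/ | wc -l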
Indeed, by default InfluxDB creates one shard group per 7 days when a bucket has infinite retention, which ours did. So when we wrote data spanning the last 100 years into the database, it had to create and write across a huge number of shards.
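Back-of-the-envelope: 100 years of data at one shard group per week is over five thousand shard groups.
$ echo $((100 * 365 / 7))
5214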
You can run the following command to check your shard group duration:
$ influxdb-stateful-set-0:/# influx bucket list --org $ORG_NAME --token $TOKEN
ID                 Name          Retention   Shard group duration   Organization ID   Schema Type
6a8934b488296ff3   _monitoring   168h0m0s    24h0m0s                REDACTED          implicit
6ce28becfaed76d2   _tasks        72h0m0s     24h0m0s                REDACTED          implicit
REDACTED           REDACTED      infinite    168h0m0s               REDACTED          implicit
As we can see above, our bucket's shard group duration was only 168h (7 days).
Solution
We first tried updating the shard group duration of our existing bucket, but since it already had data, that didn't seem to take effect, at least not within 5 minutes of the update. (As far as we can tell, the new shard group duration only applies to shard groups created after the change; existing shards keep theirs.)
$ influxdb-stateful-set-0:/# influx bucket update --token $TOKEN --shard-group-duration 80w --id $BUCKET_ID --retention 0
So I deleted the bucket to ensure all of its shards were removed, re-created an empty bucket with a more appropriate shard group duration, and verified that we no longer saw the issue.
$ influxdb-stateful-set-0:/# influx bucket delete --name $BUCKET_NAME --org $ORG_NAME
$ influxdb-stateful-set-0:/# influx bucket create --name $BUCKET_NAME --token $TOKEN --shard-group-duration 100w --retention 12000w --org $ORG_NAME
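Afterwards, you can verify the new settings; listing the bucket again should show the updated retention and shard group duration (the --name flag filters the listing to one bucket):
$ influxdb-stateful-set-0:/# influx bucket list --name $BUCKET_NAME --org $ORG_NAME --token $TOKEN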
Conclusion
InfluxDB's defaults weren't lined up with our use case. Hopefully this helps others using InfluxDB in a similar way.