Coinbase Pro (previously GDAX) has a nice little API for pulling market data, so I decided to use it to scrape all open orders once per minute. I’ve made this data publicly available via an S3 bucket. Filenames represent a day’s worth of data in the format year-month-day.hdf. The bucket can be found here:
https://cryptoexchanges.veraciousdata.io/
How to Use
The bucket itself is public, so any set of creds can list and pull objects. Object keys are prefixed with the exchange the data was pulled from, though at the moment I’m only scraping Coinbase:
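For instance, here’s a minimal sketch of listing the day files with boto3. The bucket name and the coinbase/ key prefix are assumptions inferred from the URL above, and object_key is a hypothetical helper:

```python
from datetime import date

def object_key(exchange: str, day: date) -> str:
    """Build the expected object key, e.g. 'coinbase/2018-07-14.hdf'.
    (The exact key layout is an assumption on my part.)"""
    return f"{exchange}/{day:%Y-%m-%d}.hdf"

if __name__ == "__main__":
    # Requires boto3; the bucket is public, so any creds can list it.
    import boto3
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(
        Bucket="cryptoexchanges.veraciousdata.io",  # assumed bucket name
        Prefix="coinbase/",
    )
    for obj in resp.get("Contents", []):
        print(obj["Key"])
```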
The data is written via HDF5 with a single file dedicated to each day. I used the h5py Python module, but any HDF5 reader would suffice:
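A quick sketch of reading a day file with h5py; since I can’t bundle a real download here, it first writes a tiny stand-in file with the same one-file-per-day shape:

```python
import os
import tempfile

import h5py

# Build a tiny stand-in file, since we can't assume a downloaded copy here.
path = os.path.join(tempfile.mkdtemp(), "2018-07-14.hdf")
with h5py.File(path, "w") as f:
    f.create_group("2018-07-14T00:00:07")  # one group per scrape

# Any HDF5 reader works; h5py exposes groups dict-style.
with h5py.File(path, "r") as day:
    groups = list(day.keys())
print(groups)
```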
The HDF itself is split into groups, one for each set of queries, named by the UTC timestamp at which the scrape started:
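Those group names can be parsed straight back into datetime objects. The exact timestamp format is an assumption here (ISO 8601 shown):

```python
from datetime import datetime, timezone

def parse_scrape_time(group_name: str) -> datetime:
    """Turn a group name like '2018-07-14T00:00:07' into an aware UTC
    datetime. The ISO 8601 format is assumed, not confirmed."""
    return datetime.fromisoformat(group_name).replace(tzinfo=timezone.utc)

ts = parse_scrape_time("2018-07-14T00:00:07")
print(ts.minute, ts.second)
```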
Within each group, there’s a bids, asks, and price-points subgroup. The bids and asks groups are the complete order books listed on Coinbase at that time. They were pulled via the /products/BTC-USD/book?level=3 API call. Each item within bids and asks can be retrieved via the price and size keys.
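Here’s a hedged sketch of pulling price and size out of the asks group. The compound price/size dataset layout is my assumption, and the file and numbers are synthetic stand-ins:

```python
import os
import tempfile

import h5py
import numpy as np

# Assumed layout: bids/asks as compound datasets with price/size fields.
book_dtype = np.dtype([("price", "f8"), ("size", "f8")])
path = os.path.join(tempfile.mkdtemp(), "2018-07-14.hdf")
with h5py.File(path, "w") as f:
    grp = f.create_group("2018-07-14T00:00:07")
    grp.create_dataset("asks", data=np.array(
        [(6350.01, 0.5), (6350.02, 1.2)], dtype=book_dtype))

with h5py.File(path, "r") as day:
    asks = day["2018-07-14T00:00:07"]["asks"]
    first_price = asks["price"][0]  # field selection by key
    first_size = asks["size"][0]
print(first_price, first_size)
```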
You can also use standard numpy ndarray slicing syntax. For example, if you wanted to get just the prices of each open ask order:
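Something along these lines, assuming a compound price/size dtype (an assumption on my part) and made-up numbers:

```python
import numpy as np

# Synthetic asks array with named price/size fields (layout assumed).
asks = np.array([(6350.01, 0.5), (6350.02, 1.2), (8.38e9, 0.01)],
                dtype=[("price", "f8"), ("size", "f8")])

prices = asks["price"]        # just the price of every open ask
top_five = asks["price"][:5]  # or slice it like any other ndarray
print(prices.max())
```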
Can we just take a quick second to note that there are ask orders open for $8.38e+09? Those are definitely going to get filled, guy.
The price-points group has the buy, sell, and spot prices for not only BTC, but ETH, XRP, and LTC as well. The trade volume for these coins is regularly pretty high, and a fair margin higher than the rest of the shitcoins out there, so I figured it’d be worth collecting as well:
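A sketch of reading those price points; the per-coin subgroup layout (price-points/BTC/spot and so on) and every number here are assumptions, with a synthetic file standing in for a real one:

```python
import os
import tempfile

import h5py

# Assumed layout: one subgroup per coin, each with buy/sell/spot scalars.
path = os.path.join(tempfile.mkdtemp(), "2018-07-14.hdf")
with h5py.File(path, "w") as f:
    pp = f.create_group("2018-07-14T00:00:07/price-points")
    for coin, spot in [("BTC", 6350.0), ("ETH", 450.0),
                       ("XRP", 0.44), ("LTC", 78.0)]:
        g = pp.create_group(coin)
        g["buy"], g["sell"], g["spot"] = spot * 1.005, spot * 0.995, spot

with h5py.File(path, "r") as day:
    btc_spot = day["2018-07-14T00:00:07/price-points/BTC/spot"][()]
print(btc_spot)
```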
Deployment
The data is harvested via a Python 3 script using h5py (obviously) and twisted. Originally, I had written it as a “run once” job, executing once per minute as a k8s cronjob. Unfortunately, I found that the k8s scheduler didn’t perform as expected. Take a look at the timestamps from the original HDF:
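You can see the drift by diffing successive scrape timestamps. The values below are illustrative rather than the actual ones from that file:

```python
from datetime import datetime

# Illustrative scrape timestamps: each run lands ~7s later than the
# last, rather than on a fixed minute boundary.
stamps = ["2018-07-14T00:00:07", "2018-07-14T00:01:14",
          "2018-07-14T00:02:21", "2018-07-14T00:03:28"]
times = [datetime.fromisoformat(s) for s in stamps]
gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
print(gaps)  # each gap is 60s plus the ~7s job runtime
```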
The script takes several seconds to run, as I included some sleep statements to keep from tripping the rate limits. Accounting for that and container mount time, I can definitely see the total job execution time being around 7 seconds – the exact amount of time each job gets offset by. This makes me suspicious that the k8s cronjob type actually just sleeps between jobs rather than scheduling them at fixed times. Though for what it’s worth, I haven’t deep-dived the code to see if that’s actually what’s happening.
Regardless, a k8s cronjob wasn’t going to suit my needs, so I used twisted as a scheduler within a standard deployment. The (ugly-as-sin) code can be found here, as well as the Dockerfile.
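Roughly, the twisted version has this shape; scrape is a stub standing in for the real harvesting logic, and LoopingCall fires on a fixed cadence measured from its start time rather than sleeping between runs:

```python
from datetime import datetime, timezone

def scrape() -> str:
    """Stub for the real job: the actual script pulls the order books
    and writes a new timestamped HDF group. Here it just builds the
    group name."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")

if __name__ == "__main__":
    # Requires twisted. LoopingCall schedules each run at a fixed
    # multiple of the interval from its start, so runs don't drift.
    from twisted.internet import reactor
    from twisted.internet.task import LoopingCall
    LoopingCall(scrape).start(60.0, now=True)
    reactor.run()
```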