[Rdap] I give up

Joe Hourcle oneiros at grace.nascom.nasa.gov
Thu Jun 2 10:00:13 EDT 2016



On Wed, 1 Jun 2016, Mike Smit wrote:

> GitHub was my first thought as well, but GitHub caps regular files to
> <100MB, and files uploaded using its Large Files protocol to 2GB. I believe
> this is due to the challenge of versioning large files, as opposed to
> storage limits, so one could try uploading 1000 1GB files.  I suspect this
> would attract GitHub's attention in a not-very-positive way.  (git as a
> tool requires free storage space equal to used storage space, so 2TB of
> disk would be required: $60/month at market cloud rates).


Git itself has problems both with indexing large files and with handling a 
repo of significant size (i.e., larger than can fit in memory all at 
once).

You might have to break it into smaller chunks, and then create a separate 
repo for each one.
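
As a rough sketch of that chunking step (Python, with a made-up function 
name and a size cap you'd pick yourself, not anything git or GitHub 
defines), you could group files greedily into batches that each stay under 
a per-repo limit:

```python
def plan_repos(file_sizes, max_repo_bytes):
    """Greedily group files into batches whose total size stays under a cap.

    file_sizes: dict mapping file name -> size in bytes (hypothetical input).
    max_repo_bytes: per-repo size cap, chosen by you; an assumption here.
    Returns a list of batches, each a list of file names.
    """
    batches = []
    current, current_size = [], 0
    for name, size in sorted(file_sizes.items()):
        if size > max_repo_bytes:
            raise ValueError(f"{name} alone exceeds the per-repo cap")
        # Start a new batch when adding this file would exceed the cap.
        if current and current_size + size > max_repo_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    sizes = {"a.dat": 4, "b.dat": 4, "c.dat": 4}
    for i, batch in enumerate(plan_repos(sizes, max_repo_bytes=10)):
        print(f"repo-{i}: {batch}")
```

Each batch would then become its own repo.  This keeps files in name order 
rather than packing them optimally, which seemed the simpler trade-off.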



> I would suggest that for data generated by software, that the "right"
> approach is to release the software, ideally open-source, with appropriate
> documentation for re-generating that data.  Unless the data generation time
> period is measured in weeks, months, or years, that would require less
> resources over time than storing and serving large data files. CPU time is
> cheap compared to data transfer prices.

I would still store a subset of the generated data for validation (in 
case a problematic library or something similar causes the data to be 
re-generated incorrectly).
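
One way that validation could look, as a minimal sketch: record checksums 
for the stored subset, then compare the re-generated files against them. 
This assumes the generator is deterministic (byte-identical output); the 
function names and directory layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(reference_dir):
    """Map each file's relative path in the stored subset to its checksum."""
    root = Path(reference_dir)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_mismatches(manifest, regenerated_dir):
    """List reference files that are missing or differ after re-generation."""
    root = Path(regenerated_dir)
    bad = []
    for rel, expected in manifest.items():
        candidate = root / rel
        if not candidate.is_file() or sha256_of(candidate) != expected:
            bad.append(rel)
    return bad
```

If the output includes timestamps or floating-point noise, an exact 
checksum is too strict and you'd need a tolerance-based comparison instead.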

I've also heard of people using virtual machines for this, and then 
storing the whole VM ... but that gets into some messy issues if you're 
not using 100% open source, as you won't have the right to distribute it 
to others.  (the original site would have to maintain their software 
licenses so they could re-generate the data).


> Computer science is a pretty broad area - there are repositories, but they
> are more focused than the discipline (e.g. machine learning repositories).
>
> Has anyone built a data repository that distributes files using BitTorrent?
> That's the kind of thing a computer scientist would get excited about.

I thought about it, but when I looked into it, it just wouldn't fit the 
data we're serving:

 	http://opendata.stackexchange.com/q/283/263


-Joe


