1) How many pages are in your index?
2) How do you do indexing and retrieval? Do you build a word index by document and find documents that match all words in the query?
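To make question 2 concrete, here is a minimal sketch of the approach it describes: a word index mapping each word to the documents that contain it, with a query returning only documents that match all words. This is an illustration under assumptions, not a description of how this search engine actually works; the names `tokenize`, `build_index`, `search_all` and the sample documents are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    # Naive assumption: lowercase and split on whitespace.
    return text.lower().split()


def build_index(docs):
    # Map each word to the set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search_all(index, query):
    # Return ids of documents that contain every word in the query.
    words = tokenize(query)
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    # Intersect starting from the smallest set to keep intermediate results small.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


# Hypothetical sample documents, for illustration only.
docs = {
    1: "open web crawl data",
    2: "web search index",
    3: "crawl the web politely",
}
index = build_index(docs)
matches = search_all(index, "web crawl")
```

With these sample documents, `matches` contains documents 1 and 3, since both include "web" and "crawl". A real index would also need ranking, stemming, and on-disk storage, none of which is shown here.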
Is there a domain list if I wanted to crawl the hosts myself? I see you have the raw crawl data, which is appreciated, but a raw domain list would be cool.
I guess technically that could be arranged, although I don't want everyone to run their own crawler. It would annoy a lot of webmasters and lead to even more hurdles for anyone trying to run a crawler. Better to share the data if possible.