Ask HN: Can web scraping be the basis of a viable business model?
171 points by fforflo on Dec 18, 2021 | 111 comments
I'm a data engineer at heart, and I never did or enjoyed front-end work. Having said that I always was happy to code and evolve crawlers and web scrapers. Now I've taken some time off from work and gigs and I'm working on a side-project I've been hacking for some time.

Without getting into the details yet: it aims to make web data collection a little bit easier for non-devs. I'll soon have an MVP and will start pitching to investors: aiming for an open-source business model (after a few months of stealth development) and eventually a typical SaaS offering for extra functionality.

At this point I'm trying to consolidate and counter the steel-man counter-arguments I should expect from investors. The most obvious one: as one can imagine, the product is not magic and, after a certain point, it does require some manual work from the customer, hence this is an aspect I should prepare for.

I have done some preliminary analysis of the space of potential competitors (think import.io, Apify, Zyte/ScrapingHub, etc.) and described opportunities for differentiation. What I'm afraid of is getting sidetracked in a discussion of "um, this is web scraping and it's hard to make a business on top of it".

I understand that there's not much context now and one could easily say "well yeah, anything could be possible with a good team, product...", but I'm reaching out to the HN community to gather some considerations, mental models and pointers I may not think of myself at this point.



Google, a trillion dollar company, is essentially the world's largest web scraper. So...yes! You'll almost certainly find a way to monetize that.

Monopolies, lobbying and protectionism got in the way of keeping the web truly machine readable. There's tremendous value in restoring some of it.


> Google, a trillion dollar company, is essentially the world's largest web scraper.

Even just considering the parts of Google that it takes to bring you the N blue links part of the Google SERP, the web scraper is probably the least interesting and significant piece of technology in the stack. It's beyond reductive to say that Google is in essence a large web scraper, or a web scraper of any kind. It is like saying that a person is, in essence, the world's largest mouth.

I think it's incredibly difficult to build a profitable business in this space. The number of customers who a) need to scrape the web, b) aren't sophisticated enough to do it themselves and c) are big enough to make $$$ from is small. The important bit is always processing the web pages for whatever content is salient to the given customer. Which means that you need to deliver the web pages to them. So effectively your business is providing nothing more than URL lists and potentially some additional metadata compared to what the customer would get if they fetched the pages themselves. There are definitely some other complexities you could resolve, but it's hard to imagine that those benefits would be enough to build a business on.


why is Google's web scraping boring/easy? I'm sure it must be very sophisticated/optimized since they want it to be done often and cheaply


It is not boring, it is just not as interesting/difficult as the remainder of the stack.


> Monopolies, lobbying and protectionism got in the way of keeping the web truly machine readable.

Exactly and that ship has long since sailed. The good ship Web 3.0 (semantic web) launched in ‘99 and was a ghost ship until recently when it was boarded by crypto pirates now flying the web 3.0 flag.

> There's tremendous value in restoring some of it.

To this comment and OP, my startup is using web scraping to pre-populate machine-readable data for a DNS-based protocol called NUM [0]. So as others have said, whilst web scraping itself may be difficult to build into a viable business, it can be a key component of a viable business. Email in my profile if you want to discuss.

0. https://num.uk/blog/we-crawled-5m-uk-websites-and-published-...


Web3 (without the decimal point) these days usually refers to internet Ponzi schemes.


"boarded by crypto pirates now flying the web 3.0 flag." --Sad truth

The Semantic Web would have given much better search among other benefits.


And there's another entire ecosystem around scraping Google's results and wrapping it in an API.

Just scraping upon scraping.


> keeping the web truly machine readable. There's tremendous value in restoring some of it.

:-)


Web scraping is just one way to get the data to facilitate the sale and service of ads.


I worked at AboutUs.org for a while, and that’s what we did. Good news: it was fun and rewarding. In many ways it felt like a satisfying old-skool problem: scrape, find edge case, patch it, scrape again. We were scraping 100 million domains once a week with a team of six engineers, one UX (me), and Ward Fucking Cunningham as wiki expert. Ward in particular was great at prototyping solutions.

It is an arms race, since many people don’t want you to scrape. We tried hard to respect robots.txt, but we still got angry cease-and-desist emails from people who’d malformed or misconfigured the file.
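As a rough illustration of what "respecting robots.txt" can look like in code, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent and the example.com URLs are all made up for the example; a real crawler would fetch the file from each site and would still have to cope with the malformed files mentioned above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this per site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

AGENT = "example-bot"  # hypothetical user agent name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Check individual URLs before requesting them.
allowed_public = parser.can_fetch(AGENT, "https://example.com/about")
allowed_private = parser.can_fetch(AGENT, "https://example.com/private/data")

# Crawl-delay is a non-standard directive, but many sites set it.
delay = parser.crawl_delay(AGENT)

print(allowed_public, allowed_private, delay)
```

Checking `can_fetch` per URL and honoring any crawl delay will not stop every angry email, but it gives you something concrete to point to when one arrives.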

You will have a scale problem: it’s a lot of data. You’ll have parsing problems: live HTML is about the dirtiest data set I’ve ever seen. Refresh rate can be a major competitive advantage: how often can you scrape, store, diff, and report? These days you’ll need first-class JavaScript execution to catch dynamic content.
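The "scrape, store, diff, and report" loop can be sketched in a few lines. This is only an illustration under simple assumptions: pages are already fetched as strings, a change is detected by hashing whitespace-normalized HTML, and the function names (`fingerprint`, `diff_snapshots`) are invented for the example. Real pages usually need smarter normalization, since ads and timestamps change on every fetch.

```python
import hashlib


def fingerprint(html: str) -> str:
    """Hash the page after collapsing whitespace, so reformatting alone is ignored."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def diff_snapshots(previous: dict[str, str], current_pages: dict[str, str]):
    """Compare stored fingerprints (url -> hash) with freshly scraped pages (url -> html).

    Returns the new snapshot to store and a report of added/removed/changed URLs.
    """
    snapshot = {url: fingerprint(html) for url, html in current_pages.items()}
    report = {
        "added": sorted(snapshot.keys() - previous.keys()),
        "removed": sorted(previous.keys() - snapshot.keys()),
        "changed": sorted(
            url
            for url in snapshot.keys() & previous.keys()
            if snapshot[url] != previous[url]
        ),
    }
    return snapshot, report
```

Storing only fingerprints keeps the diff cheap; whether you also keep the full page history depends on what you need to report.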

But the biggest problem isn’t the scraping tech, it’s the use case: what use cases are you going to afford your early users? You don’t mention this in your post, and it will non-trivially affect what you scrape and how you report it. I’d encourage you to find users who have business problems that can be solved by paying money for scraping. Otherwise you’ll be another interesting open source tool that no one’s figured out how to monetize. Do this _before_ you talk to investors or take their money.


People make their own blogs and sites with React and other JavaScript frameworks all the time now... It won't even be actually dynamic content before you need to be able to execute JavaScript.


You've already identified some of your competitors, you should go in to more detail and try to answer:

* What features are common among my competitors?

* What features are unique?

* Who are target customers and users? Is there any overlap, or do some competitors target unique market segments?

This last question ties in to a discussion I was having with a friend recently. In B2B sales, your customers are businesses, but your users are people in those businesses with certain roles and responsibilities. Understanding the difference is key, because you will often need to develop your sales and marketing strategies based on the business/customer profile, but your UX will depend on the needs of the users within those businesses.

In my opinion you are more likely to be successful if you can get an initial foothold in a market by identifying a specific target of customers and users, solving their use case very well, developing a moat, and then growing out from that foothold to provide a wider set of options. Web scraping is just a tool. You need to find businesses who can gain value from scraping or from scraped data. Are there businesses who, for whatever reason, would not be able to adopt one of your competitors' products, or would find that adoption difficult? Maybe you could specialise in scraping a particular kind of data, or providing a full-stack solution for companies with limited in-house technical expertise (like some kind of consulting, you hop on a call with the client, they tell you what they want to scrape, and you set up a hosted solution which provides a SQL or Excel interface to the data).

In short, successful product development is all about understanding customer and user pain and needs. If you can find pains or needs which are a common theme for a particular demographic of companies and roles, you can work with those people to understand their problems and make a product which is very valuable to them.


Thanks a lot for your comment.

You're exactly right on the tricky relationship between developers as internal ambassadors to businesses - customers. I think it somehow applies to almost any developer-facing tool.

Your recommendation to focus on a specific vertical at first makes sense too. Helps prioritising the backlog as well.


Yes, although I would encourage you to think about something higher on the value chain than raw data feeds. Those exist and have become an increasingly difficult market to compete in. You can buy a custom feed for like $250/mo.

Instead, think about what people want to do with the data. For example, if you are going to scrape diamond prices, don’t try to sell that feed. Set up a website with a UI so people can research diamond prices, and get alerts when specific thresholds are met or items come in stock. Monetize with ads.


Like camelcamelcamel.com I guess?


Exactly. The data is interesting, but a commodity. If the market is big enough, figure out what they’re using it for and help them use it. Sell access to processed data.


Sure, it can be! Also, as some people have already pointed out, this is often a gray area where people go beyond violating ToS. Some good examples are privacy violations (scraping personal data), credentials stuffing etc.

Recently there has been a boom of "anti-bot" services. These are essentially SaaS businesses that "protect" websites from being scraped by automated software. As you onboard the first customer who wants to extract data from a bot-protected website, you are going to run into an unlimited waterfall of stupid troubles. Your bots will be blocked, will consume excessive amounts of data, and will kill your CPU/GPU performance.

I have shared some highlights on how to bypass these recently on HN [1], but it is sadly only the tip of the iceberg. On the other hand, since the post was featured on HN I have been contacted by more than 50 companies and individuals whose business operating model is based solely on data extraction/automated scraping. These are (in my opinion) successful companies, and two of them are part of YC.

[1] https://news.ycombinator.com/item?id=29060272


Wasn't there a ruling that web scraping was legal now?


The LinkedIn case, it's still up in the air I think - https://news.bloomberglaw.com/us-law-week/supreme-court-scra...


Thanks for this!


I co-founded https://packetstream.io

There are a handful of companies doing very well with models similar to what you’re describing. I can’t mention specific customers, but I see some of them doing very large scraping volume through our network.

It’s an industry where having a good product is more important than the amount you’re spending on marketing. If developers are happy with your product they’ll take it with them to future companies/projects and share it with colleagues.

It can be a cat-and-mouse development cycle where the sites you target break your functionality and you’ll have customers that will want fixes to be implemented ASAP because they rely on your tools to make money.

I don’t know what you’re building exactly, but keep in mind there’s a good chance that you’ll need to commit to long-term, continuous, rapid development cycles if you want to retain customers.

Best of luck!


My dude, how do you actually use your service (loading URLs)?

I just signed up, got sent to a dashboard…NO API DOCUMENTATION, just a link to download an app (!) for people who want to sell their residential bandwidth.

See ScraperAPI for a company that does API documentation well in this space. (I've spent well over $100K with them.) Or Stripe. PUT THE CODE UP FRONT.

There's also no API docs here, under "Support": https://packetstream.io/support/faq

Update: I see it's hidden under "Network Access." This kind of thing should be OBVIOUS, not the sixth option (looks like all of the others) on some random SaaS dashboard.


Did it back in the old days, scraping stock quotes to build a database for display by our Java app and web services. Called NetProphet, it would do a score of trend lines etc as overlays.

I wrote the scraping code. Had a list of sites and macros for extracting quotes, updated every day to every customer. If one quit working (the site attempted to prevent scraping) the app would use another and give a notice back to me. I'd tweak the macro for that site, and we'd be back scraping it the next day.
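The pattern described here (a list of sites, one extraction macro per site, automatic fallback plus a notice when one breaks) could be sketched roughly as follows. This is not the original NetProphet code; the site names, page snippets and regular expressions are invented, and pages are passed in as strings rather than fetched.

```python
import re
from typing import Callable, Optional

Extractor = Callable[[str], Optional[float]]


def regex_extractor(pattern: str) -> Extractor:
    """Build a per-site 'macro': a regex whose first group captures the quote."""
    compiled = re.compile(pattern)

    def extract(page: str) -> Optional[float]:
        match = compiled.search(page)
        return float(match.group(1)) if match else None

    return extract


def first_working_quote(pages: dict[str, str], extractors: dict[str, Extractor]):
    """Try each site in order and return (site, price, broken_sites).

    broken_sites lists the sites whose macro found nothing, so a human can be
    notified and tweak that macro.
    """
    broken = []
    for site, extract in extractors.items():
        price = extract(pages.get(site, ""))
        if price is not None:
            return site, price, broken
        broken.append(site)
    return None, None, broken


# Hypothetical sites: site_a changed its layout, site_b still matches.
EXTRACTORS = {
    "site_a": regex_extractor(r"Last:\s*(\d+(?:\.\d+)?)"),
    "site_b": regex_extractor(r"class='px'>(\d+(?:\.\d+)?)<"),
}
PAGES = {
    "site_a": "<div>Access denied</div>",
    "site_b": "<span class='px'>101.25</span>",
}

site, price, broken = first_working_quote(PAGES, EXTRACTORS)
print(site, price, broken)
```

Keeping the macros as data rather than code is what makes the "tweak it and be back the next day" turnaround plausible.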

We eventually hired a finance student (Josh Hatwich, now a fellow at Adobe) to parse a Comstock satellite feed we put on the roof. That ended the era of scraping at StockPoint.


Yes, but scraping is a small part of the overall puzzle. As developers, we overestimate how valuable tools are (as opposed to solutions). I think the better opportunity is not to be another scraping-as-a-service provider, but to niche down to a solution that uses your scraping technology.


If you start yet another scraping-as-a-service provider, you're attempting to provide a paid service for people who just want to steal content for free. Not gonna work.


Scraping as a service can be profitable if you target people who are looking for leads to cold email/spam. Lead gen is one of the few areas you can easily charge $100 as your entry level package and at that price it's very easy to make money if you have half decent marketing.


There are legit use cases for browser automation, but a ton of bad actors too. I think your point illuminates something that many fail to fully consider: What are all the ways my service/software can be abused?


Some major issues from my experience web scraping:

1. Changes in data structures. If some site randomly decides to alter the format of the JSON/XML objects for its frontend API, it may break your scraper and anything that relies on that scraper’s output.

2. Security controls like rate limiting, CAPTCHAs, IP blacklisting, auth systems.

3. HTML which is rendered via complicated client-side JavaScript blobs or web sockets. You’ll need a headless browser driven by something like Selenium, plus some site-specific parsing logic.

4. Legal issues.
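For the first issue in the list above, one cheap defense is to check every scraped record against the shape you expect, so a format change fails loudly instead of silently corrupting downstream data. A minimal sketch, assuming a made-up schema (`EXPECTED_FIELDS`) and made-up payloads:

```python
import json

# Hypothetical schema for the records a scraper expects from a site's frontend API.
EXPECTED_FIELDS = {"id": int, "title": str, "price": float}


def schema_problems(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems


# The payload as the scraper was written for it, and after a hypothetical site change.
old_payload = json.loads('{"id": 1, "title": "Widget", "price": 9.99}')
new_payload = json.loads('{"id": "1", "name": "Widget", "price": 9.99}')

print(schema_problems(old_payload))
print(schema_problems(new_payload))
```

This strict type check is deliberately simple (an integer price would be flagged, for instance); the point is only that the scraper should notice the change before its consumers do.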


Web scraping is a legal gray area in many or most jurisdictions. In some jurisdictions, depending on the ToS of the web site itself, scraping it might be illegal. In others republishing the scraped information in any form might be illegal. In others still you might not be allowed to use the scraped data for any commercial purpose.

"But what about Google?" Google is worth 100 billion dollars and can play by completely different rules than scraping startups.


Yeah, I was going to say something like this. And one step better than Google, but still challenging, there are also a lot of sites out there that prohibit scraping but make an exception for "general purpose search engines". So if you're providing a specialized scraper that's used for specific purposes, you'd likely run into problems there too.

As well as the actual legal challenges, you'll also have the perception thereof, which could make businesses wary of relying on your services.


Google also plays nice. I manage a few sites that get hammered by scrapers to the extent that it causes big spikes in CPU, something that doesn't happen with G.


So many people don't even think about the number of requests and often want the job to end as quickly as possible. For me, that is short-term thinking: I've seen request rates as low as 4 per second cause performance issues for a company. If you want to scrape long term, you need to think in requests per minute. The exception is a mega site, where it's a matter of keeping a low request rate per second.
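Thinking in requests per minute is easy to enforce in code. Below is a minimal sketch of a throttle that spaces requests evenly; the class name `Throttle` and its parameters are invented for the example, and the clock and sleep functions are injectable only so the behavior can be checked without actually waiting.

```python
import time


class Throttle:
    """Space out requests so that at most `requests_per_minute` are sent."""

    def __init__(self, requests_per_minute: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next request may start

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay


# Usage sketch: call throttle.wait() before every request to the same host.
# throttle = Throttle(requests_per_minute=12)
# for url in urls:
#     throttle.wait()
#     fetch(url)  # hypothetical fetch function
```

A per-host throttle like this is the simplest version; backing off further when the site starts returning errors or slowing down is the natural next step.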


>Google is worth 100 billion dollars

$100B from $2T*. :-)


From what I know, that's how Skyscanner started (not sure if it's very popular in the US, but it is in Europe). Now they're paying for data or have deals for it, but they started out scraping the hell out of airline sites.


Hey! Hopefully this comment comes across as helpful rather than hurtful. I'm also the founder of a developer tool, and it's hard to raise money for! (I know you said yours isn't strictly a developer tool, but you mentioned open source so I think it's fair to assume it falls in this category.)

I think you're missing a step, which is where _you_ answer if web scraping can be a viable business model. You're attempting to convince VCs with logic (and a few assumptions), but there's an easier (or harder) way to do it... convince them by making money.

Most VCs aren't ideologues, and don't have an opinion about business models. They will be convinced if you simply show them you're making money. It's not their job to decide if an idea can make money or not; that's your job as a founder.

I applied to YC twice. The first time we spent the 10 minutes talking about if it could make money or not, and never got anywhere good. We got rejected. The second time we were making money, the conversation was smoother, and we got in. It's so much easier to be able to replace "I believe" with "our customers believe". It changes the conversation completely. You don't need to be making billions of dollars; just enough to show that people want what you're making!

tl;dr You're trying to convince VCs when you need to be convincing customers!

(For the record, a lot of what I said here is very money-driven and that's not how I build my company. However, in the context of VCs, which are purely financially driven, it's how you should be thinking about it.)

Good luck, and let me know if I can help! My email is in my profile if you want to talk!


You're not being hurtful. :) I'll send you an email too, but because others are raising this:

Good point, and it's a potential trap of tools built by developers for developers.

There's this fine line between finding it useful and being willing to pay for it. Being honest with myself: Do I find this useful? Yes. Does it make my web scraping easier? Yes. Would I pay for it? Maybe... but the problem is I've never paid for a scraping tool either.

But other people may pay for it. To get to that point though I'll probably need a few more hands on deck, who need salaries and for that you need external funds.


Great! Looking forward to talking :)

If I were you, I'd do everything possible to answer your initial question (is web scraping a viable business) before bringing other people and their money into it. Everything gets much, much harder once other people's time and money is involved (both the process of convincing them, AND the potential process of realizing you were wrong).

There's lots of ways to build a business, so take my advice with a grain of salt. But especially if you're worried about it, I'd highly recommend focusing on getting to a few paying customers before involving VCs or employees. If you're here asking that question, you can be sure they will be too (minus the initial conviction you have).

(The one exception is a co-founder! It's lonely starting a company, so you should definitely consider trying to find a co-founder that you love working with.)


ClearBit and Plaid haven't been mentioned yet – both examples of multi-million/billion dollar businesses built on the back of scraping.

The more specialized you can get, the better your chances of success imo. Generic web scrapers are a dime a dozen.


Scraping services on their own are a viable business product, but the power to assign metadata and contextualization is where the unicorn lies.

The service itself will always be in flux because of how freeform hypertext is as a schema. So many other comments here reflect that better than I could.

The fact is that any chunk of data you're handing over to clients still needs to be handled by their team and in my experience reality often falls short of expectations. If you can somehow deliver them something cleaner (or even something that can help them reach conclusions faster), then you have a product with a high value prop.


I went for an interview once at a hedge fund. There were a surprising amount of questions about web scraping. I very much got the feeling it was an active and ongoing problem. So yes I do think there’s a business in there.


AFAIK Plaid does a fair bit of old school scraping behind the scenes - a lot of the "nice new web" is built on the backs of old kludgy websites doing the same scraping things people were doing 20 years ago.

Finance/banks are especially... inconsistent, to say the least.


Having recently worked for a hedge fund, I noticed that too, although my idea predated that engagement. That's probably my first go-to market (if anyone has leads, send me an email).

Hedge funds actually call this "alternative data".


If you worked for a hedge fund and you know they purchase scraped data from quite a number of vendors then why are you asking if it’s a viable business model?

Of course it is.


There was a company in my city that did something like this. They didn't survive.

They crawled the data, but also had a services component to do something with the data. Like, they had contracts with pharma companies to search for indications that a page was selling counterfeit drugs.

I'm not sure of the exact details of why they didn't make it.

Also, I'm thinking about Recorded Future (https://www.recordedfuture.com). They do something like this -- again, the mechanics of scraping, and a services component for analysis.


There are many data providers serving the financial services/capital markets sector that provide everything from raw data feeds to insights based off of web-scraped data. What will be important for you is to identify a good niche, understand which part of the data value chain your customer needs, and deliver the data in a way that fits their workflow.

The value is in the decisions that can be made based on the data being sold rather than the method at which you extract it from. If you focus on the value of the data and who needs it, you’ll likely find a viable business model.


I don't wish to hijack this thread, but I've been pondering a similar question. I've been working on a product that requires a very large amount of data that, as far as I can tell, can only be gathered by scraping (real estate data - even data vendors like estated.com don't have stuff like sales data).

Many, many websites contain legal language that forbids automatic data collection/scraping. How can a business be built in such a case?

Perhaps OPs tool only scrapes a select few sites that don't prohibit scraping, but that seems like the exception, not the norm.


Do it manually. I wonder what “automatically” or “scraping” means legally. It’s pretty hard to enforce those requirements, because I assume they’re being broken by search providers.

If it’s a derivative work like copilot, I wonder if there’s a legal case to say you can’t do it. I assume you’re doing something like an RSS feed for pricing suggestions with commissions? I just looked this up and it seems like it’s legal to do so but their information is copyrighted. https://law.stackexchange.com/questions/15556/is-scraping-re...


Read up on LinkedIn vs. HiQ. As long as that ruling holds (and it might not), the tl;dr is: If it's on the open web, you can scrape it. You might be violating some Terms of Service (that you never agreed to), but you're not violating (US) law.

If it's NOT on the public web - e.g. it's behind a login, then you can be sued, as you'll have had to explicitly agree to Terms of Service during your account creation and you'll then be in explicit violation of that ToS.


There is a lot of competition when it comes to building a pure-play data scraping company. There are also various regulatory concerns about scraping various types of data--PII like phone numbers, biometric data like images, or data about concert ticket prices.

But I think there is a huge opportunity in scraping data and then doing something interesting with it. Google is the most obvious example of this type of company. But, for example, certain CRM companies are more about data scraping than working with user-provided data.


There is always value in having a store of data that is unique or differentiated in some way. The trick is figuring out what that is and who might be interested.

For a while I ran product for a social monitoring company and our traditional user base was brands, agencies, etc. who would use our giant database of public content to do market research, etc. At various points we would get inbound requests from someone with a unique ask - I recall:

- a military historian working on a government grant who wanted to analyze the social media activities of various militias in a particular part of the world

- several pharma companies looking for adverse drug reaction reports online

- hedge funds looking for deep sentiment trends in particular areas for perception of certain businesses

- some company looking to find properties where women made announcements that they were pregnant.

And then there’s always the requests for X but in Y language/country. “There’s a Twitter like service in Bangladesh, can you get that data?”.

All of these people had money to spend and specific interests - we couldn’t help most of them as the economics didn’t work out in terms of building a scalable business, but if you can find a niche and run things lean, there’s a real long tail of opportunity there.


Hi, I am the CEO of http://webautomation.io, the largest marketplace for no-code web scrapers. I can tell you firsthand that the market you are going after is very big and getting bigger every day. We get thousands of businesses/people signing up every month. So my advice is not to worry, as I am sure you will find a viable business with your proposed product.


Do you guys have an in-house team to deal with legal troubles?


Web-scraped data is worth X.

Indexed & web-scraped data is worth Y.

A searchable index of web-scraped data is worth X^Y.


Are you falling in love with the code?

I think you'll have a lot more luck finding 2-3 initial customers before you try to raise money. It's always easier to explain what your product does, and who the target market is, in terms of actual customers; instead of hypotheticals.

Remember, the goal is to build a business. If you fall in love with the code, it's too easy to build something you enjoy working on, but has limited commercial value.


Am I falling in love with the code? I'd say we're on the 2nd date .... I may even get lucky. But it's enough to make me hang up on recruiters (respectfully).

Good point though. Bootstrapping is a viable option too, but fundraising has certain non-financial benefits that can be appealing.


I'm not suggesting you bootstrap; I'm suggesting you find and cater to a few customers before you fundraise.

It's very hard to fundraise without some kind of market validation.


Never mind business, a web-scraping command-line utility as comprehensive and easy to use as say curl would be something. I would even pay for that.


Are you looking for output that’s more structured than curl’s or for a way to run curl on silly sites that block you?


My only advice would be to have a backup plan (redundancy). I.e., design a basic version that works and doesn't get blocked as a bot, then design another one. This will save you from a situation like the one described below, where your original method stops working but your client wants the data now (because they pay for it). And be nice: don't take data you don't need. Keep it easy on servers.


Scraping isn't the problem, it's getting access to the pages to scrape at high-enough volume, at low-enough cost.


And to answer your question, because my other post was a question itself:

> after a certain point it does require some manual work from the customer

Once you gain traction, you can become a platform, intermediary between customers and engineers that will fine tune scraping to what the customer needs. This could either be some sort of "Solution Engineer" that the company hires, or open it up to outside developers that get paid per integration (either by you or by the customer, or both). There's a solution to every problem.

As far as the business itself, I think you could be on to something. Of course, ideas are cheap and it's the execution that counts, but here's how I'd think about it: with scraping, every website on the web has an API. Before, only 0.1% of websites had an API.

And certainly wouldn't hurt to change the "scraping" word – such an ugly word.


Is Google not just one massive web scraper?


I am the CEO of https://serpapi.com.

Please consider helping us support the EFF actions. Outside the obvious vested interests of our business, I truly believe scraping the web is a force of good and progress. And the EFF work in ensuring web scraping stays a legal practice in the United States has been outstanding. [1]

[1] https://www.eff.org/deeplinks/2021/07/eff-ninth-circuit-rece...


I guess that's SERP API. In Spanish, ser papi means "to be a daddy".


Yes, SerpApi. Not the Spanish one!


Not sure if this company survived the pandemic, but check out Applaudience. It was crawling seat-level data from event websites.

https://www.screendaily.com/features/how-uk-data-company-app...

Applaudience’s algorithms trawl through every exhibitor website, looking at every showtime of every film, and tracks the auditorium layout as each seat flips from available (unsold) to unavailable (sold).


It is valuable. There's a lot of Robotic Process Automation, competitive analysis etc...

>the product is not magic and, after a certain point, it does require some manual work from the customer, hence this is an aspect I should prepare for.

Can you make it magic or maybe develop end to end solutions for your first "customers" using your product? That sounds like the schlep you need to do.

Sounds promising! Find yourself a customer or two!

If you really want to go the open source route, just focus on that and then see if people pick it up and use it. Then you'd offer the SaaS.


I saw a project here on HN that was basically a webscraper for non developers that aimed to make it easy to scrape data from various sites. What some warned the project did not do is warn you that you could potentially be banned from some services by using bots and scrapers. I forget the project name but once someone warned it could potentially have your facebook or whatever banned I decided not to try it out. I would warn users be very careful about the TOS for each site you decide to scrape if you are logged in as a user.


Ultimately people want information from scraping, because to a lot of businesses, the information is what's valuable.

Consider, apart from a tool for general purpose scraping, what information a specialized scraper might obtain for a valuable but underserved industry that can profit from the data.

I have a buddy that scrapes data specifically for the tanker/shipping industry, for example.

General purpose scraping will involve a lot of competition and a bit of an arms race. Niche scraping lets you fly a little below the radar.


Pitch your investors the VPN business. That's what web scraping is at scale - a series of networking techniques that allow users to do what they want without being blocked.


A good thing to ask yourself is whether there's anything left out there that's both accessible (legally and technically) and worth scraping at the same time.

Search engines deliberately wiped out personal websites, blogs, small news organisations etc, and spammers drowned out the remaining real user generated content from the www. Social media ate the forums and closed the doors.

Websites now aren't websites but businesses and they don't like people snooping around


It will help to recognize the key use-cases and provide lots of support out of the box like pre-built scrapers for price comparison, social media mentions (or other analysis), whatever you find that people will pay for.

Make sure your pricing is clear so the profit calculation for the customer is transparent.

You then have a tangible product line you can pitch to investors regardless of whether they can appreciate the more abstract solution/platform.


If your potential customers are willing to pay to scrape data, why aren't they willing to pay for the data from the source directly? Is it not available, or is it considered exclusive or proprietary? I'm thinking about the lawsuits around deep linking and TicketMaster. Web scraping at scale is a never-ending arms race because designs evolve or the host is actively trying to thwart you.


So for new companies the data (which a lot of the time IS publicly available) costs a lot of money, i.e. API access or charging per request. A lot of the time companies start with scraping and then, once they have more customers (and the data access price can be shared amongst them), they switch to paying for it.


I like these two questions! They can crystallise things.

Wasn't aware of the Ticketmaster v. Tickets.com case. Will have to print it along with the SCOTUS LinkedIn v. hiQ ruling.


Maybe only kinda sorta relevant, but I interned at RefME for a summer - basically a Zotero competitor. There was a lot of value (for users) in scraping web pages to autofill the author, title, date, etc. so it could generate the references, something I started to work on (but then they got bought out, so I'm not sure what ended up happening).

Anyway, I could've imagined this being a paid pro feature down the line.
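
For anyone wondering what that kind of autofill looks like, here is a minimal sketch (not RefME's actual method, which I can't speak to). It assumes the page exposes the common `citation_*` meta tags or Open Graph tags; plenty of pages don't, and then you're back to heuristics. The names `CitationMetaParser` and `extract_reference` are made up for illustration.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collects <meta name/property=... content=...> pairs and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            content = attrs.get("content")
            if key and content:
                # Keep every value: a page may list several authors.
                self.meta.setdefault(key.lower(), []).append(content.strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_reference(html):
    """Return a dict with title, authors and date, using None/[] when absent."""
    parser = CitationMetaParser()
    parser.feed(html)
    meta = parser.meta

    def first(*keys):
        # First available value among the given meta keys, in priority order.
        for key in keys:
            if meta.get(key):
                return meta[key][0]
        return None

    return {
        "title": first("citation_title", "og:title") or parser.title.strip() or None,
        "authors": meta.get("citation_author") or meta.get("author") or [],
        "date": first("citation_publication_date", "article:published_time"),
    }
```

The fallback order (scholarly tags, then Open Graph, then the plain `<title>`) is just one reasonable choice; a real product would need per-site overrides and manual correction.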


I cofounded a company based on scraping academic publications. We ended up getting a lot of traffic (millions of pageviews per month) because we had good SEO, but it’s not necessarily a defensible business model by itself.

You’ll likely have to do more manual data cleaning than you expect, and get some amount of pushback from the sources you’re crawling (depending how commercially valuable the data is).


Web scraping can be a viable business. It depends on what you're scraping and who your customers are.

Are there a couple thousand people who would pay for a SaaS offering? Then it's a business. The real goal would be identifying a hair on fire problem that you are in a unique position to solve. That's always the problem, and it has nothing to do with web scraping in particular.


I did so for years, scraping university press releases, obscure government data (like the Federal Register), jail/prison rosters, Reddit posts/comments/users, etc. Companies like LexisNexis have been doing it for far longer than the digital world, offering it as a clipping service. If you can find a niche, it can be a regular subscription income stream.


We tried https://techcrunch.com/2021/06/14/supreme-court-revives-link...

(I was CTO of hiQ (technically still am I think))

contact me if you want to chat about scraping danbmil99 at gmail


I’m curious how you deal with JavaScript that will load other pages including other JavaScript documents that cannot be loaded until the first set of JavaScript is executed. I’ve played with the chromium web driver a few times but it seems to be tricky to implement in a completely headless environment.


Sometimes you need a custom driver, or to take control of a real browser (and I mean a real browser, not via WebDriver); headless is easily detected.


For most use cases, headless chromium will work out of the box. For the rest, set $DISPLAY to a virtual framebuffer like Xvfb
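
A rough sketch of both options, assuming a Chromium binary named `chromium` is on your PATH (the name varies by distro; `chromium-browser` and `google-chrome` are common) and that `xvfb-run` is installed for the second one. The function names and the debugging port are illustrative choices, not anything standard.

```python
import subprocess


def headless_dump_dom_cmd(url, binary="chromium"):
    """argv for headless Chromium that prints the serialized DOM to stdout."""
    return [binary, "--headless", "--disable-gpu", "--dump-dom", url]


def xvfb_browser_cmd(url, binary="chromium", debug_port=9222):
    """argv for a regular (non-headless) browser on a virtual display.

    xvfb-run starts Xvfb and sets $DISPLAY for the child process; -a picks a
    free display number. The browser just opens the page; you would drive it
    through the remote debugging port with a client of your choice.
    """
    return ["xvfb-run", "-a", binary, f"--remote-debugging-port={debug_port}", url]


def render(url, binary="chromium", timeout=30):
    """Run headless Chromium and return the DOM it dumps as a string."""
    result = subprocess.run(
        headless_dump_dom_cmd(url, binary),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

Something like `render("https://example.com")` gives you the DOM after the initial scripts have run. Pages that keep loading more scripts afterwards usually need a driver that can wait for a specific element instead of a one-shot dump.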


If you can apply the scraping to a use case and sell that use case, you have a better chance at a viable business model. Examples that come to mind: BuiltWith (scrapes sites and publishes the list of technologies they use), Ahrefs (scrapes sites and finds outgoing/incoming links), etc.


If you’re scraping someone else’s data, do you know what its copyright status is? Have you made deals with the original sources that permit you to use their data? How will you deal with lawsuits and the constant blocking of your scrapers?


And what if you're scraping data that's... already being scraped? :D


I think yes, web scraping can give a viable product. Now, with the ability to scrape React/JS pages and the availability of 24GB free-tier cloud machines and transformer models, I think it should be possible, at least for the next 5 years!


Don't know if this gets to the heart of your question, but I was surprised to not have seen https://www.scrapingbee.com/ mentioned here yet.


An online law editorial I used to work at around 10 years ago basically scraped freely available content from different sources around the web, repackaged it, and sold it online via a subscription model. They're still around.


You might want to look at web scraping for data scientists. I am trying to build an ML model for NSFW text detection in multiple languages and I am not looking forward to scraping p*rn and YouTube websites for comments.


Could you expand on this a little? What’s the problem and what are you looking for as a solution?


Humans submit NSFW text content to platforms that are not intended to host such content (like chats in video games or reviews for products). It is typically too expensive for a company to hire humans to review all text content, so they may want an ML model that can help them identify the unwanted text.


You probably need NSFW text, which is its own niche and should be straightforward to collect?


Web scraping on demand: https://zvelo.com/ in the service of adtech, ofc.

Re: URL Database for Brand Safety & Contextual Targeting


Instead of competing head-on with ScrapingHub/etc, you could ask yourself “why companies doing X pay them” and sell a specific product for that niche.


In my opinion you should start with defining your value proposition and target market, i.e. how do I create value for customers, and who are my customers?


Isn't one big issue that any website that has data worth scraping has ToS that disallow scraping? e.g. craigslist?

Would think this is a legal concern...


Also it's very easy to get blacklisted if you scrape too often or too quickly. You definitely have to be "polite" about it.
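
A minimal sketch of what "polite" can mean in practice: a per-host delay plus a robots.txt check, using only the standard library. The 2 second default, the `example-bot` user agent, and the names `PoliteThrottle` and `allowed_by_robots` are assumptions for illustration; the right delay depends on the site.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PoliteThrottle:
    """Enforces a minimum delay between requests to the same host."""

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock      # injectable so the throttle can be tested
        self._sleep = sleep
        self._last_hit = {}      # host -> time of the last request

    def wait(self, url):
        """Block until the host may be hit again; return seconds slept."""
        host = urlsplit(url).netloc
        waited = 0.0
        last = self._last_hit.get(host)
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_hit[host] = self._clock()
        return waited


def allowed_by_robots(robots_txt, url, user_agent="example-bot"):
    """Check a URL against the text of a robots.txt file you already fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Call `throttle.wait(url)` right before each request. Respecting robots.txt and slowing down won't stop every ban, but it removes the most obvious reasons to get one.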


What about government/municipality data?

We paid for that once, surely it shouldn't be necessary to charge for it again?


Can breaking terms of service result in anything more than being banned from the site you are scraping?


It is for openexchangerates.org

Makes the founder a hefty sum.


Yes! Everybody scrapes everybody. The difference lies in what they do with the data afterwards.


Another example is Zillow which scrapes public records. Public records have some advantages.


Of course. See Yodlee and Plaid.


Yodlee, for certain.


Yes if you point it at a well defined target market and create a solution.


Scraping user content or recipes should be fine.


Yes.


Point them to the disgusting but successful practices of ClearviewAI. That should do it.


Please, see https://www.datallog.com/

They are a profitable startup and have an SDK solution for scaling web scraping bots across many business fields.

I'm an advisor/investor in Datallog. You can connect with the founder at https://www.linkedin.com/in/joelder-maragno-arcaro/



