That may have been the cry until earlier this month, when the latest appeals court decision held that web scraping doesn’t break anti-hacking laws. LinkedIn lost its two-year legal battle with a private company that it had blocked from its site for allegedly stealing publicly available data from its website.
How Does It Affect My Organization?
The most susceptible organizations appear to be those with memberships and sites that allow access to proprietary information. Web scrapers can sign up, pay a fee, if necessary, and harvest any information available.
In addition, any site that requests personal information, from Facebook and Amazon to Craigslist, YouTube, and Wikipedia, is apparently vulnerable to web scrapers, who can purchase scraping software on the web.
Some experts suggest that the bots used in web scraping can even extract non-public data from sites, especially those that compare prices, goods and other things. Sometimes bots make too many requests without pausing. This can overload the server and cause a denial of service that shuts off access for all users.
If It’s Legal, What’s the Harm?
Most of the bots used in web scraping gather data that can be used to benefit their users. This includes price comparisons, research, product data, web content, and customer/sales leads, to name a few.
What Can I Do?
Stopping bots and web scrapers may not be 100% possible, but there are some things you can do to decrease the odds or minimize the damage.
Using CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) can help. These are the puzzles or pictures you sometimes encounter on sites that ask you to identify certain objects before allowing you to complete your query.
Using video, PDFs and images is not only useful to visitors but can stump web scrapers. Most bots are looking for text and miss this content. Consider requiring visitors to log into your site. It may not stop a web scraper, but the requirement will force the bot to enter information that will likely enable you to track down the source.
Talk with your IT experts about possibly blocking computers whose requests come in much faster than any individual could make them. However, also be aware that some VPNs and proxy servers may show all traffic coming from the same address, so legitimate visitors behind them could also be blocked. If so, are you willing to risk that?
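To make that conversation with IT more concrete, here is a minimal sketch of one common way to do it: count recent requests per address and refuse any that exceed a limit within a short time window. The class name `SlidingWindowLimiter`, the thresholds, and the sample address are all assumptions chosen for illustration, not a recommendation for your site, and a real deployment would more likely rely on the rate-limiting features of your web server or firewall.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`.

    Hypothetical helper for illustration; the default limits are placeholders.
    """

    def __init__(self, max_requests=20, window_seconds=10.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client address -> timestamps of recent requests

    def allow(self, client_ip):
        """Return True if the request is within the limit, False if it should be refused."""
        now = self.clock()
        hits = self._hits[client_ip]
        # Forget requests that have aged out of the time window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    # Five back-to-back requests from one (documentation-range) address,
    # with an illustrative limit of three requests per second.
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=1.0)
    print([limiter.allow("203.0.113.7") for _ in range(5)])
```

Because the count is kept per address, everyone who shares one address through a VPN or proxy shares one allowance, which is exactly the trade-off raised above. Setting the limit well above what a single person could plausibly click reduces, but does not remove, that risk.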
As obvious as it is, don’t post anything on your website that you don’t want copied or repurposed. If you’re with an organization where information about members is available to other paid members, this is a fact of life you have to deal with.
The only real preventive measure you can take is to implement deeper vetting and screening of membership applications and require those joining to adhere to privacy rules that prohibit the commercial collection and use of member information. Of course, there isn’t much you can do after the fact if someone violates the restriction, but you’ll at least know the source and be able to take appropriate action.
Because of different rulings involving similar cases in two other courts, many observers believe this issue may be taken to the U.S. Supreme Court. Stay tuned!