Web-scraping AI bots cause disruption for scientific databases and journals
Automated programs gathering training data for artificial-intelligence tools are overwhelming academic websites.
In February, the online image repository DiscoverLife, which contains nearly three million photographs of different species, began receiving millions of hits to its website every day, far more than its usual traffic. At times, the surge was heavy enough to slow the site to the point of being unusable. The culprit? Bots.
These automated programs, which attempt to ‘scrape’ large amounts of content from websites, are increasingly becoming a headache for scholarly publishers and researchers who run sites hosting journal papers, databases and other resources.
Those who run such sites say the bots generate a huge “volume of requests” to access a website, “which is causing strain on their systems. It costs money and causes disruption to genuine users.”
A flood of bots
Internet bots have been around for decades, and some have been useful. For example, Google and other search engines have bots that scan millions of web pages to identify and retrieve content. But the rise of generative AI has led to a deluge of bots, including many ‘bad’ ones that scrape without permission.
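For context, the difference between a ‘good’ and a ‘bad’ bot often comes down to whether it honours a site’s robots.txt file, which tells crawlers which pages they may fetch. Below is a minimal, illustrative sketch in Python (standard library only; the site address and user-agent name are placeholders, not tied to any site mentioned here) of the check a well-behaved crawler performs before requesting a page; aggressive scrapers simply skip it.

    # Minimal sketch of a polite crawler consulting robots.txt before fetching.
    # The URL and user-agent below are illustrative placeholders.
    import urllib.robotparser

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url("https://example.org/robots.txt")   # placeholder site
    robots.read()                                       # download and parse the rules

    page = "https://example.org/species/12345"          # placeholder page
    if robots.can_fetch("ExampleResearchBot/1.0", page):
        print("Allowed to fetch", page)
    else:
        print("robots.txt disallows fetching", page)

Crawlers that ignore this file, or that fetch pages far faster than a human visitor would, are the ones site operators describe as abusive.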
This year, the BMJ, a publisher of medical journals based in London, has seen bot traffic to its websites surpass that of real users. The aggressive behaviour of these bots overloaded the publisher’s servers and led to interruptions in services for legitimate customers, says Ian Mulvany, BMJ’s chief technology officer.
A survey by the Confederation of Open Access Repositories (COAR) found that bot scraping is widespread among its member repositories, and that roughly two-thirds of those affected had experienced service disruptions as a result. “Repositories are open access, so in a sense, we welcome the reuse of the contents,” says Kathleen Shearer, COAR’s executive director. “But some of these bots are super aggressive, and it’s leading to service outages and significant operational problems.”