Report: AI start-up Anthropic is said to be accessing data without permission



AI startup Anthropic has been accused of aggressively collecting data from websites to train its AI systems, potentially violating publishers’ terms of service, the British daily Financial Times reported.



Companies like Anthropic and OpenAI train their large generative AI language models on massive amounts of data from a variety of sources. Anthropic’s AI chatbot Claude, which rivals OpenAI’s ChatGPT, can respond to a range of natural language prompts. Founded by a group of former OpenAI employees, Anthropic states that its goal is to “responsibly develop and maintain advanced AI for the long-term benefit of humanity.”


But the San Francisco-based company does not always seem to live up to that claim, at least according to Matt Barrie, CEO of Freelancer.com, an online job board where millions of freelancers offer their services. According to the Financial Times, Barrie accused Anthropic of being “the most aggressive scraper” his web portal has ever seen.

According to the report, other web publishers have also accused Anthropic of collecting data from their websites and ignoring their instructions to stop. Freelancer.com received 3.5 million visits within four hours from a “web crawler” associated with Anthropic, the Financial Times wrote, citing data available to it. Barrie told the newspaper that Freelancer.com first tried to refuse the crawler’s access requests using the standard web protocols for controlling crawlers, yet the visits continued to increase. He then decided to block traffic from Anthropic’s Internet addresses entirely.

“We had to block them [Anthropic] because they do not follow the rules of the Internet,” Barrie said. “This is serious scraping that slows down the site for everyone working on it and ultimately hurts our revenue.” Anthropic said it is investigating the matter.

Kyle Wiens, managing director of iFixit.com, a website that offers repair instructions, made similar allegations to the Financial Times. The site received one million hits from Anthropic bots within 24 hours. iFixit’s terms of service prohibit the use of its data for machine learning, Wiens said. “My first message to Anthropic is: If you use this data to train your models, that’s illegal. My second is: This is not polite internet behavior. Crawling is a matter of courtesy.” Websites use a file called robots.txt to tell crawlers and other web robots which content they should stay away from. This Robots Exclusion Standard specifies who may automatically browse which parts of a website, but it relies on crawlers choosing to comply. It has become a frequent subject of conflict in the age of AI chatbots like ChatGPT.
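
To illustrate how robots.txt is meant to work, here is a minimal sketch of the check a well-behaved crawler would run before fetching a page, using Python's standard `urllib.robotparser` module. The crawler name `ExampleAIBot`, the rules, and the example.com URLs are hypothetical and chosen only for illustration; they are not taken from Freelancer.com, iFixit, or Anthropic.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. "ExampleAIBot" and the paths are
# illustrative assumptions, not rules published by any site in this article.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The named crawler is excluded from the whole site.
    print(is_allowed("ExampleAIBot", "https://example.com/guides/phone-repair"))  # False
    # Any other crawler may read public pages ...
    print(is_allowed("OtherBot", "https://example.com/guides/phone-repair"))  # True
    # ... but not the /private/ area.
    print(is_allowed("OtherBot", "https://example.com/private/data"))  # False
```

Nothing in this check is enforced by the server: a crawler that skips it can still request the pages, which is why site operators fall back on blocking IP addresses, as Freelancer.com reportedly did.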

Data scraping is not a new practice, but it has increased dramatically in the last two years as a result of the AI arms race. “Search engines have always done a lot of scraping, but the training of generative AI has taken it to a whole new level,” says Barrie. Leading AI companies are competing to develop increasingly powerful and sophisticated language models and require huge amounts of data to do so. This also raises questions about copyright and the use of data for training models. Companies like OpenAI or X repeatedly collect data for AI training without asking. The head of Microsoft AI, Mustafa Suleyman, recently claimed that there is a social contract that allows the use of content on the internet, including for AI training. He faced a lot of opposition.

Companies are defending themselves in different ways. Reddit has begun blocking various search engines and their web crawlers unless they sign a licensing agreement with the online platform. The legal dispute between the American daily newspaper The New York Times and OpenAI is drawing a lot of attention. The newspaper has accused OpenAI of violating copyright law by using thousands of articles to train its language models, and thus building a business at the newspaper’s expense. It insists on compensation. In May, OpenAI reached an agreement with News Corp., one of the world’s largest publishing houses, which includes papers such as The Wall Street Journal, The New York Post, The Sunday Times and The Daily Telegraph. OpenAI has secured access to all the content of the affiliated newspapers. Other media companies, such as the Reuters news agency, are now licensing their content for AI training.


(AKN)
