A nondescript city street, a freshly mowed field, and a parked armoured vehicle were among the sample images we chose to challenge Large Language Models (LLMs) from OpenAI, Google, Anthropic, Mistral and xAI to geolocate.
Back in July 2023, Bellingcat analysed the geolocation performance of OpenAI and Google’s models. Both chatbots struggled to identify images and were highly prone to hallucinations. However, since then, such models have rapidly evolved.
To assess how LLMs from OpenAI, Google, Anthropic, Mistral and xAI compare today, we ran 500 geolocation tests, with 20 models each analysing the same set of 25 images.
Our analysis included older and “deep research” versions of the models, to track how their geolocation capabilities have developed over time. We also included Google Lens to compare whether LLMs offer a real improvement over traditional reverse image search. While reverse image search tools work differently from LLMs, they remain one of the most effective ways to narrow down an image’s location when starting from scratch.
The Test
We used 25 of our own travel photos to test a range of outdoor scenes, both rural and urban, with and without identifiable landmarks such as buildings, mountains, signs or roads. These images were sourced from every continent, including Antarctica.
The vast majority have not been reproduced here, as we intend to continue using them to evaluate newer models as they are released. Publishing them here would compromise the integrity of future tests.
Each LLM was given a photo that had not been published online and contained no metadata. All models then received the same prompt: “Where was this photo taken?”, alongside the image. If an LLM asked for more information, the response was identical: “There is no supporting information. Use this photo alone.”
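The same prompt-and-image setup is straightforward to reproduce programmatically. Below is a minimal sketch of a single test, assuming the OpenAI Python SDK; the model name and file path are illustrative, and other providers’ APIs differ in their details.

```python
# Minimal sketch of one geolocation test, assuming the OpenAI Python SDK.
# The file path and model name are illustrative.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the test photo (already stripped of metadata) as a data URL
with open("test_photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # swap in each model under test
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Where was this photo taken?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```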
We tested the following models:
| Developer | Model | Developer’s Description |
| --- | --- | --- |
| Anthropic | Claude Haiku 3.5 | “fastest model for daily tasks” |
| Anthropic | Claude Sonnet 3.7 | “our most intelligent model yet” |
| Anthropic | Claude Sonnet 3.7 (extended thinking) | “enhanced reasoning capabilities for complex tasks” |
| Anthropic | Claude Sonnet 4.0 | “smart, efficient model for everyday use” |
| Anthropic | Claude Opus 4.0 | “powerful, large model for complex challenges” |
| Google | Gemini 2.0 Flash | “for everyday tasks plus more features” |
| Google | Gemini 2.5 Flash | “uses advanced reasoning” |
| Google | Gemini 2.5 Pro | “best for complex tasks” |
| Google | Gemini Deep Research | “get in-depth answers” |
| Mistral | Pixtral Large | “frontier-level image understanding” |
| OpenAI | ChatGPT 4o | “great for most tasks” |
| OpenAI | ChatGPT Deep Research | “designed to perform in-depth, multi-step research using data on the public web” |
| OpenAI | ChatGPT 4.5 | “good for writing and exploring ideas” |
| OpenAI | ChatGPT o3 | “uses advanced reasoning” |
| OpenAI | ChatGPT o4-mini | “fastest at advanced reasoning” |
| OpenAI | ChatGPT o4-mini-high | “great at coding and visual reasoning” |
| xAI | Grok 3 | “smartest” |
| xAI | Grok 3 DeepSearch | “advanced search and reasoning” |
| xAI | Grok 3 DeeperSearch | “extended search, more reasoning” |
This was not a comprehensive review of all available models, partly due to the speed at which new models and versions are currently being released. For example, we did not assess DeepSeek, as it currently only extracts text from images. Note that in ChatGPT, regardless of which model you select, the “deep research” function is currently powered by a version of o4-mini.
Gemini models have been released in “preview” and “experimental” formats, as well as dated versions such as “03-25” and “05-06”. To keep the comparisons manageable, we grouped these variants under their respective base models, e.g. “Gemini 2.5 Pro”.
We also compared every test with the first 10 results from Google Lens’s “visual match” feature, to gauge the difficulty of the tests and the usefulness of LLMs in solving them.
We ranked all responses on a scale from 0 to 10, with 10 indicating an accurate and specific identification, such as a neighbourhood, trail or landmark, and 0 indicating no attempt to identify the location at all.
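Encoded as data, the scale looks roughly like the sketch below. Only the endpoints and the “nearby country” band (which appears in the Singapore test later in this piece) come from our rubric; the intermediate bands are hypothetical illustrations of the gradient.

```python
# Illustrative encoding of the 0-10 scoring scale. Only the 0, 3 and 10
# bands are described in this piece; 5 and 8 are hypothetical examples.
RUBRIC = {
    0: "no attempt to identify the location",
    3: "nearby country",
    5: "correct country",          # hypothetical intermediate band
    8: "correct city or region",   # hypothetical intermediate band
    10: "accurate and specific: neighbourhood, trail or landmark",
}
```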
And the Winner is…
ChatGPT beat Google Lens.
In our tests, ChatGPT o3, o4-mini, and o4-mini-high were the only models to outperform Google Lens in identifying the correct location, though not by a large margin. All other models were less effective when it came to geolocating our test photos.
We scored 20 models against 25 photos, rating each from 0 (red) to 10 (dark green) for accuracy in geolocating the images.
Even Google’s own LLM, Gemini, fared worse than Google Lens. Surprisingly, it also scored lower than xAI’s Grok, despite Grok’s well-documented tendency to hallucinate. Gemini’s Deep Research mode scored roughly the same as the three Grok models we tested, with DeeperSearch proving the most effective of xAI’s LLMs.
The best-scoring models from Anthropic and Mistral lagged well behind their current rivals from OpenAI, Google and xAI. In several cases, even Claude’s most advanced models identified only the continent, while others were able to narrow their responses down to specific parts of a city. The latest Claude model, Opus 4, performed at a similar level to Gemini 2.5 Pro.
Here are some of the highlights from five of our tests.
A Road in the Japanese Mountains
The photo below was taken on the road between Takayama and Shirakawa in Japan. As well as the road and mountains, signs and buildings are also visible.
Gemini 2.5 Pro’s response was not useful. It mentioned Japan, but also Europe, North and South America and Asia. It replied:
“Without any clear, identifiable landmarks, distinctive signage in a recognisable language, or distinctive architectural styles, it’s very difficult to determine the exact country or specific location.”
In contrast, o3 identified both the architectural style and signage, responding:
“Best guess: a snowy mountain stretch of central-Honshu, Japan—somewhere in the Nagano/Toyama area. (Japanese-style houses, kanji on the billboard, and typical expressway barriers give it away.)”
A Field on the Swiss Plateau
This photo was taken near Zurich. It showed no easily recognisable features apart from the mountains in the distance. A reverse image search using Google Lens did not immediately lead to Zurich. Without any context, identifying the location of this photo manually could take some time. So how did the LLMs fare?
Gemini 2.5 Pro stated that the photo showed scenery common to many parts of the world and that it could not narrow it down without additional context.
By contrast, ChatGPT excelled at this test. o4-mini identified the “Jura foothills in northern Switzerland”, while o4-mini-high placed the scene “between Zürich and the Jura mountains”.
These answers stood in stark contrast to those from Grok Deep Research, which, despite the visible mountains, confidently stated the photo was taken in the Netherlands. This conclusion appeared to be based on the Dutch name of the account used, “Foeke Postma”, with the model assuming the photo must have been taken there and calling it a “reasonable and well-supported inference”.
An Inner-City Alley Full of Visual Clues in Singapore
This photo of a narrow alleyway on Circular Road in Singapore provoked a wide range of responses from the LLMs and Google Lens, with scores ranging from 3 (nearby country) to 10 (correct location).
The test served as a good example of how LLMs can outperform Google Lens by focusing on small details in a photo to identify the exact location. Those that answered correctly referenced the writing on the mailbox on the left in the foreground, which revealed the precise address.
While Google Lens returned results from across Singapore and Malaysia, part of ChatGPT o4-mini’s response read: “This looks like a classic Singapore shophouse arcade – in fact, if you look at the mailboxes on the left you can just make out the label ‘[correct address].’”
Some of the other models noticed the mailbox but could not read the address visible in the image, falsely inferring that it pointed to other locations. Gemini 2.5 Flash responded, “The design of the mailboxes on the left, particularly the ‘G’ for Geylang, points strongly towards Singapore.” Another Gemini model, 2.5 Pro, saw the mailbox but focused instead on what it interpreted as Thai script on a storefront, confidently answering: “The visual evidence strongly suggests the photo was taken in an alleyway in Thailand, likely in Bangkok.”
The Costa Rican Coast
One of the harder tests we gave the models to geolocate was a photo taken from Playa Langosta on the Pacific coast of Costa Rica, near Tamarindo.
Gemini and Claude performed the worst on this task, with most models either declining to guess or giving incorrect answers. Claude 3.7 Sonnet correctly identified Costa Rica but hedged with other locations, such as Southeast Asia. Grok was the only model to guess the exact location correctly, while several ChatGPT models (Deep Research, o3 and the o4-minis) guessed within 160km of the beach.
An Armoured Vehicle on the Streets of Beirut
This photo was taken on the streets of Beirut and features several details useful for geolocation, including an emblem on the side of the armoured personnel carrier and a partially visible Lebanese flag in the background.
Surprisingly, most models struggled with this test: Claude 4 Opus, billed as a “powerful, large model for complex challenges”, guessed “somewhere in Europe” owing to the “European-style street furniture and building design”, while Gemini and Grok could only narrow the location down to Lebanon. Half of the ChatGPT models responded with Beirut. Only two models, both ChatGPT, referenced the flag.
So Have LLMs Finally Mastered Geolocation?
LLMs can certainly help researchers to spot the details that Google Lens, or they themselves, might miss.
One clear advantage of LLMs is their ability to search in multiple languages. They also appear to make good use of small clues, such as vegetation, architectural styles or signage. In one test, a photo of a man wearing a life vest in front of a mountain range was correctly located because the model identified part of a company name on his vest and linked it to a nearby boat tour operator.
For tourist areas and scenic landscapes, Google Lens still outperformed most models. When shown a photo of Schluchsee lake in the Black Forest, Germany, Google Lens returned it as the top result, while ChatGPT was the only LLM to correctly identify the lake’s name. In urban settings, by contrast, LLMs excelled at cross-referencing subtle details, while Google Lens tended to fixate on larger, similar-looking structures, such as buildings or ferris wheels, which appear in many other locations.
Heat map showing how each model performed on all 25 tests
Enhanced Reasoning Modes
You would think that turning on “deep research” or “extended thinking” functions would have resulted in higher scores. However, on average, Claude and ChatGPT performed worse. Only one Grok model, DeeperSearch, and one Gemini model, Gemini Deep Research, showed improvement. For example, o3 was shown a photo of a coastline and took nearly 13 minutes to produce an answer that was about 50km north of the correct location. Meanwhile, o4-mini-high responded in just 39 seconds and gave an answer 15km closer.
Overall, Gemini was more cautious than ChatGPT, but Claude was the most cautious of all. Claude’s “extended thinking” mode made Sonnet even more conservative than the standard version. In some cases, the regular model would hazard a guess, albeit hedged in probabilistic terms, while with “extended thinking” enabled for the same test, it either declined to guess or offered only vague, region-level responses.
LLMs Continue to Hallucinate
All of the models, at some point, returned answers that were completely wrong. ChatGPT was typically more confident than Gemini, often leading to better answers, but also to more hallucinations.
The risk of hallucination increased when the scene was temporary or had changed over time. In one test, for instance, a beach photo showed a large hotel and a temporary ferris wheel (installed in 2024 and dismantled over winter). Many of the models consistently pointed to a different, more frequently photographed beach with a similar ride, despite clear differences.
Final Tips
Your account and prompt history may bias results. In one case, when analysing a photo taken in Coral Pink Sand Dunes State Park, Utah, ChatGPT o4-mini referenced previous conversations with the account holder: “The user mentioned Durango and Colorado earlier, so I think they may have posted a photo from a previous trip.”
Similarly, Grok appeared to draw on a user’s Twitter profile and past tweets, even without explicit prompts to do so.
Video comprehension also remains limited. Most LLMs cannot search for or watch video content, cutting off a rich source of location data. They also struggle with coordinates, often returning rough or simply incorrect responses.
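Because models often return rough coordinates, it is worth measuring how far off a guess actually is rather than taking it at face value. Below is a minimal sketch of that check in Python; the coordinates are illustrative, roughly Tamarindo versus Playa Langosta from the Costa Rica test.

```python
# Minimal sketch: great-circle (haversine) distance between a model's
# guessed coordinates and a reference point, to measure how far off
# an answer is. The coordinates below are illustrative.
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~6371 km

# e.g. a guess of Tamarindo vs. the actual beach at Playa Langosta
print(f"{haversine_km(10.2993, -85.8371, 10.2813, -85.8460):.1f} km off")
```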
Ultimately, LLMs are no silver bullet. They still hallucinate, and when a photo lacks detail, geolocating it will still be difficult. That said, unlike our controlled tests, real-world investigations typically involve additional context. While Google Lens accepts only keywords, LLMs can be supplied with far richer information, making them more adaptable.
There is little doubt that, at the rate they are evolving, LLMs will continue to play an increasingly significant role in open source research. And as newer models emerge, we will continue to test them.
Infographics by Logan Williams and Merel Zoet