Conclusion
Several broader conclusions can be drawn from this test case. The two models that drew on a curated database of experimental literature, NotebookLM and a custom-built tool, performed better than the LLMs trained on unfiltered Internet data. In particular, the output of models relying on open web sources tended to mix established theories with highly speculative ones.
The LLMs evaluated (accessed December 2024) also showed weaknesses in temporal and contextual understanding. For example, they often failed to recognize that a proposed hypothesis was later disproved. They also often omitted relevant papers if those papers did not explicitly include the exact language used in the initial query.
Our results broadly highlight the need for LLMs to better understand tables and images, because scientific papers rely heavily on these formats. Two of the models consistently referenced images, but more often relied on image captions rather than visual analysis. Enhanced visual reasoning capabilities, such as image, plot, and scale bar interpretation, are the main directions for future improvement.


