TOWARDSAI.NET
Llama 3.2 Vision Review
Author(s): Barhoumi Mosbeh Originally published on Towards AI. Ollama multi-modalOllama has just announced its official support for the Llama 3.2 Vision models. The Llama 3.2 Vision models come in two sizes: 11 billion and 90 billion parameters. In this article, I will provide an overview of their performance in different cases, along with my personal thoughts on them.The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.Also the model supports many languages:For text only tasks, English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Note for image+text applications, English is the only language supported.EvaluationI will use the approach from this paper for evaluation, which actually argues that vision-language models are blind. It tests models on several tasks:VLMs are BlindResearch showing that vision language models (VLMs) fail on simple visual tasks that are easy for humans.vlmsareblind.github.ioTask 1: Line IntersectionsTask 2: Two CirclesTask 3: Circled LetterTask 4: Overlapping ShapesTask 5: Nested SquaresTask 6: Counting GridTask 7: Subway MapI will choose the most challenging parts from each task and see if the model can get any of them right. Additionally, I will run the code around 1020 times to calculate the average of accurate responses and identify any false results.Task 1: Line IntersectionsPrompts:How many times do the blue and red lines touch each other? Answer with a number in curly brackets, e.g., {5}.Count the intersection points where the blue and red lines meet. Put your answer in curly brackets, e.g., {2}.The blue and red lines intersect at 2 points. Therefore, they touch each other twice. {2}The blue and red lines touch each other at three points. Therefore, the answer is {3}.Success rate > 60%Task 2: Two CirclesPrompts:Are the two circles touching each other? Answer with Yes/No.Are the two circles overlapping? Answer with Yes/No.**Step 1: Identify the Circles**The image contains a blue circle and a red circle.**Step 2: Determine if the Circles are Touching Each Other**To determine if the circles are touching each other, we need to check if they intersect at exactly one point or not. If they do not intersect at all, then they are not touching. If they intersect at more than one point, then they are overlapping but not necessarily touching.**Step 3: Check for Intersection Points**Upon examining the image closely, it appears that the two circles intersect at exactly one point. This means that the circles are indeed touching each other.**Conclusion:**YesThe image depicts a diagram with two circles, one red and one blue. The red circle is positioned on the left side of the diagram, while the blue circle is located on the right side.To determine whether the two circles are touching each other, we need to analyze their positions and sizes. Upon closer inspection, it becomes apparent that the two circles do not overlap or touch at any point. Instead, they appear to be separated by a small gap between them.Therefore, based on this analysis, **No**, the two circles are not touching each other.Success rate > 80%Task 3: The circled letterPrompts:Which letter is being circled?Which character is being highlighted with a red oval?**Answer:** The letter "n" is highlighted in red and encircled.Success rate: 100%Counting overlapping shapesPrompts:How many {shapes} are in the image? Answer with only the number in numerical format.Count the {shapes} in the image. Answer with a number in curly brackets e.g. {3}.The model answered with either 6, 7, or 8 but never gave the correct number, which is 9. And guess what? Even GPT-4 cant do it!Success rate: 0%Following single-colored pathsPrompts:How many single-colored paths go from A to C? Answer with a number in curly brackets, e.g., {3}Count the one-colored routes that go from A to C. Answer with a number in curly brackets, e.g., {3}.To find the number of single-colored paths from point A to point C, we need to count the distinct paths that only use one color throughout their entire length.**Step 1: Identify the colors used on the graph**Looking at the graph provided, we can see that there are four different colors used:* Red* Blue* Green* PurpleHowever, upon closer inspection, it becomes apparent that not all these colors are utilized in creating paths from A to C. We focus only on those colors that lead directly or indirectly from A to C without interruption.**Step 2: Determine the relevant paths**Upon examining the graph closely for paths using a single color from A to C:* There is one path going directly from A to B (using red) and then continuing in another segment. {* Another path leads from A to D and then to C, but this involves two colors.Given this analysis, only one path utilizes a single color continuously from A to C.**Conclusion**Therefore, the number of single-colored paths from point A to point C is1}.Answer: {1}.To find the number of single-colored paths from point A to point C, we need to analyze the given graph and count the distinct paths that meet this criterion.**Step 1: Identify Single-Colored Paths**Single-colored paths are those where all edges belong to a single color. In this context, a path is considered single-colored if it consists entirely of red edges.**Step 2: Count Distinct Red Edges from A to C**From point A to point C, there are two distinct red edges that form a direct path:- One edge connects A directly to C.- Another set of edges forms a longer path through intermediate nodes but ultimately leads back to the same conclusion regarding single-color paths.**Step 3: Consideration of Alternative Paths**Upon closer inspection, it becomes apparent that while there are multiple paths from A to C, only two of these paths consist entirely of red edges. This is because any path involving more than one color would not be considered "single-colored."**Conclusion**Given the analysis above, the correct answer reflects the count of single-colored paths (all red) from point A to point C. {Answer:2}Success rate > 60%Both GPT-4o and Claude 3.5 Sonnet failed in this task.After playing with this model for a while, I dont see a significant difference; it can be compared to GPT-4o and Sonnet-3. However, its still not as good as Gemini-1.5 Pro, which I believe might be the best VLM right now. Keep in mind that this is just the 11B version of the model. If you have enough resources to test the 90B version, simply replace 11B with 90B in the following code.Try it yourselfInstall Ollama!curl -fsSL https://ollama.com/install.sh | shStart ollamaimport osget_ipython().system = os.system!ollama serve &Pull the model!ollama pull llama3.2-vision Try itimport ollamadef extract_document_info(image_path): response = ollama.chat( model='llama3.2-vision', messages=[{ 'role': 'user', 'content': "your question about the image" ), 'images': [image_path] }] ) return response#exampleimage_path ="/kaggle/input/image.jpg"result = extract_document_info(image_path)print(result["message"]["content"])Thats it! This was a quick review of LLaMA 3.2 Vision (11B version). Its actually good at OCR too, I tested it on some really challenging images. The 90B version is better, of course, but it requires substantial resources to run. Once I find a good resource, Ill attach it!llama3.2-vision:90bLlama 3.2 Vision is a collection of instruction-tuned image reasoning generative models in 11B and 90B sizes.ollama.comVLMs are BlindResearch showing that vision language models (VLMs) fail on simple visual tasks that are easy for humans.vlmsareblind.github.ioJoin thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming asponsor. Published via Towards AI
0 Σχόλια
0 Μοιράστηκε
22 Views