• How to read LLM benchmarks
    uxdesign.cc
    And why you shouldn't trust them blindly
    [Image source: Anthropic's Claude 3.5 Sonnet blog post]
    Disclaimer: The opinions stated here are my own, not necessarily those of my employer.
    Every once in a while, there's an announcement about a new model. It's always better than its predecessor, and it's also better than all of the other frontier models on the market. The announcement comes with a table that looks like this:
    The high-level message delivered here is "we are better than everyone else at almost everything." But how exactly is this claim made? What do these numbers mean? Can you take them at face value? Let's break it down.
    Why Benchmarks
    Imagine you're selling a car and you want to claim that it's the best car on the market. However, potential buyers look for different features: some want the safest car, while others want the fastest. To convince the broadest audience to choose your car, you compare it against competitors using universally understood data: safety rating, fuel efficiency, 0-to-60 time, etc.
    LLM benchmarks serve a similar purpose. They are standardized tests and datasets designed to evaluate the performance of models across various tasks. They provide metrics and criteria to compare different models, ensuring consistency and objectivity in assessments.
    How Benchmarks Work
    Each benchmark evaluates a capability that the LLM might be used for. HumanEval, for example, tests the model's ability to write code. It consists of a set of 164 programming challenges (e.g., finding a substring within a large string) and uses unit tests to check the functional correctness of the generated code.
    Another example is reasoning, which can be defined in different ways. For the purposes of benchmarking, it's defined as the ability to answer hard, complex questions that require step-by-step deduction and data analysis. Here's an example: "Two quantum states with energies E1 and E2 have a lifetime of 10^-9 sec and 10^-8 sec, respectively. We want to clearly distinguish these two energy levels. Which one of the following options could be their energy difference so that they be clearly resolved?" This question will sound hard unless you're a physicist (or you enjoy devouring books about quantum mechanics for some reason). It comes from the GPQA benchmark, which has 448 such questions across different fields. Models receive a score based on how many questions they answer correctly.
    Other tests include language understanding (MMLU) and math problem solving (MATH). They are similar tests with other types of questions. But within each test, every model is evaluated against the same set of questions. This is how consistency is maintained (not unlike the idea of humans taking standardized tests).
    CoT & Few-shot
    [Image: Zooming into the same Claude table]
    Few-shot (like 3-shot) refers to the number of examples given to the model to help it understand the task. 0-shot means no examples were given. CoT refers to Chain-of-Thought, where the model is asked to explain its reasoning process. CoT and examples can help improve response quality for certain tasks, which is why they are highlighted separately in benchmark results. Here's an example I got from ChatGPT to explain CoT:
    The problem with these Benchmarks
    Lack of transparency
    We don't know how a model was trained. We don't know how the benchmark tests were run. So how can we say for sure that the model was not trained on the testing data? This issue is called contamination, and it is a common problem in machine learning. The sheer amount of data that LLMs are trained on has made it extremely hard to detect or avoid this problem. It's the human equivalent of finding out all the questions before the day of the exam.
    Do these exams really measure ability or intelligence?
    It's likely that a significant portion of benchmark results can be explained by the ability of LLMs to memorize vast amounts of data. Are they really more capable than humans just because they got a better score? Here's another example: there was news last year of ChatGPT acing the LSAT. It's a pretty impressive feat, except when you think about (a) the fact that the LSAT usually contains questions from previous years and (b) how LSAT questions from previous years are freely available all over the internet. If ChatGPT aced the LSAT after seeing the questions before the test, would you still replace your lawyer with it?
    How to choose for yourself
    If you're a developer or a team trying to use AI in your product, you need to construct your own evaluation. This evaluation needs to be focused on the use cases that matter to you. The dataset needs to be customized based on your requirements.
    An example
    If you want to automate customer service for your business, build a dataset of questions that your customers might ask. Then, build a system to prompt any LLM with these questions and score the answers it gives you. Run this exercise across the models you are considering, and then make your choice.
    If you're an individual user deciding whether to switch between Claude and ChatGPT, you can build a set of your most commonly used prompts ("write a cover letter", "generate an image of", etc.) and compare the different responses before making your decision. Even if the results aren't statistically significant (unless you ask a lot of questions and repeat this process multiple times), it's a controllable system that can explain your decision making. IMO, it's much better than opaque benchmarking processes run by companies that are selling models to you.
    Further reading on running evaluations can be found here. Better writing on the problems with benchmarks can be found here and here.
    Please consider subscribing to my substack if you liked this article. Thank you!
    How to read LLM benchmarks was originally published in UX Collective on Medium, where people are continuing the conversation by highlighting and responding to this story.
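    The customer-service evaluation described in the article (build a question set, prompt each candidate model, score the answers, compare) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's method: the evaluation set, the keyword-based scoring rule, and the stand-in functions `model_a` and `model_b` are all hypothetical. In practice each stand-in would wrap a call to a real LLM, and scoring would likely need human review or a stronger metric than keyword matching.

```python
from typing import Callable, Dict, List

# Hypothetical evaluation set: each case pairs a customer question with
# keywords a good answer is expected to mention. Replace with your own data.
EVAL_SET: List[dict] = [
    {"question": "How do I reset my password?",
     "expected_keywords": ["reset", "email"]},
    {"question": "What is your refund policy?",
     "expected_keywords": ["refund", "30 days"]},
]


def keyword_score(answer: str, expected_keywords: List[str]) -> float:
    """Return the fraction of expected keywords found in the answer."""
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)


def evaluate(model: Callable[[str], str], eval_set: List[dict]) -> float:
    """Average keyword score of one model over the whole evaluation set."""
    scores = [
        keyword_score(model(case["question"]), case["expected_keywords"])
        for case in eval_set
    ]
    return sum(scores) / len(scores)


# Stand-in "models" (assumptions for illustration): plain functions that
# return canned text. A real setup would call an LLM here instead.
def model_a(prompt: str) -> str:
    return ("You can reset it from the link we email you. "
            "Refunds are available within 30 days.")


def model_b(prompt: str) -> str:
    return "Please contact support."


def compare(models: Dict[str, Callable[[str], str]],
            eval_set: List[dict]) -> Dict[str, float]:
    """Score every candidate model on the same questions."""
    return {name: evaluate(model, eval_set) for name, model in models.items()}


results = compare({"model_a": model_a, "model_b": model_b}, EVAL_SET)
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

    The key design point mirrors the benchmarks themselves: every model sees exactly the same questions and is scored by the same rule, so the numbers are comparable, but here the questions are ones you chose because they matter to your use case.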
    0 Comments · 0 Shares · 142 Views
  • Google Maps' Best Feature Is About to Get a Lot Less Useful
    lifehacker.com
    Earlier this year, Google announced a big change for the Timeline feature in Google Maps, which happens to be one of its best features. While it can currently be accessed via Google Maps on the web as well as through the mobile apps, soon it will be locked to individual phones and tablets (and unless you transfer it over, all your existing data will be wiped).
    The deadline for the switch seems to vary between users. Google told TechRadar that people would have "approximately six months" from when they were first notified about the changeover to get their data transferred, but 9to5Google reports many Timeline users are seeing a new deadline of June 9, 2025. If you're affected, your best option is just to open up Google Maps on your phone and see what it says.
    While I'm glad the Timeline feature is sticking around in some form, and I understand the privacy and security benefits of this data being stored on devices rather than in the Google cloud, I'm sad to see the web interface going away.
    Taking a trip with Google Timeline
    [Image: The web interface gives you overviews of months and years. Credit: Lifehacker]
    If you're new to Timeline, here's how it works: The feature basically logs everywhere you go, automatically, using location information from devices linked to your Google Account. Previously, you could later access this travel history via the web using your Google account. I get why that's a privacy concern for people (perhaps the reason Google is going to lock this data locally to specific devices), but I'm prepared to trust Google to keep my comings and goings safely hidden from anyone else, because of how useful and interesting I find the feature.
    I often use it to retrace the steps of past vacations, recalling the places we stopped at and the sights we saw. I also use it to look up bars, restaurants, and coffee shops I've liked in the past. Sometimes I'll take a random dip into history, to see what I was doing this day last year, or this day five years ago. Everything is mapped out: Not just places, but also journeys by plane, train, or automobile (or foot).
    In some ways it works like a journal I don't need to remember to update, making entries multiple times each day. I can look back on when I last saw friends in particular places, or last visited certain countries. What's more, Timeline keeps track of the cities I've seen and the miles I've covered, almost like fitness stats but for travel.
    Timeline will continue to offer this on phones and tablets, but there will be no syncing across multiple devices. I won't be able to load up my maps and my travels on a big screen, and when it comes to viewing the map, looking up individual places, editing information, and scrolling through dates, it's all much easier with a trackpad and keyboard.
    The web interface is currently also the only way to see an overview of an entire month, or an entire year; the Google Maps app only shows data for one specific day at a time. I'll no longer be able to see the red dots of the three-week-long coast-to-coast drive my brother and I did across the United States, or see how 2018 travel compared to 2019 travel, or see how much ground I covered in January. Timeline will be less fun, and less interesting.
    Going mobile with Timeline
    [Image: There's plenty to explore in the mobile app too. Credit: Lifehacker]
    While I'm disappointed the Timeline web interface is going away (as I'm sure are many people who use it for revisiting past trips, planning future trips, figuring out travel expenses, and whatever else), the consolation is that Timeline will continue on mobile, in much the same way as it does currently.
    There are even a few nice bonuses in the app version that aren't available on the web, or at least not in the same convenient form. You can see all of the places you've stopped at while Timeline has been active, sorted by category and by place, so you can dig into all the hotels you've stayed at or all the attractions you've seen, or get a list of every place you've seen in one particular city.
    Then you've got an Insights tab that gives you a breakdown of your travel activities: How much walking, driving, and flying you've done, for example, and how much time in total you've spent outside the home and the office. This is broken down by month, and you get a little highlights summary as well.
    The Day tab doesn't give the comprehensive map overview that you get on the web, but at least it's something: Load it up and you can see the places you went, how you got there, and extra information such as the amount of time you spent walking and driving. As on the web, you can add in places that were missed, as well as delete or edit entries.
    If you've never tried Timeline, or you're getting messages asking if you want to continue using it on your phone, I'd recommend giving it a go: it's just about my favorite Google Maps feature. As with other types of data in your Google Account, you can delete your Timeline history whenever you want, or have it automatically wiped after a certain time (once it's more than three months old, for example).
    0 Comments · 0 Shares · 120 Views
  • This iPhone 15 Pro Max Is Less Than $900
    lifehacker.com
    We may earn a commission from links on this page. Deal pricing and availability subject to change after time of publication.
    The Apple iPhone 15 Pro Max took home PCMag's Editors' Choice and Best of the Year 2023 awards in the iOS phone category. And if you don't mind getting a pre-owned device, you can snag its renewed premium model (meaning it has been inspected, tested, and certified to look and perform like new) for under $900 right now. At $863.47, this deal feels like a steal for one of the year's most talked-about phones. And yes, it's unlocked, so you're not tied to any specific carrier.
    Apple iPhone 15 Pro Max, 256GB, Natural Titanium - Unlocked (Renewed Premium): $863.47 at Amazon (down from $1,007.00, save $143.53)
    The 15 Pro Max has a titanium build, a durable material that makes the phone not just elegant but lighter than its predecessors. Weighing 7.8 ounces, it's also noticeably easier to hold despite its 6.7-inch Super Retina XDR display. That screen, by the way, boasts crisp ProMotion technology (which allows it to adjust its refresh rate) for smooth scrolling and a peak brightness of 2000 nits, making it perfect for everything from watching HDR movies to snapping stunning photos outdoors. Speaking of photos, the camera system is a powerhouse. The 48MP main sensor delivers incredible detail, while the 5x telephoto zoom, exclusive to the Pro Max model, takes close-ups to the next level. Whether you're capturing sprawling landscapes or zooming in on your cat's whiskers, the iPhone 15 Pro Max's photography game is solid.
    It's powered by the A17 Pro chip, which Apple claims is not just faster but also capable of providing console-quality gaming on this phone (for anyone who is into mobile gaming or multitasking across demanding apps). The 256GB of storage is another win, giving you plenty of space for all those 48MP photos, 4K videos, and apps without worrying about running out. Its battery life is hard to beat, too (according to this PCMag review that gave it an Editors' Choice award), comfortably lasting 20 hours even with heavy use.
    0 Comments · 0 Shares · 119 Views
  • The best digital frames for 2025
    www.engadget.com
    Making a good digital picture frame should be easy. All you need is a good screen and an uncomplicated way to get your favorite photos onto the device. Combine that with an inoffensive, frame-like design and you're good to go.
    Despite that, I can tell you that many digital photo frames are awful. Amazon is positively littered with scads of digital frames and it's basically the 2020s version of what we saw with knock-off iPods back in the 2000s. There are loads of options that draw you in with a low price but deliver a totally subpar experience that will prompt you to shove the thing in a drawer and forget about it.
    The good news is that you only need to find one smart photo frame that works. From there, you can have a pretty delightful experience. If you're anything like me, you have thousands of photos on your phone of friends, family, pets, vacation spots, perhaps some lattes or plates of pasta and much more. Too often, those photos stay siloed on our phones, not shared with others or enjoyed on a larger scale. And sure, I can look at my photos on my laptop or an iPad, but there's something enjoyable about having a dedicated place for these things. After all, there's a reason photo frames exist in the first place, right? A great frame can help you send photos to loved ones and share cherished memories with friends and family effortlessly. I tested out seven smart photo frames to weed through the junk and find the top picks for the best digital frames worth buying.
    What to look for in digital picture frames
    While a digital photo frame feels like a simple piece of tech, there are a number of things I considered when trying to find one worth displaying in my home. First and foremost was screen resolution and size. I was surprised to learn that most digital photo frames have a resolution around 1,200 x 800, which feels positively pixelated. (That's for frames with screen sizes in the nine- to ten-inch range, which is primarily what I considered for this guide.) But after trying a bunch of frames, I realized that screen resolution is not the most important factor; my favorite photos looked best on frames that excelled in reflectivity, brightness, viewing angles and color temperature. A lot of these digital photo frames were lacking in one or more of these factors; they often didn't deal with reflections well or had poor viewing angles.
    A lot of frames I tested felt cheap and looked ugly as well, which isn't something you want in a smart device that sits openly in your home. That includes lousy stands, overly glossy plastic parts and design decisions I can only describe as strange, particularly for items that are meant to just blend into your home. The best digital photo frames don't call attention to themselves and look like an actual "dumb" frame, so much so that those who aren't so tech-savvy might mistake them for one.
    Perhaps the most important thing outside of the display, though, is the software. Let me be blunt: a number of frames I tested had absolutely atrocious companion apps and software experiences that I would not wish on anyone. One that I tried did not have a touchscreen, but did have an IR remote (yes, like the one you controlled your TV with 30 years ago). Trying to use that with a Wi-Fi connection was painful, and when I tried instead to use a QR code, I was linked to a Google search for random numbers instead of an actual app or website. I gave up on that frame, the $140 PixStar, on the spot.
    Other things were more forgivable. A lot of the frames out there are basically Android tablets with a bit of custom software slapped on the top, which worked fine but wasn't terribly elegant. And having to interact with the photo frame via touch wasn't great because you end up with fingerprints all over the display. The best frames I tried were smart about what features you could control on the frame itself vs. through an app, the latter of which is my preferred method.
    Another important software note: many frames I tried require subscriptions for features that absolutely should be included out of the box. For example, one frame would only let me upload 10 photos at a time without a subscription. Others would let you link a Google Photos account, but you could only sync a single album without paying up. Yet another option didn't let you create albums to organize the photos that were on the frame; it was just a giant scroll of photos with no way to give them order.
    While some premium frames offer perks like unlimited photos or cloud storage, they often come at a cost. I can understand why certain things might go under a subscription, like if you're getting a large amount of cloud storage, for example. But these subscriptions feel like ways for companies to make recurring revenue from a product made so cheaply they can't make any money on the frame itself. I'd urge you to make sure your chosen frame doesn't require a subscription (neither of the frames I recommend in this guide needs a subscription for any of its features), especially if you plan on giving this device as a gift to loved ones.
    How much should you spend on a digital picture frame
    For a frame with a nine- or ten-inch display, expect to spend at least $100. Our budget recommendation is $99, and all of the options I tried that were cheaper were not nearly good enough to recommend. Spending $150 to $180 will get you a significantly nicer experience in all facets, from functionality to design to screen quality.
    Best digital picture frames for 2025
    This article originally appeared on Engadget at https://www.engadget.com/home/smart-home/best-digital-frame-120046051.html?src=rss
    0 Comments · 0 Shares · 138 Views
  • Hackers may have stolen hundreds of thousands of Rhode Islanders' sensitive info in RIBridges cyberattack
    www.engadget.com
    Hackers behind a cyberattack that targeted Rhode Island's public benefits system were able to get the sensitive data (including Social Security numbers and some banking information) of hundreds of thousands of people, and they have threatened to release it as soon as this week if they aren't paid a ransom, Rhode Island governor Dan McKee said in a press conference on Saturday night. The Rhode Island government opened a toll-free hotline on Sunday (833-918-6603) to provide information on the breach and how residents can protect themselves, but you won't be able to find out for sure if your data was stolen by calling in. People who may have been affected will be notified by mail.
    The attack targeted the RIBridges system, maintained by Deloitte, which is used to apply for Medicaid, Supplemental Nutrition Assistance Program (SNAP), Temporary Assistance for Needy Families (TANF), Child Care Assistance Program (CCAP), HealthSource RI healthcare coverage and other public benefits available to Rhode Islanders. A press release from McKee's office notes that any individual who has received or applied for health coverage and/or health and human services programs or benefits could be impacted by this leak.
    It's thought the hackers were able to get information including names, addresses, dates of birth, Social Security numbers and certain banking information. Deloitte first detected the breach and notified state officials on December 5, and determined on the 11th that there was a high probability that the implicated folders contain personal identifiable data from RIBridges. It confirmed the presence of malicious code on December 13 and subsequently shut the system down, before officials announced the attack to the public the same day.
    The system is now offline while Deloitte works to secure it, which means that anyone who needs to apply for one of the affected programs will have to do so by mail, and people who are currently enrolled won't be able to access the online portal or app. The state said it so far hasn't detected any identity theft or fraud relating to the attack, but it will be offering free credit monitoring to anyone affected by the breach.
    This article originally appeared on Engadget at https://www.engadget.com/cybersecurity/hackers-may-have-accessed-hundreds-of-thousands-of-rhode-islanders-sensitive-info-in-ribridges-cyberattack-194621262.html?src=rss
    0 Comments · 0 Shares · 146 Views
  • All I want for Christmas is the trailer for James Gunn's Superman movie, and it sounds like my wish will be granted very soon
    www.techradar.com
    The first trailer for 2025's Superman movie will reportedly take flight this week, and I can't contain my excitement ahead of its arrival.
    0 Comments · 0 Shares · 142 Views
  • The secret to feeling good? Make friends with your fridge
    www.techradar.com
    Your fridge isn't just a place to keep your cucumbers cool and your fruit salads fresh. It can be your foodie friend too.
    0 Comments · 0 Shares · 145 Views
  • Bitcoin rises to new record above $106,000 ahead of this week's Fed decision
    www.cnbc.com
    Bitcoin rallied to a new all-time high Sunday evening as investors awaited the Federal Reserve's final interest rate decision of the year.
    0 Comments · 0 Shares · 133 Views
  • Adobe shares suffer steepest drop in over two years on disappointing revenue guidance
    www.cnbc.com
    Adobe shares tumbled after the software vendor issued revenue guidance that fell short of analysts' estimates.
    0 Comments · 0 Shares · 136 Views
  • How to make practical water and bubble effects
    beforesandafters.com
    Plus, old-school motion graphics anim and optical effects.
    Today on the befores & afters podcast, we're chatting to director, cinematographer and VFX artist Christopher Webb about practical effects. Chris is the founder of FX WRX, an outfit that specializes in in-camera effects. I've talked to him previously about several projects, but today we're narrowing in on a Gatorade Propel spot achieved with some very fun water and bubble effects, and on a Tom Petty and the Heartbreakers video. For that video, FX WRX created some incredible analog motion graphic animation and optical effects, very '80s style. For each project, we go into detail about the shoot at FX WRX's studio, including with motion control camera equipment and bespoke setups.
    This episode of the befores & afters podcast is sponsored by SideFX. Looking for great customer case studies, presentations and demos? Head to the SideFX YouTube channel. There you'll find tons of Houdini, Solaris and Karma content. This includes recordings of recent Houdini HIVE sessions from around the world.
    Check out the chat above, and the final pieces and some behind-the-scenes images, below.
    The post How to make practical water and bubble effects appeared first on befores & afters.
    0 Comments · 0 Shares · 151 Views