New Benchmine Measure to Compare Investment Returns of Employer 401(k) Plans

Benchmine, provided by OnlyBoth, launched in late November as a free, open-to-all website for comparing the performance of 55,000+ employer 401(k) plans, offering many novel analytic capabilities to users.

The federal data source (Department of Labor EBSA) reports many core measures of 401(k) plans, leaving it to users of the data to introduce derived measures that better enable broad performance comparison across plans of different sizes. This led to our incorporating total administrative expense ratio, defined as total administrative expenses (Line 2i(5)) divided by total assets (Line 1f), times 100. Note: all data sources referenced here are from Schedule H.

The Benchmine 401(k) engines needed a measure of plan-year investment returns that likewise enables fair comparison. The federal data reports on total income and net income during a plan year, but these include employer and participant contributions and rollovers, which tend to skew the returns on investment. Plans also sometimes transfer investment assets out of, or into, the plan, which further complicates fair comparison.

After consulting 401(k) industry experts, we decided on a new measure, yield on beginning-of-plan-year total assets (yield for short), defined as net earnings on investments (the sum of the 10 column (b) entries from section 2b minus investment advisory and management fees from Line 2i(3)) divided by total assets at the beginning of the plan year (Line 1f(a)), times 100. By itself, this doesn’t deal with the complications of mid-year asset transfers (section 2l), so we added a qualifying criterion that a plan’s asset transfers (incoming + outgoing) be less than 1% of its total assets at the beginning of the plan year. This prerequisite disqualifies about 5% of the 55,788 401(k) plans at Benchmine, which then get a value of N/A for their yield.
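To make the two derived measures concrete, here is a minimal Python sketch of the definitions above. It is an illustration only, not Benchmine's actual code: the function and parameter names are hypothetical, transfers are assumed to be compared by absolute value, and returning None for a plan with no beginning-of-year assets is our own assumption for the example.

```python
from typing import Optional, Sequence


def admin_expense_ratio(admin_expenses: float, total_assets: float) -> float:
    """Total administrative expenses (Line 2i(5)) over total assets (Line 1f), in percent."""
    return 100.0 * admin_expenses / total_assets


def plan_yield(
    investment_earnings: Sequence[float],  # the column (b) entries from section 2b
    advisory_fees: float,                  # Line 2i(3)
    assets_start: float,                   # Line 1f(a)
    transfers_in: float = 0.0,             # section 2l, incoming
    transfers_out: float = 0.0,            # section 2l, outgoing
    max_transfer_share: float = 0.01,      # the 1% qualifying criterion
) -> Optional[float]:
    """Yield on beginning-of-plan-year total assets, in percent, or None for N/A."""
    if assets_start <= 0:
        # Assumption for this sketch: no meaningful yield without starting assets.
        return None
    transfers = abs(transfers_in) + abs(transfers_out)
    if transfers >= max_transfer_share * assets_start:
        # Transfers must be less than 1% of starting assets, else the yield is N/A.
        return None
    net_earnings = sum(investment_earnings) - advisory_fees
    return 100.0 * net_earnings / assets_start
```

For a made-up plan with $10,000,000 in starting assets, $1,250,000 in investment earnings, and $50,000 in advisory fees, plan_yield returns 12.0. Adding $150,000 of incoming transfers (1.5% of starting assets) makes the same plan N/A.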

These three CY 2021 examples of employer 401(k) plans (names omitted here), from different total-assets brackets, stand out on their joint yield and administrative expenses:

  • Only PLAN (within the $10M-$50M bracket) has both such a high total administrative expense ratio (1.716%) and such a low yield (11.70%).
  • In California with its 209 ($250M-$1B) plans, only PLAN has both such a high yield (18.31%) and such a low total administrative expense ratio (0.002%).
  • PLAN has the highest total administrative expense ratio (0.360%) among the 196 ($100M-$250M) plans that have 1,000 to 4,999 total participants and have at least a 16.66% yield.
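One plausible reading of the "only PLAN has both such a high ... and such a low ..." insights is that no other plan in the peer group is at least as high on the first measure and at least as low on the second. The Python sketch below checks that condition. The Plan type, the function name, and the peer values are hypothetical, and this is our illustrative reading rather than the engine's actual algorithm.

```python
from typing import NamedTuple, Sequence


class Plan(NamedTuple):
    name: str
    expense_ratio: float  # total administrative expense ratio, in percent
    plan_yield: float     # yield, in percent


def is_only_high_expense_low_yield(target: Plan, peers: Sequence[Plan]) -> bool:
    """True if no other peer has an expense ratio at least as high
    and a yield at least as low as the target's."""
    return not any(
        p.name != target.name
        and p.expense_ratio >= target.expense_ratio
        and p.plan_yield <= target.plan_yield
        for p in peers
    )


# The target uses the first example's values; the peers are made up.
target = Plan("PLAN", 1.716, 11.70)
peers = [
    target,
    Plan("A", 1.900, 14.00),  # higher expenses, but a better yield
    Plan("B", 0.500, 9.00),   # lower yield, but cheaper to administer
    Plan("C", 1.200, 12.50),
]
print(is_only_high_expense_low_yield(target, peers))
```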

In conclusion, Benchmine is now equipped with good measures for both administrative expenses and investment returns, all in the service of enabling fair comparison, heightening performance transparency, helping to drive improvement, and empowering participant choices.

Raul Valdes-Perez

Infusing Performance Transparency into Employer 401(k) Plans

OnlyBoth is proud to infuse unprecedented performance transparency into the world of employer 401(k) plans, about 55,000 of them, in partnership with DCIIA, by applying a unique AI-based technology for comparative analytics (e.g., benchmarking) to the latest, completed EBSA 5500 and Schedule H data. DCIIA is offering this service as a member benefit.

Visit this DCIIA page for an introductory video as well as a series of ten brief explainer videos that illustrate how you can discover answers to the following questions about 401(k) plans:

  1. Where does a plan stand out from related peer groups?
  2. How does a plan compare to a user-selected peer group?
  3. What’s best in class, i.e., the best achievement on a given measure by a similar plan?
  4. How does a plan’s score compare to all others in the same industry?
  5. How do up to 10 plans compare side-by-side?
  6. What are the overall top- or bottom-scoring plans?
  7. What are the ranked scores for a group of plans selected by industry or geography?
  8. What are the top benchmarking insights for a group of plans that I select?
  9. What are the top benchmarking insights that mention specific data attributes?
  10. What plans match the characteristics, measures, and/or geography that I select?

A Comparison of 23 Healthcare Comparison Websites

This study examines 23 prominent healthcare comparison websites that U.S. consumers, and other healthcare participants, can use to find and evaluate healthcare providers in terms of, for example, their quality, patient experiences, and cost. Our emphasis is on free website tools that offer side-by-side comparison of two or more providers, in a way that informs consumer choice but also provides value to other users within the large and complex landscape of healthcare.

Our goals in carrying out this empirical study are these:

  • Inform potential users about the landscape of website tools, analogous to how each tool informs users about the healthcare landscape.
  • Inform website-tool designers about the scope of existing features that could be offered by their tools, and perhaps prompt thoughts about non-existent features that should be invented and added.
  • Enhance the transparency of the comparison tools themselves and help drive their improvement, much as data on healthcare providers contributes to transparency and helps drive improvement via several alternative pathways.

In science it is understood that when there exist numerous similar entities, it helps to overlay a structure based on identifying aspects on which they differ; such is our intent here. Often such structure leads further to devising categories that help make sense of the landscape, much as medical science forms categories out of individual diseases.

Evaluation Engines Recognized as Finalist for Healthiest Communities Data Challenge

At the recent Health Datapalooza held March 27-28, 2019 in Washington DC, OnlyBoth’s submission to the Healthiest Communities Data Challenge was recognized as one of the three finalists, all equal co-winners of the Challenge. We submitted our portfolio of four distinct comparative-analytics engines – benchmarking, comparison, discovery, and scoring – to extract maximum value from the rich county-level dataset on Social Determinants of Health, made available to the pool of 30 contestants, to which we added Census data on county populations. The resulting engines are accessible online.

Our supporting partners in this submission were HealthBegins and the Allegheny County Health Department.  The benchmarking engine, used to deeply assess the comparative performance of a single county against others nearby and nationwide, was augmented by HealthBegins with recommended actions to take for certain performance deficiencies. For example, here is a comparative deficiency for Bronx County:

Bronx County, New York has the most adults who don’t eat enough daily fruits & vegetables (77.60%) among the 64 counties with at least 876,764 in population (Bronx County, New York is at 1,471,160). That 77.60% compares to an average of 73.58% and standard deviation of 2.90% across those 64 counties. […]  Among those 64 counties, it also has the most adult diabetes (12.3%).

which can be addressed by the recommendations seen by clicking on Taking Action.

Institutions and communities can take a specific, multi-pronged approach to increase daily consumption of fruits and vegetables among adults.  The CDC Guide to Strategies to Increase the Consumption of Fruits and Vegetables describes each of the following strategies in detail. […]

Want to evaluate your city or state?  Let’s say Boston.  Assess the social determinants of health in Suffolk County or Middlesex County.  Then do a side-by-side comparison of the 7 counties surrounding Boston.  Then examine the biggest achievements and improvement opportunities (i.e., deficiencies) across all Massachusetts counties.

Migrating to California, here are two interesting comparative insights for San Francisco County and San Mateo County.

San Francisco County, California has the most violent crimes per 100,000 population (702.66) of the 78 counties with at least $78,621 in median household income (San Francisco County, California is at $81,294). […]

Only San Mateo County, California has both such a high median household income ($93,623) and such a high natural amenities index (8.19). […]

We at OnlyBoth are pleased to show that AI-style comparative analytics can enable unprecedented transparency in both healthcare and health, and thus help drive performance improvements and inform consumer choice, all for the public good.

Raul Valdes-Perez


Scope of Innovation of Healthcare Benchmarking Engines

The engines, powered by Artificial Intelligence methods and principles of User Experience design, show these technology-enabled advances over the benchmarking status quo.

1. A user experience based on selecting questions to be answered and getting noteworthy insights as answers, rather than pushing lots of data and dashboards without a clear sense of what is being answered and what is noteworthy. The first-encounter UI poses these questions:
  • How is this provider doing? (i.e., where does it stand out positively or neutrally?)
  • Where could it improve? (where does it stand out negatively?)
  • Where has it changed? (over the last year or two, what changes stand out?)
  • What’s best in class? (what are top achievements on specific measures by similar providers?)
  • Where does it stand in its county? (or other geography, based on scoring the insights found)

2. Insights are written as perfectly readable, and shareable, English sentences, rather than dashboards. This key novelty led to our trademarked “A sentence is worth 1,000 data.®” and addresses a problem identified in the National Academy of Medicine article Fostering Transparency in Outcomes, Quality, Safety, and Costs, that “Research has demonstrated that many of the current public reports make it cognitively burdensome for the audience to understand the data.” We believe that dashboards are fine to alert that a warehouse is on fire or your car is nearly out of gas, but not to motivate thoughtful deliberations on performance improvement.

3. Calculating provider latitude & longitude, which enables benchmarking each provider against others nearby, e.g., within 20 or 50 miles, or other distances selected by the user.

4. Insights are supplemented with highly-related facts which help the user understand the significance or scope of the stand-out behavior or outcome. These addenda are also written in precise English.

5. Peer groups are not limited to the usual state, national, and perhaps a pre-defined cohort. Instead, the engine does a massive search for peer groups, expressed as a simple combination of data attributes, in which the benchmarked provider stands out. Geographic proximity can be one of these attributes, alone or with others.

6. An especially novel type of benchmarking insight involves aligning two numeric measures. One measure expresses the stand-out behavior, while the second forms the peer group, possibly in combination with symbolic attributes. For example, “In Texas, Park Plaza Hospital in Houston, TX has the lowest nurse-communication rating (2 stars) of the 88 hospitals with as high a doctor-communication rating (4 stars).”

7. By specifying any known algebraic relationships among measures, the engine can insert action-oriented remarks such as the one italicized in this nursing-homes insight (see it online):  “Carroll Manor Nursing & Rehab in Washington, DC has the fewest total nurse staffing hours per resident per day (2.05) of all the 724 nursing homes that are located within a hospital. That 2.05 is 57% lower than the average of 4.8 across those 724 nursing homes. Reaching the average of 4.8 would imply an extra 80.9 nursing staff per day, assuming an 8-hour workday.”

8. Input data can be numeric, symbolic, yes/no, and even set-valued, which gives rise to innovative comparisons like this: “Of the 1,488 hospitals that have at least 4 stars as an overall hospital rating, Shasta Regional Medical Center in Redding, CA is one of just 2 that have a 1-star rating in each of cleanliness, communication about medicines, doctor communication, and quietness (4 total).”

9. As discussed in an AHRQ report on usage of hospital evaluation websites, consumers and healthcare professionals often need different content. So, we have introduced a “Switch Audience” toggle, visible to the user when an insight contains content that appeals to one but not the other, which lets users declare their roles. See the difference by switching the audience to “professional” at this insight on emergency-room wait times and noticing the paragraph that begins with “Note that …”

10. The final novelty is automation, so that many provider measures can be assessed with the same (human) effort, addressing this point by Dr. Robert Brook: “… quality must be measured in a comprehensive way in order to motivate an institution or physician to provide high-quality care. […] if just a few measures are used to assess quality, the quality of care delivered across all patients in all diseases will be distorted, emphasizing those things that are being measured. Fortunately, we have many well-tested comprehensive quality of care measures that can help prevent this distortion.” Moreover, automation enables introducing measures that express a change over time, and not just the last measurement period, so providers can be compared on how they’ve improved or gotten worse. For example: “… has the biggest plunge in cleanliness rating over one year (-2 stars) of the 1,193 hospitals on the East Coast.”
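As an illustration of item 3 above, forming a "nearby" peer group from latitude and longitude can be sketched with the standard haversine great-circle formula. This is a minimal Python sketch under our own assumptions (a spherical Earth with a mean radius of 3,958.8 miles, and hypothetical function names and data layout), not the engine's actual implementation.

```python
import math
from typing import Dict, List, Tuple

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def peers_within(
    target: str,
    locations: Dict[str, Tuple[float, float]],  # provider name -> (latitude, longitude)
    radius_miles: float,
) -> List[str]:
    """Names of all other providers within radius_miles of the target, sorted by name."""
    lat0, lon0 = locations[target]
    return sorted(
        name
        for name, (lat, lon) in locations.items()
        if name != target and haversine_miles(lat0, lon0, lat, lon) <= radius_miles
    )
```

A call such as peers_within("Provider A", locations, 20) would then yield the 20-mile peer group for a provider, and the radius can be whatever distance the user selects.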

Raul Valdes-Perez

A Unique Way to Get Others to Improve

Joe’s haircut is darn ugly. What are effective ways to persuade Joe, or other people and organizations, such as healthcare providers, to make improvements? One way is to issue orders, but that only works if you’re the boss. Another way is to reliably predict a bad outcome unless improvements are made, as in the case of budgets, health, safety, etc. Yet another way is to teach how to improve on a specific key measure, which can work if people or organizations are self-driven.


It’s often pointed out (e.g., in this Harvard Business Review article) that people are motivated by peer comparisons, which are effective because it’s human nature to notice others and be influenced by them, and because the comparisons are easy to grasp: Your Peer does better at X, and X is important, so try to measure up! But a comparison to a single Peer is subject to the defensive reaction that the Peer has very different circumstances, so the two aren’t comparable!

I wish to put forth a way to make peer comparisons that are arguably persuasive, but rare. People seldom think of them and they are hard to come up with without the help of automated comparisons of available data. These peer comparisons are characterized by a second measure, Y.

Consider telling Joe this comparison:  You have the ugliest haircut of everybody as good-looking as you! Notice that you are implicitly using two measures: (1) haircut ugliness, and (2) good looks. The peer group is everybody who is at least as handsome as Joe. Within this elite group, unfortunately Joe does the worst. On the one hand, Joe feels good about his comparison group, and on the other hand, he has the worst outcome, assuming he cares at all. And the peer group is large, unless Joe is stunning!

Now, for you logician readers, let’s acknowledge that “Joe has the ugliest haircut of everybody who is as good looking as him.” is absolutely equivalent to “Joe is the best looking of everybody with such an ugly haircut.” But psychologist readers will agree that the first version does better at motivating performance improvement, since a likely human reaction to the second version is “Well, at least I’ve got something going for me!”

It turns out that automated experiments with healthcare or business data turn up a large number of such peer comparisons. Here are three actual, but anonymized, insights taken from various healthcare sectors:

1.    A California hospital has the lowest communication-about-medicines rating (2 stars) of the 358 hospitals with as high an overall patient rating (5 stars). Those 2 stars compare to an average of 4.3 stars across the 358 hospitals.

2.    In the Southwest, a Texas home health agency has the fewest patients who got better at getting in and out of bed (19.1%) among the 1,651 home health agencies with at least 49.1% of patients who got better at walking or moving around. That 19.1% compares to an average of 65.3% across those 1,651 home health agencies.

3.    A Pennsylvania nursing home has the most short-stay residents who had an outpatient emergency department visit (34.1%) among the 317 nursing homes with at most 10.1% of short-stay residents who were rehospitalized after a nursing home admission. That 34.1% compares to an average of 9.8% across those 317 nursing homes.

The basic “shaming” message is this: Why are you so bad at X if you’re so good at the related measure Y? Everybody else with such a good Y is doing better than you! Of course, the world is filled with such potential insights, although coming up with verifiable ones may best be done with rigor by software, as long as data can be collected and analyzed.
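For readers who want to see the mechanics, here is a minimal Python sketch of the two-measure comparison, including the logical equivalence noted earlier. The function names and the (x, y) data layout are hypothetical, higher values are assumed to be better on both measures, and the sketch shows the idea rather than how any particular engine searches for such insights.

```python
from statistics import mean
from typing import Dict, Tuple

Measures = Dict[str, Tuple[float, float]]  # name -> (x, y); higher is better on both


def worst_x_among_as_good_y(target: str, data: Measures) -> bool:
    """True if every other entity with y at least the target's has a strictly higher x."""
    tx, ty = data[target]
    return all(x > tx for n, (x, y) in data.items() if n != target and y >= ty)


def best_y_among_as_bad_x(target: str, data: Measures) -> bool:
    """The logically equivalent phrasing: every other entity with x at most
    the target's has a strictly lower y."""
    tx, ty = data[target]
    return all(y < ty for n, (x, y) in data.items() if n != target and x <= tx)


def peer_group_summary(target: str, data: Measures) -> Tuple[int, float]:
    """Size of the peer group (y at least the target's, target included) and its mean x."""
    ty = data[target][1]
    xs = [x for (x, y) in data.values() if y >= ty]
    return len(xs), mean(xs)
```

Both predicates reduce to the same condition, namely that nobody else is at least as good on Y while being no better on X, so they always agree. What differs is the framing, and the first framing is the one that motivates.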

Instead of merely ordering Joe to get a new barber, presenting him with a book on hairstyling, or predicting that his love life is doomed unless he improves, let’s try pointing out how poorly he stands out as compared to his wonderful peer group! The same goes for Doris the hospital’s chief quality officer, Nancy the home health agency’s chief nurse, and Mary the nursing home’s director.

[First published on LinkedIn Pulse]

Raul Valdes-Perez


Standout Scores: Express the Comparative Performance of a Nursing Home with a Single Score, Based on Reported Insights

OnlyBoth has launched a new benchmarking-engine capability which objectively scores how each nursing home across the country stands out from others, both positively and negatively. The resulting standout score is a count of how well a nursing home stands out compared to various peer groups, as seen in the engine’s reported insights for that nursing home. Although not designed as a comprehensive ranking, the scores express comparative performance over a broad range of criteria.


According to the OnlyBoth standout scores for nursing homes, the top 5 U.S. nursing homes out of more than 15,000 in the country are these:

  • #1 University Post-Acute Rehab in Sacramento, CA
  • #1 Kaiser Foundation Hospital Manteca Distinct Part Skilled Nursing Facility in Manteca, CA
  • #3 Signature HealthCARE At Sts. Mary & Elizabeth Hospital in Louisville, KY
  • #4 Brian Center Health & Retirement/Cabarrus in Concord, NC
  • #4 Manorcare Health Services-Green Tree in Pittsburgh, PA

The scores have several uses. First, let’s say that you’re preparing a candidate list of good-performing nursing homes for a prospective resident. To score all the nursing homes in a county, click on the Score nursing homes button and enter its name in the county search box.

Second, while you’re evaluating the detailed performance of a nursing home, click on the left-side question: Where does it stand in its county? in order to highlight its ranking within all those peers.

Third, to score all the homes within a peer group you select, say for investigative or marketing purposes, e.g., all government-owned nursing homes, click on Query across nursing homes and formulate your own query.

To see what underlies the top standout scores, check out some key reported insights on California’s University Post-Acute Rehab in Sacramento and Kaiser Foundation Hospital in Manteca.

University Post-Acute Rehab is the only one of 36 nursing homes in Sacramento County which has a 5-star rating in each of overall, health inspection, quality measures, and registered-nurse staffing. The facility also has the lowest total number of health deficiencies (zero) of the 36 nursing homes in the county.

Kaiser Foundation Hospital is one of only two nursing homes in all of San Joaquin County that don’t have any facility-reported incidents, substantiated complaints, fines, or payment denials. It also has the lowest total number of health deficiencies (zero) of the 26 nursing homes in the county.

It’s very illuminating also to check out what underlies the worst standout scores, nationally or just in your own county.

The scores are completely transparent, just like healthcare is becoming with the help of automated benchmarking engines. You can calculate a nursing home’s score yourself, in seconds, by going through its reported insights and subtracting the negative ones from the positive ones, as explained here at the bottom.
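The do-it-yourself calculation described above amounts to a simple count. A minimal Python sketch follows; the "positive" and "negative" labels and the function name are hypothetical, and it assumes that insights carrying any other label do not affect the score.

```python
from typing import Iterable


def standout_score(insight_polarities: Iterable[str]) -> int:
    """Number of positive reported insights minus number of negative ones."""
    score = 0
    for polarity in insight_polarities:
        if polarity == "positive":
            score += 1
        elif polarity == "negative":
            score -= 1
        # Assumption: any other label leaves the score unchanged.
    return score
```

For example, a nursing home with seven positive and two negative reported insights would score 5 under this sketch.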

Our standout scores are similar in spirit to Nursing Home Compare’s overall rating, which is very complex. There is a substantial correlation (0.58) between standout scores and overall ratings. Standout scores are completely linked to a nursing home’s public, comparative performance along all the included data dimensions, and are completely automated regardless of new data attributes that may be added, e.g., on patient surveys, pricing, or consumer reviews.

By empowering new uses, such as consumers wanting to create a list of candidate nursing homes to visit or to evaluate more deeply, standout scores contribute to healthcare transparency and thus ultimately to the goal of driving performance improvement.

Why Comparing Healthcare Providers Needs Automation

I’ve lived for years in the same area of Pittsburgh, whose streets don’t follow a grid design since it’s hilly and pre-dates the automobile. Sometimes before driving to a familiar destination, I’ll check Google Maps, which alerts me to a favored route that I didn’t even know existed. I act on the suggestion, which usually turns out great. Is this unique to mapping, or can this happen in other domains of reasoning and discovery? How about healthcare?

Solution Spaces and Artificial Intelligence

Automated mapping helps me discover new routes not because I’m spatially challenged, but because the software explores side streets which motorists like me don’t consider. Instead, motorists tend to consider the larger, familiar streets that head toward their destination. Using Artificial Intelligence (AI) concepts, we say that mapping software searches for solutions within a larger space of possibilities than people do. In chess play, software considers piece sacrifices which none but top players will ever think of. The same occurs in scientific research. This should happen in healthcare, too, where there are huge potential gains for many stakeholders and rich data sets are publicly reported.

[Continue reading at LinkedIn Pulse …]



Unprecedented Data-Driven Performance Transparency in Healthcare, starting with Nursing Homes

I am proud to announce that, as part of OnlyBoth’s strong focus on healthcare during 2018, we just launched a web-based Nursing Homes benchmarking engine that deeply leverages the latest, rich data on 15,646 nursing homes published in January 2018 by Medicare’s Nursing Home Compare. The press release is here, and the new “front door” to the engine is online.

One of our goals is to bring the ultimate performance transparency to healthcare sectors, leveraging initially the tremendous work done by Medicare’s contractors, nursing-home inspectors, and nursing homes themselves to contribute data for public access.

To further this goal, we have chosen to make the service simple, quick, and especially affordable. To evaluate a single home, users pay $9 one-time with a credit card or PayPal. To evaluate any of the 15,646 nursing homes, pay $39. To perform queries that go across all nursing homes, pay $99. These payments give access to one quarterly edition of the engine. We expect to create new editions every quarter, using the latest published data. Read here about the features available at different price points, which support various roles within the nursing home industry.

Finally, I’ll emphasize that the benchmarking engine, which discovers comparative insights worth knowing and writes them up in perfect English, without injecting biased opinion anywhere, generates more words in its nursing-home application – around 80 million – within insightful sentences than are contained in the entire Oxford English Dictionary or the Encyclopedia Britannica.

But don’t let that volume scare you. Just as Google’s search engine stores nearly all the world’s web content, but brings you a manageable number of results, worth knowing, that are relevant to your query, so does a benchmarking engine!

Raul Valdes-Perez

Benchmarking the CDC 500 Cities on 28 Health Measures

The CDC’s 500 Cities Project recently published 28 health measures on the 500 largest U.S. cities (see them listed or mapped). The measures cover various resident behaviors, afflictions, medication, and screening. We at OnlyBoth downloaded the data and set up a cities benchmarking engine to answer these standard comparative questions: How is this city doing?, Where could it improve?, and What’s best in class?

Just enter any of the 500 cities. Then click on a left-side question to discover noteworthy peer groups in which your selection is near the top or bottom. Or, set up a fencemarking query and click Go at the bottom to, for example, learn the top insights among all 121 California cities, or to uncover comparatively-high binge drinking there (guess who?).

To appreciate this technology and its simplifying potential to motivate human and customer progress, compare to how standard dashboards have been applied to the 500 Cities data. Or, to understand why dashboards aren’t really up to the task of comparative performance evaluation, check out Why San Mateo Daily Journal Really Doesn’t Like California’s Education Dashboards.

Lastly, if you also wish to benchmark counties, read here.

A sentence is worth 1,000 data.®

Raul Valdes-Perez