
Automating Hospital Price Transparency v2 Data Collection


Matthew Robben

Published

11/12/2024

The Hospital Price Transparency “v2” Schema rolled out in July of 2024 had the potential to greatly increase data accessibility and standardization. We’ve been implementing automated data collection pipelines on the v2 schema, and are excited to report that we do indeed see significant improvements in data quality and availability. 

But as with any rule, there’s plenty of rule breakers out there. Read on to see how HPT v2 changes the landscape for hospital data gathering, what’s working, and where the hard parts still are.

HPT Overview

The Hospital Price Transparency Law, effective since January 1, 2021, requires hospitals to provide clear, accessible pricing information on the services they offer. For the initial implementation, hospitals were required to provide information in a “machine-readable” format, but CMS gave few specifics on structuring the data, which led to substantial challenges with data standardization and accessibility. 

Hospitals created files in disparate formats and fieldsets, or posted data into dynamic web portals that disallowed bulk access. This made extraction, aggregation, comparison, and analysis a labor-intensive and sometimes impractical task. As a result, our team at Serif Health focused on the payer data and limited our pursuit of hospital data collection. 

HPT Schema v2 enacted two major improvements to resolve these barriers to access.

  1. Mandated creation of a cms-hpt.txt file at the root of the hospital domain, pointing directly to the machine-readable files. No portals, interactive tools, or URL blocking allowed.
  2. Mandated use of one of three data schemas: CSV Wide, CSV Tall, or JSON, with strictly validated column or key names. 

Both are critical. The cms-hpt.txt requirement allows bulk discovery of files without resorting to headless browsers or user-agent hackery. Fixed data schemas mean extraction is straightforward - you only have to create and support three ‘parsers’ instead of, effectively, thousands. 
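The discovery step can be sketched in a few lines. This is a minimal sketch, assuming a simple normalization of each hospital homepage URL to its domain root; `classify` buckets HTTP status codes the way we tally them below (2xx = found, 403/429 = blocked), which is an assumption about how blocking typically manifests, not a spec requirement.

```python
from urllib.parse import urlparse

def hpt_url(homepage: str) -> str:
    """Normalize any page on a hospital's site to its root cms-hpt.txt URL."""
    parts = urlparse(homepage)
    return f"{parts.scheme}://{parts.netloc}/cms-hpt.txt"

def classify(status: int) -> str:
    """Bucket an HTTP status code for a rough compliance tally."""
    if 200 <= status < 300:
        return "found"
    if status in (401, 403, 429):  # typical bot-blocking / rate-limit responses
        return "blocked"
    return "missing"
```

Point `hpt_url` at each unique domain in your directory, issue a plain GET (curl or `urllib.request` both work), and tally `classify` over the responses to get your own compliance numbers.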

Automating Bulk Extraction with CMS HPT

The American Hospital Association reports 6,120 hospitals across the USA. To run extraction across all of them, the first thing you need is a hospital directory and URL list. Starting from something publicly posted and relatively exhaustive (like the Persius GitHub: https://github.com/TPAFS/transparency-data/blob/main/price_transparency/hospitals/machine_readable_links.csv), you can bulk-query every unique URL in sequence to see how many have a cms-hpt.txt file.

This gives us a rough TXT-file compliance rate for the broad set of hospitals out there. Other trackers have quoted compliance in the 60-65% range as of November 1. Doing a straightforward curl of the TXT files en masse, we get similar results: out of 2,771 unique URLs tested, 60.1% returned a valid HPT file, and 5.1% blocked the request. 

In a previous blog post, we covered CMS HPT request blocking and some strategies for working around it - check it out if you want to unlock that extra 5.1%. 

Complication: Many-to-many domain ⇔ hospital ⇔ MRF associations

But, you ask, didn’t you say there were 6,120 hospitals in the country? Why only 2,771 domains? 

This gets to an immediate (and essential) complexity of gathering hospital data with the V2 schema. A health system often creates and maintains multiple separate hospitals, which can all aggregate up into one or more text file listings. This can make our lives easier, as with HCA Healthcare hospitals in Texas, where a single text file ‘covers’ over a hundred hospitals in one fell swoop. But that’s not HCA’s only HPT file. You also have to grab their Florida entity, Midwest entity, Virginia entity, and Medical City TXT files, amongst others. 

Opening that first HCA Houston text file, you’ll notice another complication in the ingestion process: a single MRF entry can apply to multiple locations. The URL https://www.hcadam.com/api/public/content/62-1801360_hca-houston-healthcare-clear-lake_standardcharges?v=cb1f849b&download=true, for example, appears on five different lines of the file, under different hospital labels. 

Hospitals and systems also merge, acquire, divest, expand, and shut down each year.

All this to say, systems and hospital relationships are complex and dynamic. If you go down the path of automated ingestion, you’ll very quickly wind up having to build an operational process to curate the many-to-many relationships between a health system, its actual physical hospital locations, and the MRF files which cover those locations.
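A starting point for that curation process is a parser that surfaces the many-to-many structure directly. This is a sketch assuming the repeated `location-name:` / `mrf-url:` line labels of the v2 TXT layout; it groups hospital locations by the MRF URL they share, which is exactly the pattern in the HCA Houston file above.

```python
from collections import defaultdict

def parse_hpt_txt(text: str) -> dict[str, list[str]]:
    """Map each distinct MRF URL to the hospital locations it covers.

    Assumes the cms-hpt.txt layout of repeated
    'location-name:' / 'mrf-url:' line pairs per location.
    """
    mrf_to_locations: dict[str, list[str]] = defaultdict(list)
    current_location = None
    for line in text.splitlines():
        # partition on the first colon only, so URLs keep their scheme intact
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "location-name":
            current_location = value
        elif key == "mrf-url" and current_location:
            mrf_to_locations[value].append(current_location)
    return dict(mrf_to_locations)
```

Deduplicating the keys of the result is what takes a raw line count down to a distinct-MRF count; inverting it gives you the location-to-file direction for coverage tracking.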

Net: from the 1,666 HPT files successfully queried, we gathered 3,853 distinct MRF URLs covering 4,379 unique hospital locations. 71.5% coverage isn’t bad!

Defined Data Schemas: Simplified Parsing & Data Validation

V2 also introduced a defined and uniform schema for the pricing data itself, streamlining expected fields and their definitions to create a truly machine-readable format and allow for more effective automation and analysis. 

Key features of Schema v2 include:

  • Strict File Formats: Schema v2 requires the format to be CSV or JSON. CSV can be wide, in which no data is repeated but payer and plan names span a variable number of columns, or tall, in which each row contains only one payer but some common data is repeated across multiple rows to show pricing across the entire payer set.
  • Uniform Field Names and Types: Standardizing field names and data types minimizes ambiguity and ensures that data from different hospitals is consistent.
  • Nested Data Relationships: Schema v2 allows for nested data structures, supporting the complexity of healthcare pricing data and creating an organized format that facilitates automation.
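With only three formats to support, routing a downloaded file to the right parser can start with a shape-based sniffer like the one below. This is a heuristic sketch, not the full v2 data dictionary: the column tokens are illustrative, and it keys on the property that tall files carry a literal payer_name column while wide files embed payer and plan names inside pipe-delimited standard_charge columns.

```python
import csv
import io

def sniff_schema(raw: str) -> str:
    """Guess which v2 format a file uses: 'json', 'csv-tall', or 'csv-wide'.

    Heuristic only: JSON is detected by its leading brace/bracket; for CSV,
    a payer_name column signals the tall layout, while payer/plan names
    embedded in pipe-delimited standard_charge columns signal wide.
    """
    if raw.lstrip().startswith(("{", "[")):
        return "json"
    for row in csv.reader(io.StringIO(raw)):
        headers = [h.strip().lower() for h in row]
        if "payer_name" in headers:
            return "csv-tall"
        if any(h.startswith("standard_charge|") and h.count("|") >= 2 for h in headers):
            return "csv-wide"
    return "unknown"
```

The "unknown" fallback is the important branch in practice: it is the queue of files that need a human to write a custom template.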

Of course, in the real world no one follows a schema exactly. We’ve seen:

  • Tons of weird file extensions. CMS requirements give you a pretty constrained choice: you can pick either CSV or JSON. But that hasn’t stopped hospitals from getting creative and continuing to put schema’d data into non-standard file types and extensions, like .doc, .docx, .TXT, and .xlsx. 
  • Arbitrary nesting of data in different hierarchies within JSON files. Despite the clear JSON standard that CMS posted, with the goal being to avoid - in their words - “generating a deficiency”, we’ve seen files where code and price values are stacked within other code and price values, and top-level information, like the name and address of the submitting hospital, is only found deep within the file. 
  • Filler data. The documentation clearly says “Do not insert a value or any type of indicators (e.g., “N/A”) if the hospital does not have applicable data to encode”. In practice, many hospitals choose to be verbose in the absence of data: 0, Null, ‘N/A’, and None abound, and true/false flags live alongside numeric values in code columns. Even trickier, difficult-to-detect ‘filler’ values - for example, we’ve seen 5555 and 99999 used alongside actual price values - can make it harder to understand what’s real vs. filler and require additional filtering. 

Complication: Non-trivial manual work rate persists

Across the hospitals we’ve pulled that have the TXT file implemented, ~75% meet the schema (with minor allowances made for misspellings, pluralization, and data type mismatches) out-of-the-box. 

The corollary is that 25% of the hospitals out there don’t. To gather data from those files, you have to attempt ingestion, watch it fail due to schema violations, introspect why it failed, and then write a custom parser ‘template’ that remaps the headers or navigates the bespoke file structure to get the right data into the right columns. If that takes even 10 minutes per file, you’re looking at hundreds of hours of manual annotation labor to get a full data set from the hospitals that have a CMS HPT file.

And that’s on top of the roughly 30 minutes per hospital for the 1,741 non-TXT hospitals, where you’ll need to go to their website, navigate to the files, figure out if and how you can download them, and implement custom parsing logic for whatever likely-bespoke format they’re still using (it’s safe to assume that no TXT file also means non-compliance with the pricing data schema). Nearly a thousand hours of labor there.
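The back-of-envelope math behind those labor estimates, with the per-file minutes being the assumed inputs from the discussion above:

```python
# Hospitals covered by a TXT file but with non-conforming data:
# ~25% of the 4,379 TXT-covered locations, at ~10 minutes of
# custom template work each (both figures assumed from the text above).
txt_covered = 4379
nonconforming = round(txt_covered * 0.25)
hours_templates = nonconforming * 10 / 60   # "hundreds of hours"

# Hospitals with no TXT file at all: fully manual, ~30 minutes each.
no_txt = 6120 - 4379
hours_manual = no_txt * 30 / 60             # "nearly a thousand hours"
```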

For organizations trying to open and review a handful of the files, you’re probably good to attempt that without outside help. For organizations looking to gather ‘all of it’, expect that you’re going to need a small team fully dedicated to the initiative, on an ongoing basis. 

Conclusion

HPT v2 has been live for four full months. As this post detailed, the adoption rate is around two-thirds of the market (solid) and the schema enhancements are paying off with almost 75% of entities conforming to the specification. 

Despite this, gathering and normalizing all the data still requires substantial time and effort. Serif Health is committed to gathering the full hospital data set this quarter - for organizations who need broad access to hospital price transparency data, get in touch today!