undefined | Better HN

0 pointsyiksanchan4y ago0 comments

Hi NumberCruncher, thank you for your reply!

> do they have a feature store with built in time-travel functionality?

My feature store hasn't supported time-travel yet. But many SaaS implementations do, including Tecton, Hopswork, Splice Machine, etc. Open source feature stores haven't implemented this critical feature AFAIK.

> Why I didn't write about it? Because it was implemented in PL/SQL running on Oracle ExaData and in SAS. No one cares about feature stores implemented with tech like that.

Haha, I do. Actually I am thinking about implementing feature stores using some "old-school" DB technology. By the way, just curious since I've never used PL/SQL: Are you able to implement the time-travel functionality using pure SQL?

0 comments

NumberCruncher4y ago

> My feature store hasn't supported time-travel yet. But many SaaS implementations do, including Tecton, Hopswork, Splice Machine, et

Good to know, I wasn't aware of them. At the end of the day the make or buy decision is influenced by the cost factor. In our case the development costed ca. 50k€ (obviously no SV salary), we had a user base of 10 data-scientists (incl. myself) and someone in house who knows the ins and outs of the system and can implement new stuff. I have to admit I was sharing my office with our DB architect, it would be hard to put a price tag on his support. I don't know how the pricing of the SaaS solutions compares to this.

NumberCruncher4y ago

Almost all features (maybe with the exception of gender) were aggregations over time like "share of emails the customer opened in the last X days", or "did the customer use service Y in the past: y/n", etc. If you use a pseudo SQL code like:

CREATE OR REPLACE target_table AS

SELECT

i.customer_id

... some aggregation processing data from t ...

FROM init_table i

LEFT JOIN transactional_data t

WHERE i.... = t....

AND t.time_stamp BETWEEN i.reference_date - (X days) AND i.reference_date

GROUP BY i.customer_id

than depending on what you put into the init_table, you can obtain:

- in the prod environment: actual feature value(s) for one ore more customers (reference_date = current_date)

- in the model-dev environment: historic feature value(s) for one or more customers for one or more past dates (reference_date = [past_date1, past_date2, ...])

In different parts of the system (prod, model-dev) you can use the same SQL code, you only have to replace the init_table and the target_table. Implementing a new feature system-wide basically means putting a new aggregation into the SQL code, only once.

I used PL/SQL as glue code, like someone would use python. I.e. executing SQL code in a controlled manner, creating dynamic SQL the right way (out of the box Oracle solution) and the wrong way too (sticking SQL fragments as strings together), writing code-generators, managing indexes/partitions/views/materialized views/etc. for performance tuning, implementing centralized monitoring for bug and performance tracking (and to know how the projects of my colleges are going forward), and last but not least, for packaging. At the end of the day you have a library/modul (PL/SQL package) exposing functions (PL/SQL procedures). The "library" is partially type-checked on compilation on the DB, and than you grant the necessary execute rights to the DB users (distribution to the data-scientists) and they work with the historic results on their own VM (DB schema). Current code is searchable in the DB, all the users use the same (actual) version, code history is checked in into bitbucket. If a user wants to "fork" and create new features, he/she can just write plain SQL against the prod DB on his/her own and the new code will be migrated through pair programming into the prod environment.

Sorry for the formatting.

yiksanchanOP4y ago

Thanks for sharing!

This SQL-intensive actually reminds me of the feature store solution by Splice Machine (https://splicemachine.com/press-releases/splice-machine-laun...). They implement a HTAP database and use that in a similar as you use PL/SQL. I had a writeup on this as well (https://yiksanchan.com/posts/splice-machine-feature-store)

jamesblonde4y ago

Hopsworks supports time-travel since version 2.4 and it is open-source.

yiksanchanOP4y ago

Sorry I miss that

j / k navigate · click thread line to collapse

0 comments

NumberCruncher4y ago

> My feature store hasn't supported time-travel yet. But many SaaS implementations do, including Tecton, Hopswork, Splice Machine, et

NumberCruncher4y ago

CREATE OR REPLACE target_table AS

SELECT

i.customer_id

... some aggregation processing data from t ...

FROM init_table i

LEFT JOIN transactional_data t

WHERE i.... = t....

AND t.time_stamp BETWEEN i.reference_date - (X days) AND i.reference_date

GROUP BY i.customer_id

than depending on what you put into the init_table, you can obtain:

- in the prod environment: actual feature value(s) for one ore more customers (reference_date = current_date)

- in the model-dev environment: historic feature value(s) for one or more customers for one or more past dates (reference_date = [past_date1, past_date2, ...])

Sorry for the formatting.

yiksanchanOP4y ago

Thanks for sharing!

jamesblonde4y ago

Hopsworks supports time-travel since version 2.4 and it is open-source.

yiksanchanOP4y ago

Sorry I miss that

j / k navigate · click thread line to collapse