Industrial Time-Series Data Quality and Reliability with Timeseer

Denis: Welcome to the
Industrial Data Quality Podcast.

My name is Denis Gontcharov, I'm your
host, and today I'm here with Thomas

Dhollander and we will talk about
the industrial time series, or more

specifically, how we can guarantee
data quality, what we mean by data

observability and reliability, and
how Timeseer plays a role in this.

But before I continue, let me
first introduce my guest speaker

on the show, Thomas Dhollander.

Thomas, welcome to the show.

Thomas: Thanks, Denis.

It's a pleasure to be here.

Denis: It is great to have you.

So I assume for people who are not
familiar with you, could you describe

what you do and what you've done in
the past in terms of data quality work?

Thomas: Sure.

So my background is in industrial
time series, so that's a good fit

with the podcast title. I have been
working with that kind of data

for the past 15-plus years now.

Most recently I am working with Timeseer.

We are building a data quality and
observability platform for industrial

time series data, both in manufacturing
as well as in utilities and IoT markets.

Before that, I was working as founder

and CTO at TrendMiner, a company that
built a tool for industrial analytics:

self-service analytics for process engineers

and related roles in and around the
control rooms of manufacturing plants.

And it was during that work that
we realized that the quality of the

data that we were dealing with, or
that our customers were dealing with

was often not good enough to scale
the analytics initiatives that they

were developing.

And that eventually led us to found
this new company called Timeseer.

My background even before that is in
engineering, primarily automation,

machine learning, and cognitive
and neural modeling as well.

Yeah.

I hope that sets the scene enough.

Denis: Yeah, absolutely.

That's very clear.

And in fact for the listener, I'm
very happy to have Thomas here on the

show because in fact we go way back.

I remember we had several calls now over
the past, at least two years it must have

been, where we talked and exchanged
ideas about data and data quality.

So I'm sure we are gonna have a very
interesting discussion for the listener.

Great.

So Thomas, let me focus back
on the main topic of this call,

which we defined as essentially
looking at industrial time series.

Could you help clarify
what you mean by this?

Thomas: By industrial time
series you mean, Denis?

Denis: Yes, exactly.

And maybe also focus a bit
more on the aspect of data

reliability and observability.

What is the difference between those
two with regard to that data?

Thomas: Sure, so the type of data
that we're talking about and I guess

we're on the same page here, is data
obtained from sensors, machines,

installations in manufacturing plants.

Could also be sensors that are
in utility networks or elsewhere,

but machine generated data
that is massive time series.

So samples over time being gathered
from many different sensors.

And that kind of data is often
forgotten when people talk

about data quality initiatives.

It's very often about
tabular relational data.

All the initiatives are typically
starting there, focused there on the

sales and marketing data, for instance.

But the data that is coming from all
those sensors, and there are increasingly

many, growing exponentially in certain
markets, is also very important

to bring under control.

That kind of data is at the heart of
operations of many companies in

the Fortune 500, if you think about it.

We started focusing exclusively
on the management of that type

of data with respect to data
reliability and observability.

And so what we mean by
reliable or high quality data in that

context is data that is fit for the
purpose for which it is being collected.

And that's a stretchable notion
'cause it depends on what you are

actually going to do with the data.

What kind of expectations you
have, or want to safeguard and

enforce on the data there.

But at the very basics, it often means
that you want the data to correctly

represent what is happening in terms
of the physical world, the physical

phenomenon that is being measured.

That implies you don't want
gaps, for instance, in the data.

You don't want spikes that
don't correspond to what your

actual temperature is doing,
if you're measuring a temperature.

And depending on the use case, of
course, you may need a lot more

than just those kinds of checks
or those kinds of safeguards.

For instance, if you're in a real
time setting, you want the data

also to come in without delay.

If you are sharing data in an organization
or beyond your own organization, you

want it to be discoverable as well.

So you want good metadata so people can
actually find the data, relate data from

different assets to each other and so on.
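
(To make the basic safeguards described here concrete, the following is a minimal sketch of a gap and spike check on one sensor signal, written in Python with pandas. The column layout, window length, and thresholds are illustrative assumptions, not Timeseer's actual checks.)

```python
import pandas as pd

def basic_checks(series: pd.Series,
                 max_gap: pd.Timedelta = pd.Timedelta("15min"),
                 spike_sigma: float = 5.0) -> dict:
    """Two illustrative quality checks on a time-indexed sensor series.

    Assumptions: the series is sorted by timestamp and holds one float per sample.
    """
    # Gap check: flag intervals between consecutive samples that exceed max_gap.
    deltas = series.index.to_series().diff()
    gaps = deltas[deltas > max_gap]

    # Spike check: flag samples far away from a rolling median, a simple robust baseline.
    baseline = series.rolling("1h").median()
    spread = series.rolling("1h").std()
    spikes = series[(series - baseline).abs() > spike_sigma * spread]

    return {"n_gaps": len(gaps), "n_spikes": len(spikes)}

# Synthetic example: a constant temperature with one spike and one gap.
idx = pd.date_range("2024-01-01", periods=500, freq="1min")
temperature = pd.Series(20.0, index=idx)
temperature.iloc[100] = 95.0                                 # implausible spike
temperature = temperature.drop(temperature.index[200:260])   # a 60-minute gap
print(basic_checks(temperature))
```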

The term observability is
a bit different I think.

I always like to go back to the basics of,
or the origin of the term observability.

I have a control systems background,
and the term observability is one

that, as far as I know, originates
from control systems theory.

And it actually means that the data you
are looking at, the available measurements,

are a sufficient description to look inside.

They carry all the information you need
to know about the hidden or latent

state of your system, of your process.

It means you, you have enough insight
into that state via the data you

are actually seeing or observing.

And yeah, I like to always go
back to that notion when people

talk about observability.

And so we are focused on looking
at both those aspects: observability

in general, but also reliability
assessment or quality assessment,

and cleaning and fixing the
quality of that data.

Denis: If I can just zoom in a bit
more on the aspect of observability.

Could you comment on why one may want
to observe the data, or for what purpose

we want to have this observability?

Thomas: Yeah, it's important to
understand what the state of the data

is as you start scaling your operations
and they start to depend increasingly

on data that's being generated.

Of course you have your operations,
your processes that are running with

your control loops, for instance, and
your control systems your operators

taking decisions on that data.

So from that perspective, from
an operations perspective, that

data is already very important.

But people are increasingly also starting
to see other use cases that are possible

and that depends on data obtained from
sensors in the field, for instance.

And so from that perspective, if you're
massively scaling the number of sensors,

you want to stay on top of observing
what is happening with your fleet of

sensors and devices. To name one
example that is not from a manufacturing

context but from a utilities context:
if you have smart meters being

rolled out by the millions and you're
a utility company, you're sending out

bills, you're doing leak detection
if you're a water company, or

you're trying to forecast demand
and so on, and you want to have

observability on all of those meters.

But it's a challenge because
you have millions of them.

So how are you going to go about
summarizing what's happening,

detecting problems without having
too many false alarms to deal with?

So that is what the challenge of
data observability, but also the

opportunity it represents, is all about.

Denis: Yeah, absolutely.

Before I continue focusing on the
challenges, I want to return to an

interesting point you mentioned because
I also see it happening in my projects.

We established
that this data is very important.

But at the same time, companies seem
to consistently underestimate just

how difficult it is to get right.

I don't know if you see the same thing in
your work, but very often a client

of mine or a company I've worked at,
they just think the data is there and

it's fine, but it's very often not.

Thomas: Yes, I recognize that.

It depends a lot on what state of maturity
and what we like to call the ambition

level the company is in.

If your ambition is to keep the company
running and to keep your factory operating,

then you might be fine with all the
workarounds that have been implemented

over the years to deal with unreliable
aspects of that data, and there might not

be a big feeling of "we need to fix this."

That is sometimes what we see,
especially in companies that are dealing

with this on a plant-by-plant basis.

They are still interested in data
operations and we still do a lot of

work with them in scaling use cases like
control loop monitoring or specific asset

or process problems they want to identify.

But just purely thinking about
data quality is not necessarily

a topic or a top concern.

Now as you start wanting to leverage
that data more and you start going into

lifting that data from the factories
or from the remote assets one level up

into the organization to do things like
feed into global initiatives, be that

energy monitoring initiatives, or bigger
modeling initiatives, or maybe to fuel

an as-a-service transformation. If you
are an equipment vendor

and you start gathering data and you
want to use data to transform your company

into a service-based model, then that
becomes a different picture altogether.

And then the data that is good enough
for operations is maybe no longer good

enough for those kind of use cases.

And data quality becomes another concern.

We also see companies that are
going into governance initiatives.

And then you have more the top down
motion of understanding that all the

data is also potentially a liability
if it's not managed correctly.

And you have teams dealing with it
from that perspective and pushing it down

on the rest of the organization.

I probably shouldn't use that
term, but you get the picture, right?

It's more of a push from a
central governance perspective.

And that is also feeding into
that discussion sometimes.

But the whole idea of being serious
about data quality is really a shift

from being reactive to proactive.

And of course that need is felt
a lot more if you are already

in a reactive firefighting mode.

And so the more use cases you have with
that data the more things start going

wrong when the data is not good, and
the more people end up in a firefighting

mess with all kinds of users, being
alarmed when the data is not right.

And the more the need is being felt.

But it depends a bit on
where you are in the journey.

Some companies are at an early stage,
have enough with some workarounds and

some patches they have done over the
years and don't really feel the need.

But as you start being more serious
about monetizing the data or using

it to transform your business or
for your industrial transformation,

then it becomes more important.

And it's especially true for markets
where the number of sensors is ramping

up or for markets that are transforming
to an as a service model, for instance.

Denis: I think it's a very important
point, and I imagine we have various

listeners from very different industries,
maybe some from manufacturing, some

are more from utilities, from energy,
and I imagine they each face different

challenges with respect to data quality.

So before we focus on the
challenges, I think it's helpful to

sketch, based on our experience,
which industries are, let's say, at what

level of the data maturity cycle.

Coming from the aluminum industry, I can
vouch that those sectors are mostly

concerned with just producing goods.

So for them, in my experience, the
data was really more of a byproduct

and used for control in the moment,
but not really used afterwards

for analysis or offering services.

Because they don't feel the need yet.

What was your experience?

What fields have you worked in and how
important was data quality to them?

Thomas: We've seen some trends in terms of
markets and verticals that are more

serious about data quality than others, or
seem to be more concerned or are already

more organized around it than others.

To some extent, I'm still wrapping
my head around this, to be honest.

Some markets we thought would be further
ahead but are not for some reason.

So markets for which this is already
a topic, and that already have some

governance practices worked out
and some roles and responsibilities

clarified, are things like the utilities
markets, where you already have data

validation happening, for instance, data
stewards being appointed, and all that.

Also with respect to the data coming
from metering points in their network.

So not just for tabular data, but also
for the industrial data we're talking about.

You have other verticals like
pharmaceuticals where data is

important for regulatory purposes.

And then it starts to be a
bit of a scattered picture.

We work with process companies
that are very serious about this

topic and that are actually rolling
out data quality monitoring and

observability across all their sites.

And we have other companies, especially
the ones indeed, that are more concerned

about just keeping the factory running,
very distributed in terms of ownership

across the plants, where that topic as such
is not the main focus point at the moment.

Again, that doesn't mean they're
not looking for data operations answers

to some of the specific business
questions they're dealing with.

And we also work with these companies,
but then the angle is a lot more

from a specific use case
that we're helping them tackle,
for which there are data challenges.

And then if we have a number of use cases,
they often also start seeing the idea

of shifting things left in terms of data
operations and trying to do more early

on and capture also more problems early
on so they can become more proactive.

But it often starts there
from very specific use cases.

Again, in markets like chemicals or
also in the more heavy industries,

things are a bit scattered.

Some companies are a lot
further ahead than others.

And in some markets like utilities and
pharmaceuticals, I think in general,

the market is a bit more advanced
already in terms of thinking about

data operations, data observability,
and data reliability management.

Denis: That makes sense.

So we have a very wide landscape.

Let's, for the sake of this discussion,
make a slice of this population and

focus on those companies that realize
that, okay, we need our industrial time

series 'cause we want to do things with
them, whether it's offering services

or something else, or compliance.

Let's assume we have a board of
directors that says, okay, we need this

data and we need it in good quality.

I assume the first step, as we
discussed in one other call, is to

get this data from the shop floor,
whether it's a sensor, whether it's a

historian or SCADA system to the cloud.

Am I correct to assume that there
is essentially a big shift going on

where they try to get the data from
the factory floor into the cloud?

Thomas: Well, it's certainly something that we see
materializing and people have been talking

about it for more than a decade now.

But we see a lot of companies
that are actually now starting to

move larger volumes of data, also
industrial time series data, into

platforms like Databricks or Azure.

In particular the lakehouse architectures
are popular in that respect.

That doesn't mean they're
necessarily fully abandoning

their traditional systems.

It's often a layer that is
created in addition to the

systems that are already in place.

And you start to have a landscape that's
quite complex because you have this

additional layer in the cloud where people
want to have an integrated view on data,

not just from industrial time series, but
combine it with other data sources and

you are starting to copy OT data in there.

On top of that you have IoT streams
that are maybe directly going

there, sometimes retrofitted, again,
into the legacy or older systems.

So quite a complex landscape, which is, by
the way, another trigger to think about:

How do you guarantee that all
these pipelines, these more complex

landscapes, are working
according to what you expect?

Do you get all the industrial time series
data you expect to get in your Databricks

setup, or are you missing some?

Is it the same data that you're
seeing there or is the data

somehow altered on its way there?

That evolution is happening, definitely.

And in particular, Databricks
is pushing a lot to get more

footprint in that side of the landscape.

Denis: Just to clarify for the listener,
with the non-OT data, I guess with the

existing systems, you mean things like
an MES system, relational data, sales

data, so anything but the big time series.

Is that correct?

Thomas: Well, with existing
systems I also meant

the existing enterprise historians
that people may already have.

But then maybe they also route the data
to Databricks or some other lakehouse

architecture on AWS or Azure that they're
adopting or Google if they want to.

I'm also talking about time series data
itself; that is often still going

to two destinations, if you will.

One is the more classical OT
systems, the enterprise historians.

And then the other part, the new part
basically is then the lake house.

Now, what we do see is that with the
emergence of the lakehouse, the center of

gravity with respect to the ownership
of this is also shifting towards IT.

I'm not saying it is shifting
to IT; it's shifting towards IT.

It's always somewhere in the middle
between OT and IT, but that's

an interesting observation.

And by the way, when I talk about
enterprise historian vendors, some

of them are doing a really good
job of also having an offering that

is actually on par with, or that is
competing to be, that new destination.

If you think about AVEVA with Connect,
for instance, they're trying to

conquer their place, using their unique
capabilities to establish a footprint

in that second tier of the landscape that
is all about bringing more data together,

sharing with other companies, and so on.

Denis: You mentioned already a couple
of challenges, so let's now dive

deeper on why this is difficult.

'cause I imagine if you're,
let's say an executive, you

may think what's the big deal?

I'm shifting data from my
shop floor to the cloud.

How difficult can it be?

And just to spark the discussion I'll
mention the points that you mentioned:

The first one was: do you get all the data?

So I assume that's
about data completeness.

The second one was: is the data the
same, or did it change along the way?

And finally, maybe the biggest
point we can tackle after this is

the whole OT and IT discussion.

Because I agree that the difficulty
often arises because of this new interface,

when IT needs to get things done from OT,
but OT typically doesn't really

want to share that.

Maybe let's start with the simple question
first, about getting all the data.

Thomas: Yes.

The first question is: do you even want
to get all the data into that

new layer in the landscape?

Maybe you don't, maybe you just
want to do it use case by use case.

Maybe you have enough with aggregated
data, but then do you lose a lot

of information content or not?

So those questions pop up when
you are going down that route.

You probably need to add more
context to the data.

If you have data on the shop floor,
people typically understand very well

what a tag means just by
looking at its name.

But if you are going to use that data on a
higher level in the organization, then you

might need more context: one, for people
to understand what they're dealing with;

two, to do analysis that is
cross-asset, in a global

context, for instance.

You may need to know what this
measurement is about in a structural way.

So metadata becomes more
important on that level as well.

Now, the challenges are obviously also
on the lowest level; so far we have been

skipping to the challenges related
to the interface with the cloud layer,

but the challenges are
also on the lowest level.

If you just think about the data itself,
you have things like sensors being

replaced and sometimes popping up with
different units after a replacement.

You have things like drift because
of calibration issues; you

have sensors that just stop working
for a while, or that flatline.

You have spikes, you have
compression issues if you've not

configured your systems correctly
and they're compressing the data.

That's the equivalent of
the aggregation question.

How do you aggregate data?

How aggressively do you compress data?

You have the missing metadata.

You have the fact that the data is
not equidistant, which is

something that you either change
when you move data to the cloud

layer, but then you need to be very
careful about it, or you inherit from

the original systems, where the data
updates or the points that are actually

worth storing are not equidistant.

So you need to be careful when you apply
any type of algorithm to that data.

Then there's the
fact that the volumes are big.

There's the structure of the data, which I
talked about a couple of times already: the fact

that you want asset structures or device
fleets and to be able to analyze those.

So there are many challenges
that are intrinsic to the data

and that are just being inherited.

Whether or not you build that new layer,
they are probably

already present in your systems.
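
(As an illustration of two of the intrinsic issues mentioned here, flatlining sensors and drift, the following is a hedged sketch of how they could be flagged; the window lengths, thresholds, and the idea of comparing against a redundant reference signal are assumptions for the example.)

```python
import numpy as np
import pandas as pd

def count_flatlines(series: pd.Series, min_duration: pd.Timedelta) -> int:
    """Count runs where the value does not change for at least min_duration."""
    run_id = (series.diff() != 0).cumsum()          # label runs of identical values
    spans = series.groupby(run_id).apply(lambda s: s.index[-1] - s.index[0])
    return int((spans >= min_duration).sum())

def drift_per_day(series: pd.Series, reference: pd.Series) -> float:
    """Estimate slow drift as the linear trend of the difference against a
    reference (e.g. a redundant sensor or a balance model), in units per day."""
    diff = (series - reference).dropna()
    days = (diff.index - diff.index[0]).total_seconds() / 86400.0
    return float(np.polyfit(days, diff.to_numpy(), 1)[0])   # least-squares slope
```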

The whole idea of shifting from this
reactive to this proactive mode is

basically that you are on top of those
challenges, that you are managing

them rather than reactively concluding
that when you start a project,

the data is just not good enough.

Or when a user complains, you
need to trace back where it

went wrong in the systems.

Denis: Two questions.

The first one: you mentioned non-equidistant
data. How would you describe this?

Thomas: What I mean by that is that
the spacing between two samples in time

is not always the same for one signal.

If you have one signal, then of
course at the lowest level, on the

transmitter level, for instance,
data is sampled at a certain

frequency. But at a higher tier in
your landscape, and that's true if you

route data into an enterprise historian, for
instance, the data is being compressed,

or the data points that are actually
selected for storage are cleverly

chosen to minimize the footprint on disk.

That means the data is no longer
necessarily equidistant, as it was

at the level of your OPC server or wherever.

So there are mechanisms that have
been built into those systems that

reduce the volumes, but that
also make the data non-equidistant.

And that is something that
you need to deal with when

dealing with this kind of data.
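
(To illustrate why non-equidistant samples need care, here is a small sketch comparing a naive average of the stored points with a time-weighted average that accounts for how long each value was held. The step-hold assumption and the numbers are hypothetical, mimicking a compressed historian export.)

```python
import pandas as pd

# Hypothetical compressed export: samples are only stored when the value changes,
# so the spacing between points is irregular.
data = pd.Series(
    [10.0, 10.5, 25.0, 10.2],
    index=pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 02:00",
        "2024-01-01 02:05", "2024-01-01 02:10",
    ]),
)

# Naive mean treats every stored point as equally important,
# so the short-lived excursion to 25.0 is heavily over-weighted.
naive_mean = data.mean()

# Time-weighted mean: hold each value until the next sample (step interpolation),
# then weight by the duration it was actually held.
durations = data.index.to_series().shift(-1) - data.index.to_series()
durations = durations.fillna(pd.Timedelta(0)).dt.total_seconds()
time_weighted_mean = (data * durations).sum() / durations.sum()

print(f"naive mean: {naive_mean:.2f}, time-weighted mean: {time_weighted_mean:.2f}")
# The naive mean is about 13.9, while the time-weighted mean is about 10.6,
# which better reflects what the process was doing most of the time.
```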

Denis: Yeah, that's clear.

Thanks.

As we can see from this discussion,
the problems occur across the entire

chain, from the source on the left to
the cloud and Databricks on the right.

We usually spot them at the very
end, when it's already too late, and

I think it's very hard to fix then.

Then we also have to go through
the interface between OT and IT.

You mentioned focusing more on a
proactive approach instead of a reactive

approach. Can you zoom in a bit more on
that: what you mean by this and why we

should focus on problems at the source?

Thomas: What I mean by that is, for
instance, if you think

about anomalies that occur in a process:
if you're operating a process

plant, many companies are monitoring
for process anomalies and then starting

a root cause analysis to fix an issue.

But, if you are already monitoring sensor
integrity continuously, for instance,

then you might already be able to track
sensor and data health issues beforehand.

Because we see that many of those
RCA analyses lead to the

conclusion that a sensor was actually
the root cause of the problem.

A sensor was not reliable.

So that's one way of being more proactive
about it: just doing that upfront.

But it extends into other areas.

So instead of a plant of a factory
detecting a problem with process

stability and then starting to
optimize or tune a problematic control

loop, you can continuously monitor
all your control loops across all

your sites all the time and prioritize
initiatives on them continuously, rather

than waiting until someone flags an
actual problem with the quality of the

product you produced, for instance.

The same with starting
an analytics project.

I gave that example already.

If you start an analytics project,
you want to do a modeling exercise, and

you then extract the data and conclude
that the resolution of the data is just

not good enough to do that project.

If you are proactively
monitoring the data, you could

already have seen that and could
already know whether the data is good enough

before you start doing that project.

Another down-to-earth example: we recently
had a site getting a 400k

EUR electricity invoice and wanting to
verify whether that invoice was correct

or not. They took a look at the meters, which had
said flat zero for a long time already.

So they had no way of contesting
that invoice or even understanding

whether it was correct or not.

So if you are already proactively
monitoring those measurements, you can

be a lot more proactive in all of this.

But it's also about glitches
in measurements that

cause compressors to stop.

It's about a turnaround where wires
are being flipped and the sensor

is mounted the wrong way, leading
to instabilities in the process.

It's outlier values that affect
statistical process control projects.

There are so many examples of things
we see going wrong because people are

not really on top of their data and
the integrity of the data all the time.

Denis: I think it makes a lot of
sense to work on a proactive level.

But I do see one big challenge, and
I'm really curious, I want to learn

more about how Timeseer manages this.

We are talking about an IT system.

So essentially it's usually people
in IT working with the data, but the

proactive approach requires us to
work on the source, which is more

in the OT realm, far away from IT.

How do you bridge this gap between both?

Thomas: There is of course a continuum,
from issues where

the root cause lies with the physical
sensor or the assets, typically something

an OT stakeholder or even an E&I person
has to attend to and look at, up to more IT

problems of pipelines and systems being
out for a while, or systems causing delays

in the timing of samples arriving.

So there is a true challenge, a real
challenge in terms of doing the triage

of things you detect and whether they
should be attended to by an IT person

or more by an OT or a business user.

So for that, it is important that, to
begin with, you do the right

checks for the right kind of data.

So we have developed suites
of checks that are meaningful to

apply to different sensor types,
and even to different specific use cases.

And we have checks that you can do
in a generic sense that will detect

things like data gaps and so on.

And then the whole idea, of course, is
that someone is responsible for attending to those

problems, to basically look into them
and do the triage, or to configure the

alerts so they go to the right channels.

There is some effort involved there
to do that, but with the right kinds

of checks, tailored towards a
specific use case, a specific sensor,

and so on, you can do a lot to
make sure that the problems are basically

ending up at the right person's desk.
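
(As a hypothetical illustration of "the right checks for the right kind of data" and routing alerts to the right channel, here is a sketch; the check names, sensor types, and channel names are invented for the example and do not reflect Timeseer's actual configuration.)

```python
from dataclasses import dataclass

@dataclass
class CheckSuite:
    """A named set of checks plus the channel its alerts should be routed to."""
    checks: list[str]
    alert_channel: str  # e.g. an OT maintenance queue or an IT pipeline channel

# Hypothetical mapping from sensor or feed type to a tailored suite of checks.
SUITES = {
    "flow": CheckSuite(
        checks=["gap", "flatline", "negative_value", "drift_vs_redundant"],
        alert_channel="ot-maintenance",
    ),
    "temperature": CheckSuite(
        checks=["gap", "spike", "frozen_value"],
        alert_channel="ot-maintenance",
    ),
    "pipeline_feed": CheckSuite(
        checks=["late_arrival", "schema_change", "volume_drop"],
        alert_channel="it-data-platform",
    ),
}

def route(sensor_type: str, failed_check: str) -> str:
    """Decide where an alert for a failed check should land."""
    suite = SUITES.get(sensor_type)
    if suite is None or failed_check not in suite.checks:
        return "triage-inbox"  # unknown combinations go to a human triage queue
    return suite.alert_channel

print(route("flow", "flatline"))               # -> ot-maintenance
print(route("pipeline_feed", "late_arrival"))  # -> it-data-platform
```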

Denis: So I hear there's a very big,
let's say, data governance component.

Is that fair to say?

Would that be a part of the
solution to look at who fixes what?

Thomas: Yes, of course. In practice
you have two types of companies, I

would say: companies that really have a
governance structure already thought out

and established with respect to
this, and that have data steward roles

formally appointed and things like that.

There are not too many of those;
in some markets, yes,

in other markets, no.

And you have companies
that don't have that.

And that's not necessarily a problem,
because a data steward role is

not necessarily a new hire; it's a
responsibility someone takes, and it typically

sits, on a use case per use case basis,
or on a data product basis, which is maybe the

better way to frame it, within a business team.

So picture a digital twin
initiative: then you may have a team

responsible for the digital twin.

If data problems are being detected
that affect the workings of that

digital twin, then that will end up on
the plate of the team that is

managing that digital twin solution
and that takes the end responsibility

for the digital twin and the quality
of service of that digital twin.

So you probably already have those people
who are actually taking

responsibility in that project.

It's just a matter of deciding
who deals with what.

Of course, if you extend beyond just pure
data quality, then we also pick up

problems that are of direct interest to
those business people immediately:

not pure data quality issues, but things
that go beyond data quality and

point to asset or process problems.

Yeah, those are often of direct
interest also to those business users.

So it's not necessarily an
extra burden in that sense.

And it's typically not
net extra hires in terms of data

stewards that you need.

You just need to think a little bit about
who is managing the infrastructure side of

this and who is actually exposing the
capability of data operations, data

reliability and quality management.

Who is basically owning that
infrastructure and that service

in the landscape, which is
typically more of a central team.

And then you have the business teams that
are basically benefiting from the insights

that we generate or that other solutions
for this kind of problem generate.

Denis: It does make sense.

In fact, what we are focusing here on
now is the very big people and process

component of data quality, right?

I think a very big contribution
to the solution has to come from

a better organization and better
governance of data and systems.

Thomas: Yes. It's not unimportant to note
that data quality is not only a

technical problem.

Half of it is an
organizational question.

How important do you think the topic is,
and for which use cases is it important?

And how are you going to organize
yourself around those

use cases and those topics?

Yeah, so again, certain companies
are a lot further advanced, have

very formal structures there.

But in essence, what we often see is
that it boils down to two components.

You have an ownership part of
it, which is the teams that are

responsible for the central
systems that are basically the preferred

solutions to deal with those topics.

And you have the teams, or
the data product teams ideally,

that are taking responsibility for
data products in the landscape.

What is often forgotten there is that
the data historian as a whole, or the

equivalent of that in a cloud
tier, is also to a certain extent a

data product on its own, because people
can actually tap data from there and

can start doing work on that data.

And so that is a data product
that also requires some governance,

at least in a broader sense.

So we often get the question:
do we need to monitor all data?

And then we say, yeah, you should
probably do some basic sanity checks on

all the data, just to make sure that when
people touch data from the data historian,
for instance, they have some minimal
for instance, they have some minimal
guarantees that they can rely on with

respect to what to expect from that data.

But then of course if you have
specific use cases, I mentioned the

digital twin, but it might be an
energy dashboard use case or a billing

use case or a specific AI model.

For those use cases, you might want to go
a lot deeper into the data quality and the

data guarantees that you want to provide there.

So you can think of this as a layered
service level concept where you have

a basic service level that you want
to guarantee on a basic data product

like your enterprise historian with
its APIs and SDKs, and then you

have very specific data products
that are maybe from a business

perspective more directly critical.

And you probably want to give deeper
and more far-reaching guarantees on those.
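
(A minimal sketch of the layered service-level idea as it might be written down in configuration; the tier names, thresholds, and check lists are assumptions for illustration.)

```python
# Hypothetical service-level tiers for industrial time series data products.
# A baseline tier applies to everything tapped from the enterprise historian;
# a stricter tier applies to business-critical products such as a billing feed.
SERVICE_LEVELS = {
    "baseline-historian": {
        "max_gap_minutes": 60,        # some missing data tolerated
        "max_staleness_hours": 24,    # data may lag by up to a day
        "checks": ["gap", "flatline"],
    },
    "billing-feed": {
        "max_gap_minutes": 5,
        "max_staleness_hours": 1,
        "checks": ["gap", "flatline", "spike", "unit_change", "counter_decrease"],
        "requires_validation": True,  # human sign-off before the data is certified
    },
}
```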

Denis: I like hearing
the word data product.

It really reinforces the idea that you
should treat data as something important

to produce and take care of it.

We talked quite a lot about the people
and the process aspect of this problem.

As you mentioned, there's
also a technical component.

Let's imagine that in our fictive
enterprise we have a motivated board.

Data quality is important.

They have sufficiently, let's say,
modern people and processes in place,

and clear ownership of the data.

What would be the next frontier
of data quality difficulty?

Thomas: One of the biggest challenges
in this is making it possible

to scan lots of data, which is
a challenge in its own right.

We have spent quite some effort getting
that right. But it's also scalability

from a manageability point of view.

Like how do you keep things manageable
as the data volumes that you're

actually scanning increase drastically.

You don't want to get
a lot of false alarms.

You also don't want to have
people spend days and days configuring

stuff and keeping things up to date.

So how do you keep this maintainable?

How do you make sure it scales also
in terms of the handling of results?

That is, I think, one of the bigger
challenges. We often see

people starting with: okay, we are going
to do data quality, let's configure some

rules and run them
on the data. And that works quite well.

But then of course, as you are starting
to scale the number of sensors that

you're assessing, keeping those rules
tuned correctly, so you get the right

number of alerts, not too many
and not too few, is a real challenge.

You might want to do that more based
on the statistics of the signal.

So you get into this model-based
step, where you're trying to

model the statistics and
the behaviors of the signals

and do it based on that.
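
(A minimal sketch of the shift from a hand-tuned rule to a threshold derived from the signal's own statistics; the window length and z-score cutoff are illustrative assumptions.)

```python
import pandas as pd

def fixed_rule_alerts(series: pd.Series, limit: float) -> pd.Series:
    """A hand-tuned rule: alert whenever the value exceeds a fixed limit."""
    return series[series > limit]

def adaptive_alerts(series: pd.Series, window: str = "6h", z: float = 4.0) -> pd.Series:
    """A statistics-based rule: alert when a value deviates more than z rolling
    standard deviations from the rolling mean, so thresholds track each signal."""
    mean = series.rolling(window).mean()
    std = series.rolling(window).std()
    score = (series - mean) / std
    return series[score.abs() > z]
```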

But then the next steps come: maybe
you still get too many alarms.

How do you tweak the sensitivity
of what you're doing?

And you can use statistical significance,
but maybe that's not really giving

you the right output yet, because
something that is significant

is not necessarily interesting.

We could have a whole debate about that
topic. And then, as you start going

into real-time monitoring, you
want to deploy something that is

actually doing this on an ongoing basis.

Not purely assessing a data set, but
running along with your process,

which generates new data all the time.

That's yet another challenge to
nail that, to get that right.

Then you start to see that you need the
data steward layer: for people to

actually look at results in a good
way, to get some high-level insights

presented that they can then dive
into, evaluate, and maybe do

interactive things with the data as well.

Then you start to realize that,
yeah, maybe you also want to

fix data in some cases, right?

So it's not just looking at
data and finding problems.

Maybe you also want to have pipelines
that actually alter the data and fix it.

Not necessarily to override the
original data, but to then generate

new derived versions of the
data that are more high quality.

That is a very big challenge.

I can tell you we spent a lot of
effort getting that implemented.

Especially if you combine it with a real-time
context, where new data comes in all

the time and you already have data that is
partially fixed: how do you deal with that?

And then you have questions
like, how do you certify data?

So it's really fit for use and
people are actually committing to

it being fit for use in some cases.

So if you bring all those ingredients
together, the technical challenge of

implementing a data quality solution
for industrial time series data is big.

And we see people happily starting on
that first step, but once they start

seeing this full picture, they realize
it's maybe wise to partner up on this topic.

But it depends on where the
company is and of course the data

volumes, the needs in terms of
stewardship and so on and so forth.

Denis: I think that's very interesting.

And I assume with the fast pace of
modern technology, I imagine we have

some new possibilities in that regard.

How can you leverage technology to address
these challenges you just mentioned?

Thomas: Yeah, there is a lot of
technology to use these days.

There are different parts to
the challenge that I just mentioned.

Part of it is just, yeah,
implementing a good, usable solution.

Not only a UI, but also GitOps,
SDKs, Power BI connectors, and so on.

So it's just work, basically, to get that right
and to understand how a usable interface works.
You have things like the Apache Arrow
ecosystem, for instance, which has

transformed a bit the way we think about
processing data, or batches of data, from

sources like time series data stores.

And that's also the technology
that modern solutions for time

series are built on or adopting.

You have the Delta Lake concepts
themselves, which are also relevant,

not only as a source, but also to
basically leverage for internal

storage structures as well, and so on.

So there are technologies like that.
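
(As a small example of the Arrow-style batch processing mentioned here, the following sketch scans a Parquet dataset of sensor samples in record batches with pyarrow instead of loading everything at once; the directory path and column names are hypothetical.)

```python
import pyarrow.dataset as ds

# Hypothetical partitioned Parquet dataset of time series samples with
# columns: tag (string), ts (timestamp), value (float64).
dataset = ds.dataset("sensor-data/", format="parquet")

total_rows = 0
missing_values = 0

# Stream the data in record batches so memory stays bounded even for large volumes.
for batch in dataset.to_batches(columns=["tag", "ts", "value"], batch_size=100_000):
    total_rows += batch.num_rows
    # The per-batch null count is a trivial completeness metric.
    missing_values += batch.column("value").null_count

print(f"rows scanned: {total_rows}, missing values: {missing_values}")
```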

Now there is AI and ML, and let's
make that distinction a little bit,

to be leveraged in different areas.

Here, for instance, you can leverage
foundational models for time series

imputation to fill in gaps,
or just plain machine learning.

For anomaly detection, you can
leverage a lot of these techniques as

well. You can also tap into LLM capabilities,
large language models, to help

with things like extracting metadata
from spec sheets, or to get help in

suggesting the right checks to use for
a certain use case or a certain sensor.

And also to help interpret
results; we start to see that

becoming possible these
days with technology like that.
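
(To make the "plain machine learning" option concrete, here is a hedged sketch using scikit-learn's IsolationForest for anomaly detection and a simple time-based interpolation for gap filling on a synthetic signal; this is a generic illustration, not how Timeseer or a foundational time series model does it.)

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic signal: a slow sine wave with noise, two injected outliers and a gap.
idx = pd.date_range("2024-01-01", periods=1000, freq="1min")
values = np.sin(np.arange(1000) / 50.0) + np.random.normal(0, 0.05, 1000)
values[[200, 600]] += 4.0                     # injected outliers
series = pd.Series(values, index=idx)
series.iloc[400:430] = np.nan                 # a 30-minute data gap

# Anomaly detection on simple per-sample features.
features = pd.DataFrame({"value": series, "diff": series.diff()}).dropna()
model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(features)          # -1 marks anomalies
anomalies = features.index[labels == -1]

# Gap filling: time-aware interpolation as the simplest possible imputation.
filled = series.interpolate(method="time")

print(f"{len(anomalies)} anomalous samples flagged, "
      f"{int(series.isna().sum())} missing samples imputed")
```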

And so if you combine all of that in
a good way, in a smart way, you can

work around many of the challenges
I just mentioned and make

sure that your data quality solution
actually avoids the headaches and the false

alarms that people are often plagued with.

Denis: Yeah, that's a fair point.

I think it's very interesting to see
the problem decomposed into a technical

aspect and a non-technical aspect.

One thing I'm wondering about, and that I
struggle with in my own projects, is

how we actually deal with bad quality.

Do we block this data
from entering the system,

so we keep the system clean, or do
we write it anyway

and then try to fix it ad hoc?

I think the latter approach
is challenging because you just

have a lot of work afterwards.

But the first approach is
challenging because you're missing data.

Thomas: We always recommend storing
the raw data without altering it first.

So that's one part of the answer.

It doesn't mean you need to use the raw
data without altering it, but

just to have that system of record.

Because as soon as you start
altering data, especially if

you're doing it in more complex
ways, that might be tricky.

So we like having the raw data stored.

Of course that doesn't mean,
again, that you are feeding
the raw data to your use cases.

You can have derived
tiers of data where cleaning

operations are applied to the data.

In that respect, different
companies and different use cases

have, in our experience, required
different architectures for that.

We've also worked on facilitating
these different architectures in

the work we've done with Timeseer.

For instance, sometimes it's useful
to have, let's say, a bronze,

silver, gold medallion architecture.

Oftentimes it's not as straightforward
as what I'm sketching here, but still,

with different tiers of data quality,
you maybe have data copied from

the raw tier into the silver tier, but with
some columns additionally added to it

that carry quality context, maybe
validation, maybe repair context, that

you then fill in as repairs start
happening to the data, asynchronously.

Not necessarily in the pipeline, but
happening in a delayed fashion.

Then you can already tap the raw
data, but you would see by this flag,

which is still zero, that it has not
been repaired yet, or that it has

not been quality assessed yet, or
that it has not been validated yet.

That is an approach you might want to
take if you cannot tolerate the delays

that might come with complicated repairs or
with human validation or things like that.
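
(A minimal sketch of the non-blocking pattern described here: copying raw data into a derived tier with extra columns that carry quality context, filled in asynchronously. The column names and flag values are assumptions for illustration.)

```python
import pandas as pd

# Raw (bronze) tier: untouched samples as they arrived.
raw = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=5, freq="1h"),
    "tag": "FT-101",
    "value": [10.0, 10.1, 250.0, 10.2, 10.1],   # one suspicious spike
})

# Silver tier: a copy of the raw data plus quality-context columns.
# Flags start at 0 ("not assessed yet") so consumers can already read the data
# and see that assessment or repair simply hasn't happened for those rows.
silver = raw.copy()
silver["quality_flag"] = 0        # 0 = not assessed, 1 = ok, 2 = suspect
silver["repaired_value"] = pd.NA  # filled in later; raw value is never overwritten
silver["validated"] = False

# Later, an asynchronous quality job assesses and repairs rows.
suspect = silver["value"] > 100.0             # a stand-in for a real spike check
silver.loc[suspect, "quality_flag"] = 2
silver.loc[~suspect, "quality_flag"] = 1
silver.loc[suspect, "repaired_value"] = silver["value"].median()

print(silver)
```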

Sometimes people want it to be blocking
and want actually to have the data

not appear in the next layer before
the quality is being guaranteed.

So that depends a little bit, I would
say, from use case to use case: on

the criticality of what you do with
the data and how critical delay is versus

data quality, because that is a trade-off
you sometimes need to make for those

complicated, delayed business processes
such as human validation, for instance.

Denis: Yeah, that makes sense to me.

I quite like the careful approach
of first saving it, and then we

can always modify it if need be.

Maybe before we move to the conclusion
of this episode, we can share a

couple of stories from the practice
of things we've seen in industry.

And I can start to give you
an example from my work.

We mentioned at the beginning
of this episode that we

often work at the end of the pipeline.

We want to do some analysis
or machine learning.

We need the data, but then the
problem occurs much further upstream.

And I have an example of
this happening in practice.

I had to analyze data from a
cold mill, and this data was saved

by a historian on a disk on
some Windows server.

Turns out that disk had filled up two
years earlier, and for the last two

years no one had replaced the HDD.

So data was never written.

So we didn't have the data
for the last two years.

And 'cause no one was looking for it at
the end of the pipeline, no one noticed.

Did you have any stories like this?

Thomas: Yes, absolutely.

That story you just mentioned
makes me think of a case where we also had

a data historian and there were hundreds
of measurements that no longer seemed to

receive updates, but still the
administrator was afraid to

remove them because he had no idea
whether those measurements were

still used somewhere, someplace.

So as a result, the company just
decided: okay, let's just keep

paying for these couple hundred tags
because we don't dare to touch them.

It's the opposite of what you experienced,
but it's a related example.

So yeah, we've seen many things
go wrong over the past years

working with customers.

In utilities, for instance, we've
seen counters that are supposed

to count consumption counting
downwards and then still being

used in billing processes.

We've seen duplicate values affecting
the inputs of a prediction model that

fed into KPIs on a performance dashboard,
and the KPIs were off because of

the duplicate values in the inputs.

People only realized quite late
what was happening, because you need

to have serious deviations sometimes
before people actually dare to challenge

values on a dashboard like that.

We've seen board reports on environmental
numbers being wrong; I'm not

sure anyone got fired, but people at
least got in trouble because of that.

We've seen incorrect setpoints being
provided by an optimization model

due to sensor drift, for instance.

Because the sensors were starting to
drift and no one really noticed,

the setpoints were actually wrong.

That same optimization
model also went down a couple of times

because an input was no longer
available, or someone changed

its name and it just went down.

So those things happen if you
don't manage your data contracts.

There's another case that might be
interesting to mention, because I

found it striking and I think it's
quite common in many process companies.

You have an E&I department, and that
E&I department knows a lot about

the reliability of the sensors,
basically because they have

mounted many of them.

In one case I remember talking to
an E&I engineer, and he indicated that

they had a specific flow sensor, and he
was saying, yeah, okay, that deviates

by about 10% because of a mounting error;
they had made a mistake mounting it.

When I asked whether anyone else knew
about this, the answer was: I

guess they know it by now, or they
have probably realized it when

dealing with it in the control room.

But there was no structural way of
transferring that information to

anyone else in the organization.

There are a gazillion problems
like this we could talk about.

But yeah, people tend to trust data.

Especially if they're
not close to the process.

If you have an operator in a
control room, he knows how to

judge the data that he sees.

And sometimes he knows this cannot be
trusted and is probably not right,

and he corrects for it himself.

For some of those problems,
workarounds have been applied

more structurally there as well.

But as soon as the data travels
a bit further from where it

originates, that context gets lost
completely and people just look at

the data and take it for granted.

Denis: Yeah, I agree.

I often manage a Databricks
environment where people ask me every day:

where do I find X or Y, or what's the name
of this particular series in that system?

So there's always a challenge when
you move data away from its context.

Well, Thomas, thanks.

I think you dropped a ton
of value for this podcast.

I certainly learned a lot of things. I quickly
want to recap some things I learned,

and maybe you can add some of your own.

I really like the way that we
separated the fact that various

companies have a different maturity.

I can imagine different people
listening will think, oh, we are

only that far in the journey.

So I think everyone will take
away what's relevant to that

particular location on the spectrum.

The second thing that I think we learned
is that time series are not easy.

It's never just moving data from point A
to point B.

There are always more problems to solve.

I also really like the proactive
approach of trying to fix data at the

source, even though you spot all
the problems at the end of the pipeline.

I also like the fact
that we split the problem into a

non-technical aspect, meaning people and
the organization, the whole aspect of

data governance and the need for data stewards;
we don't need to hire them, it can be

someone who is already at your company.

That's the whole area of DataOps.

And finally, we delved into the
technical problems, where I really

liked the fact that new technology, be it
AI or other things, should make at

least that part of the problem simpler.

And I also liked your examples, which I
found quite funny in a sad kind of way.

What about you?

Do you have anything to add to this list?

Any afterthoughts.

Thomas: What can I add?

Yeah, maybe reassure some of the
listeners that might not be as far along in,

let's say, their data organization and
that don't identify with terms like data

stewards or even data quality in a sense.

Many manufacturing companies
do not have real data teams.

They have typical IT staff and
then maybe business experts.

True data engineering
roles are often missing.

So yeah, I want to reassure people
that you don't necessarily need

a full-blown organizational structure
in that respect before you can start

thinking about this topic.

And the second thing: I mentioned
the ambition level of data.

You go from business insights
as a core focus, to business

optimization as a core focus, to maybe
data monetization as a core focus.

And as you move from left to right
on that spectrum, from insights

to monetization, the value generated,
or the value lost if you don't

deal with data quality, grows.

And so, depending on where you
are on the journey you will see

that topic pop up sooner or later.

Denis: Yeah, I think it's great
to mention, I mean, even if you're

not that far ahead, I think you
can still win by at least realizing

the importance of data quality.

'cause many companies are not even there.

Thomas: Indeed.

It's also about sensor and asset
integrity at the lower levels.

We all like to jump to doing
big AI control stuff and

autonomous factories and so on.

But I'm surprised by how much work
there still is in many companies on

the ground level: the quality of
the data that comes from the sensors,

the sanity of your base control layer.

Is that working correctly?

And some basic assets, how reliable are
they at the lowest level? Let's not forget

to focus on that in addition to those
bold initiatives of doing autonomous

factories and so on, because they will all
require that that layer is reliable.

Denis: Yeah, that's a fair point
and could even act as a message

of hope for those companies that
don't have a data department.

At the end of the day, monitoring
your assets and ensuring that

they all work correctly and are
observed is something that you can

do with the people you have today.

Right.

Thomas: Absolutely

Denis: Great.

Excellent.

Thomas, is there anything else
people can find out about you, your

work, or Timeseer? How can they
find out more about those things?

Obviously I will link
information in the show notes.

Thomas: I think that might
be the best way. You can

obviously reach out on LinkedIn.

I'll make sure that my LinkedIn
contact details and the ones of

the company are in the notes.

That's probably the best
way to get in touch, I think.

If you want to talk to me you
can also email to Timeseer or

just connect to me on LinkedIn.

I'm quite accessible there.

If it's a topic that's close to your heart
or you just want to learn more or just

have a chat, I'm always open to that.

Denis: Excellent.

In that sense, Thomas, thank you very much,
and thank you also to the listener for listening

to the second episode of the
Industrial Data Quality Podcast, and

we'll see you again in two weeks.

Thank you, Thomas.

Thomas: Bye.

Denis: Bye-bye.
