
[{"content":" Motivations # It\u0026rsquo;s extremely easy to confuse people with subject matter expertise with people who really only have business process expertise. In other words, it\u0026rsquo;s easy for a person who knows nothing about hammers to mistake a hammer salesman for a carpenter. This type of mistake is common, and is detrimental to data and software projects.\nThere were two events that led to me writing this article. The idea for it came to me after a layoff at a previous job. Layoffs are almost always nonsensical when you look at the individual people who get laid off. But if you abstract up from trying to figure out why one person was laid off and not another, you\u0026rsquo;ll see some trends when you look at the people who survived and the job function they were performing.\nFurther motivation to write this came from a post on a popular tech community, where somebody said \u0026ldquo;I work as a developer for a software company in a particular industry. One major frustration I have is we don\u0026rsquo;t have anybody at this company who seems to know much about the industry we are building software for. So I don\u0026rsquo;t get real feedback on how useful things are. I wish we would stop hiring so many engineers and hire some industry experts.\u0026rdquo; I think this is a great idea, but it\u0026rsquo;s a bit naive about how hard it is to find an industry expert who can actually give you useful information, as opposed to giving you general industry knowledge, which is hard to build any specific solutions for.\nThis post is loosely tied to my previous one about Business Domain Expertise. You don\u0026rsquo;t have to read it to understand this article, but parts of that one lead into this one.\nThe Birth of a Business Process Expert # Imagine one day you start a new a job at a residential construction company supervising the process of building houses. When a home buyer wants to build a house, there are various steps that need to happen. The buyer select a plan, a blueprint is created, the materials are ordered, the different parts of the house are built (e.g. first the foundation, then the frame, then the pipes/wires, and etc), and so on. For every step of the way, there are various actions that need to be checked for completion and signed off on. Your job is not to own any individual part of this, but to ensure the checks are done, signatures are completed, and the right people are involved in every conversation and any escalations. There are a lot of teams and vendors involved in building a house, so it\u0026rsquo;s essential somebody is there to enforce the process around this, and to ensure big issues are properly tracked and do not fall between the cracks in process steps.\nYou do this for a few years and your reputation grows as an employee. You\u0026rsquo;re seen an expert in this business process, and you\u0026rsquo;re the go-to person to get issues resolved when there\u0026rsquo;s a problem. If somebody asks you about building houses, you\u0026rsquo;re able to eloquently talk about the whole process, pulling in things you\u0026rsquo;ve learned second or third hand about each individual step.\nOne day somebody reaches out to you from a software company. They build software for the residential construction industry, and they are hiring industry experts in residential construction to help guide them in product decisions and customer interactions. 
During your interviews, you meet a lot of people with a software background, none of whom have any background in construction. When they ask you about your construction background, you describe the steps of the business process you own in a lot of detail. Throughout these interviews, you\u0026rsquo;re clearly perceived as an expert on residential construction and you get the job. At this software company, you\u0026rsquo;re introduced as the residential construction industry expert, and your job title reflects this.\nThe only problem is, you\u0026rsquo;re not really an expert on residential construction. If somebody asks you how construction workers use blueprint software, you don\u0026rsquo;t really know. If somebody asks for detailed explanations on how vendors are paid, you can\u0026rsquo;t really provide much detail. If you\u0026rsquo;re asked how materials are ordered and how to simplify that particular process, you can\u0026rsquo;t really provide specific answers. You\u0026rsquo;re an expert in a particular business process in the residential construction industry, not really an expert in a specific part of the process.\nIf this software company is building a tool to streamline (e.g. to digitize the steps typically done on paper) the process you owned when you worked at the construction company, then you are the right expert. However, if the software company is building a tool for a specific part of the process, you can only provide second or third hand knowledge. You\u0026rsquo;re going to get questions outside of your expertise fairly quickly. If the software company wants to succeed, both sides need to recognize different knowledge is needed than what you can provide. They need different expertise.\nTo explain what I mean by this, let\u0026rsquo;s go back to the carpenter example. Modern construction uses prefabricated pieces, and the carpenter is responsible for assembly and making any necessary adjustments to install them. Let\u0026rsquo;s say under normal circumstances the time for installing one section is 3 hours. One day, however, a carpenter finds a discrepancy between the plans and the prefabricated materials, resulting in the work taking 7 hours. The first time this happens they just make the change, but then they start seeing this issue at other houses.\nThey raise this issue with their site manager. The site manager sees other carpenters at other houses are also having similar issues. The carpenters use their experience to provide ideas of which previous process steps could be the cause of the issue. They escalate to you, the process owner. You bring together all the relevant teams from previous steps (e.g. a different set of carpenters, the architects, the designers of the prefabricated pieces, etc) of the process and they figure out where the issue started. You then come up with a plan and timeline to fix the issue, you assign tasks to different people to ensure the fix is completed, and you communicate these changes to relevant teams.\nIn this scenario, your job as the business process owner is to make sure the right teams are involved in the discussion to find the cause of the issue, and to ensure a proper plan is put into place to execute the fix. It\u0026rsquo;s also your job to own this plan, ensuring every step is completed properly. It\u0026rsquo;s not your job to actually identify the cause and create a solution. 
The technical teams are responsible for determining which step in the overall construction process is causing the issue, and they need to provide the solution as well.\nSince you own this overall process, you are the only one who understands all the steps and the overall goals. But you don\u0026rsquo;t understand the fine details of every process step. The details of each process step are only understood by the technicians in each step. This is what makes you the business process expert, and what makes the carpenters the subject matter experts.\nMy point here is not to criticize or throw shade at business process expertise. I\u0026rsquo;m just trying to illustrate the difference between a process expert and somebody who is an expert at a particular step of that process. When enough people are involved, business process people are needed. If you see the process as a train, somebody has to care (or be paid to care) about the train, otherwise it\u0026rsquo;s guaranteed to go off the rails.\nThe Visibility of Business Process Enforcers # Let\u0026rsquo;s tell another story. You are the CEO of a company manufacturing nuts and bolts. Your nuts and bolts are used in a lot of different products made by many customers. One day a wizard appears in your office at 8 AM and says they are going to make 75% of your employees disappear. You have until 4 PM to choose who goes.\nAs the CEO, how do you prioritize who goes? If you want your company to survive, your goal should be to ensure the business keeps running and your customers get their nuts and bolts. So anybody who\u0026rsquo;s not involved in making and getting stuff to customers is going to get cut. Teams like new customer acquisition and Research and Development can all go.\nSince you are being forced to hack off parts of your company with a blunt object, something occurs to you. You might end up cutting out an essential part of your company and won\u0026rsquo;t realize it until it\u0026rsquo;s too late. What you need to do is ensure the people who understand your operational processes don\u0026rsquo;t disappear. That way, if you lose an important team and part of your manufacturing and distribution process breaks, the business process person can quickly recognize this and you can try to find people to fill that broken link.\nYou reach out to all your upper level managers to figure out who the critical operational business process people are. At least in my experience, managers of managers don\u0026rsquo;t spend a lot of time reviewing the day-to-day details of what individual contributors are doing. Instead, they spend a lot of their time with process oriented people, focusing on longer term goals (e.g. is program X on track). While their job as a manager of managers might not be to enforce a business process, they certainly need to be oriented around business processes. And just like business process people, they don\u0026rsquo;t always understand the fine details of what is really done in each step of the process. As a result, they might not have the ability to accurately discern a business process expert from a person who understands a single step of the process in enough detail to rebuild it.\nThis is where you see a flaw in this selection process. When someone is trying to figure out who can rebuild a particular step of the process, they might not know enough details to really understand what and who is required to rebuild something. 
The end result is that, when identifying the experts, process oriented people tend to see other business process people as experts because they don\u0026rsquo;t have the right experience and mental models to differentiate process experts from subject matter experts.\nThe consequence of this is that while business process people can recognize when parts of the process are broken, they can\u0026rsquo;t always recognize why or how it\u0026rsquo;s broken and how to fix it. You need subject matter experts in each process step to tell you how their part of the process is broken, and why other parts of the process are making this breakage better or worse.\nWhat is the Point of This Article? # Obviously this CEO story is an exaggeration. Many companies have been in the situation where they have to fire a significant number of their staff in a short period. However, even in cases of fraud, bankruptcy, or private equity buyouts, it\u0026rsquo;s not something that happens in a day. Evil wizards probably have more dastardly things to use their magic for.\nThis article is mostly a reflection on how easy it is to misjudge expertise. There\u0026rsquo;s this concept called Gell-Mann Amnesia which presents the idea that while people think critically about a domain they understand well, as soon as they review information from a domain they do not know, they stop thinking critically and take stuff at face value.\nIf you\u0026rsquo;re trying to fix a specific, clearly identified problem, it\u0026rsquo;s easier to find the right expert to fix it. But when you have a problem which crosses multiple team boundaries, it can be hard to tell the difference between somebody who is describing symptoms and somebody who has identified the actual cause(s).\nIf you are in this situation, in the best case there is a champion who genuinely wants to own the situation and fix problems. In the common case, people complain a lot and the work to \u0026ldquo;fix\u0026rdquo; it is mostly just moving charts around until a higher power intervenes. In the worst case, the business process owner is actively making things worse because their diagnosis of the issue is incorrect and they are solving imagined problems instead of the actual ones.\nThis leads back to the beginning of the article. Visibility is important, and business process owners have a lot of visibility. They are among the few people who regularly interact with everyone, and they are supposed to be the ones cleaning up the messes and misalignments. Everyone assumes they have some larger perspective no one else has.\nThat\u0026rsquo;s a shaky assumption if you\u0026rsquo;ve been in any job for long enough (e.g. 3+ years) and seen the reality of a business process. There are plenty of projects where almost every single person, including the process owner, has changed, sometimes more than once. Everybody can see the process they have to use and maintain, but nobody, including the process owner, knows why it was built that way in the first place or why it can\u0026rsquo;t be changed now. We just assume the train driver knows exactly what they are doing and really understands why they are doing it.\nWhen people can speak authoritatively and they have visibility, we assume they understand everything they are talking about. Compared to someone who understands a single part of a process, an overall process owner seems more knowledgeable. They\u0026rsquo;ve certainly got the ears of more people.\nMaybe in the grand scheme of things it doesn\u0026rsquo;t matter. 
If you work for a 6 billion dollar company and a 10 million dollar project is going off the rails, it can make more business sense to let it derail than to waste time fixing it. Some types of friction in multi-team projects spanning multiple business orgs are unavoidable, and you can\u0026rsquo;t fix them with infinite time, money, or people.\nBut if you really want to understand if the friction in that 10 million dollar project is fixable, you need someone who genuinely understands the causes of friction. Asking the person with the most visibility simply because they have the most visibility, and letting them lean into your lack of understanding of the situation, isn\u0026rsquo;t going to give you the right answer. You need to dig deeper to understand if what they are telling you is really true. Talk to the people who deal with the pain of that friction every single day, not the people who only feel pain in their ears and ego when some status indicator on a Gantt chart is red.\n","date":"30 May 2025","externalUrl":null,"permalink":"/posts/confusing-business-process-expertise-with-subject-matter-expertise/","section":"Posts","summary":"It’s extremely easy to confuse people with subject matter expertise with people who really only have business process expertise.  In other words, it’s easy for a person who knows nothing about hammers to mistake a hammer salesman for a carpenter.  This type of mistake is common, and is detrimental to data and software projects.","title":"Confusing Business Process Expertise with Subject Matter Expertise","type":"posts"},{"content":"","date":"30 May 2025","externalUrl":null,"permalink":"/posts/","section":"Posts","summary":"","title":"Posts","type":"posts"},{"content":" Introduction # As a Data Scientist, one popular piece of career advice is that you should have business domain experience (or expertise) and not just technical skills. But what is \u0026ldquo;domain experience\u0026rdquo; exactly, and why does it help differentiate one person from another? In this post I\u0026rsquo;ll talk about the different kinds of domain expertise I\u0026rsquo;ve seen at all the different companies where I\u0026rsquo;ve worked.\nWhen writing these articles, I try not to make too many assumptions about the reader. Consequently, I\u0026rsquo;ve tried to explicitly define some terms and concepts I use to explain the different types of expertise. Readers with career experience may find these explanations superfluous, but I think it will help clarify what exactly it is I\u0026rsquo;m referring to when I use terms like \u0026ldquo;the business\u0026rdquo; and \u0026ldquo;industry.\u0026rdquo; Students, people earlier in their careers, and people buried near the bottom of a 8 layer org chart with limited exposure to the larger business, might find the explicitness useful.\nDefining \u0026ldquo;The Business\u0026rdquo; # In this article, I use terms like \u0026ldquo;the business\u0026rdquo; and \u0026ldquo;business people.\u0026rdquo; To reduce confusion, I think it\u0026rsquo;s worth the effort to clarify what I mean by the term \u0026ldquo;business.\u0026rdquo;\nImagine you have a friend who sells different colored T-shirts. The entire company is this one person, and their current business model is ordering T-shirts in bulk from a supplier and selling them to customers via a simple website. 
One day they contact you (a Data person) and say their business is growing, they\u0026rsquo;re having inventory issues, and they need a dashboard to help them better track sales and also to help identify what color and size combinations they should be ordering and how often they should be ordering them. They\u0026rsquo;ve also noticed certain colors get purchased more frequently before local events, and they need help with tracking those events.\nIn this example, all the things your friend does including sales, ordering, inventory, packaging and shipping, etc. is \u0026ldquo;the business.\u0026rdquo; And what they\u0026rsquo;ve asked you to do is to support the business by building tools to help them make better business decisions.\nOne important point to recognize is the sales data from this business does not exist in a vacuum. It\u0026rsquo;s a byproduct of the decisions and the actions of the business owner. To truly understand the data, you\u0026rsquo;ll have to talk to your friend and ask questions. For example, let\u0026rsquo;s say normally they sell a shirt for $10, but there\u0026rsquo;s one customer who purchased a lot of them for $9. This seems like it might be from a coupon or sale, but the only way to figure it out is to ask your friend.\nAs T-shirt sales grow, your friend is not able to keep up with running the different activities required for the business. So they hire an additional person to handle sales and customer service and another person to run shipping and returns. The new sales person wants to sell new colors and also trial using discounts to boost sales of older inventory. They want you to build analytics to help them understand the impact of both these changes. The shipping and returns person also wants a dashboard so they can more easily track returns and understand if there are some larger issues being missed (e.g. some issue with a particular color).\nTo provide useful analytics to these new people, you\u0026rsquo;ll need to ask them what they want. The sales person might say what they really want is to understand if discounts have too strong of an effect in pushing people to buy older inventory, resulting in lower sales of the full priced shirts. The customer service person might ask for something you didn\u0026rsquo;t even realize they cared about.\nYou should realize that you, as a Data person, are not making business decisions. Those decisions are made by people in \u0026ldquo;business\u0026rdquo; roles who support a particular business function (e.g. sales). Your job is to support them in making those decisions. The data is not your data and nothing you do as a Data person is generating data. This company\u0026rsquo;s data is a byproduct of sales activity, inventory management, and customer service.\nAs the company grows, these single person departments become multiple person departments, and they add new departments like accounting and HR. Your friend the CEO decides to print T-shirts as well, so they hire dedicated staff to manage and run printing operations. At some point, the company is so big some departments don\u0026rsquo;t clearly understand exactly what it is the other departments do, but they all generate data as part of their operations and activities.\nWhen the company was run by only your friend, their business model was simple enough for you to understand the data on your own. 
But as the company grew, you needed more insight from each department to understand the data, and to understand what a useful analysis would be for each business person/unit. It\u0026rsquo;s important to recognize that just having data is of limited value. Understanding how business people extract value from data, and use it to make decisions, defines the value of the data.\nNow that we have a better understanding of what I mean by \u0026ldquo;business\u0026rdquo;, and also what separates the activities of a data person from the activities of somebody in a business role, let\u0026rsquo;s talk about domain knowledge.\nDomain versus Business Knowledge # One other distinction worth talking about explicitly is the difference between domain knowledge and business knowledge. These terms are used interchangeably in a lot of contexts, but it can be confusing when talking about data roles because data people have skills intersecting both of these.\nI\u0026rsquo;ll use something from my professional life as an example. I have a lot of experience building data pipelines, models, and dashboards based on files and datasets from manufacturing and industrial machines. This might seem like one skill, but it\u0026rsquo;s actually two. I have technical domain knowledge about ingesting and processing machine data formats, but I also have the business industry knowledge to understand the actual data and how it\u0026rsquo;s used by people on the business side of the table.\nDomain expertise can exist separately from business knowledge. It\u0026rsquo;s entirely possible to be an expert at ingesting raw manufacturing datasets, but not really understand the business operations generating the data, or how the business uses this data to make decisions. The opposite is also true. You can be an expert on a type of machine and have the knowledge to make high value decisions based on the machine data, but not have the software domain skills to use the data in its raw form. You need somebody (e.g. a Data Engineer) to transform it into a format you find usable.\nOne of the big differences between Data Scientists/Analysts and Software Engineers is domain versus business expertise. Software Engineers are domain experts in software. They typically also have additional subdomain expertise (e.g. image processing), and they can have highly specific industry expertise (e.g. processing 3D scan data from medical devices), while having limited knowledge of how business people run their operations and make decisions. In contrast, Data Scientists/Analysts need to have knowledge of the business because what they build needs to be useful to the people who make decisions based on that data.\nThe Different Kinds of Domain Knowledge # Speaking broadly, let\u0026rsquo;s split domain knowledge into four categories: 1) technical skills, 2) industry expertise, 3) internal company knowledge, and 4) business process knowledge. It\u0026rsquo;s important to recognize that the expertise people develop in their careers is an intersection of at least two of these categories. Once you start a job, technical skills do not exist without the context of the industry the job is in, even if that industry is the software industry. Internal company knowledge and business process expertise feed each other, and they are defined by the industry they are being formed within. 
Technical skills and business process knowledge go hand-in-hand, as you can imagine the development, production, and deployment requirements of medical software are very different from those of a photo viewing app.\nThese categories are artificially separated for the sake of this article. If you are reading about one of them and you feel like the examples I\u0026rsquo;ve provided fit into more than one category, you\u0026rsquo;re probably correct.\nTechnical Skills # Technical skills are the basic \u0026ldquo;hard skills\u0026rdquo; you need for any Data Science role. There\u0026rsquo;s a lot of variance in what these skills are because every industry is different, but these are the fundamental skills required to be an individual contributor. Examples of technical skills are SQL, basic statistics, and a programming language (R/Python).\nThere are also industry specific technical skills. For example, if you are working in marketing trying to understand the impact of your advertising spending, there are specific types of models (e.g. Bayesian Regression and Causal Modeling) you need to know. If you work with sensor data, it\u0026rsquo;s useful to know about signal processing and methods to deal with sensor noise.\nTo some degree, you can learn industry specific technical skills from a textbook or course, but there is a point at which only real world experience can teach you how to properly apply these skills. The more expertise you gain applying your knowledge to real use cases, the faster and more efficient you\u0026rsquo;ll become at your job.\nHere is an example of where real-world experience enhances your technical skills. Time Series Forecasting models (e.g. ARIMA, GARCH) have been around for a long time and there is plenty of material to learn about them and when and how to use them. Picking the right parameters is a bit of an art form, and one tactic inexperienced people use is to simply brute-force all the parameters and pick the model with the smallest errors. However, there are sometimes industry or team specific guidelines for certain parameters to make things easier. Someone I know who worked in financial modeling told me people in finance typically set one particular parameter of the model to 1 because it doesn\u0026rsquo;t make much sense to set it to any other value when you try to interpret the model results. This made a lot of sense when it was explained to me, but I wouldn\u0026rsquo;t have even thought of it on my own.\nIndustry specific technical knowledge isn\u0026rsquo;t limited to statistical modeling and machine learning. Having this type of domain experience can also simplify and accelerate how you think about data schemas and data engineering. An example of this is charts. In the real world, the types of charts people use can be very domain and even industry specific. This means you might find yourself building different versions of the same chart a lot of times. For example, if your users always want comparison charts (e.g. give us weekly sales of product A vs B vs C), you\u0026rsquo;ll quickly discover BI tools work better with tall data (having a single product column and a single sales column) versus using wide data (having a separate sales column for every product), and create your schemas and pipelines accordingly. 
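\nTo make the tall-versus-wide distinction concrete, here is a minimal sketch of that reshaping step using pandas. The column names and numbers are invented for illustration and are not taken from any real schema.

```python
import pandas as pd

# Hypothetical weekly sales in "wide" form: one sales column per product.
wide = pd.DataFrame({
    "week": ["2024-01-01", "2024-01-08", "2024-01-15"],
    "product_a": [120, 135, 150],
    "product_b": [80, 95, 90],
    "product_c": [60, 70, 65],
})

# Reshape to "tall" form: a single product column and a single sales column,
# which is the layout most BI tools expect for "A vs B vs C" comparison charts.
tall = wide.melt(id_vars="week", var_name="product", value_name="sales")

print(tall)
```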
Based on what users expect, you\u0026rsquo;ll also figure out what calculations should be done in the database, which should be done in the data pipeline, and which can be done in the BI tool.\nTechnical Skills and Science Knowledge # One area I won\u0026rsquo;t cover in this article are roles where Machine Learning knowledge is secondary to a strong grasp of theoretical and applied science (e.g. physics, biochemistry, electrical engineering, quantitative and qualitative social sciences). At least from what I\u0026rsquo;ve seen, these roles don\u0026rsquo;t have a \u0026ldquo;Data Scientist\u0026rdquo; title. Instead, you\u0026rsquo;ll see titles such as \u0026ldquo;Computational Biologist\u0026rdquo; and \u0026ldquo;Signal Processing and ML Engineering\u0026rdquo; which emphasize the science or engineering focus. Even jobs with titles like \u0026ldquo;Robotics AI/ML Engineer\u0026rdquo; are organizationally much closer to research and development than they are to the business.\nThe skills required for those jobs are very different than the target audience of this article so I won\u0026rsquo;t cover that type of domain knowledge here.\nIndustry Expertise # Industry Expertise comes from understanding the industry you are in. If you are in a technical individual contributor role, what you learn about your industry will be specific to the type of business function you support. For example, if you work in a sales organization, you\u0026rsquo;re going to learn about how the sales cycle works and what really drives sales. And even in sales, you\u0026rsquo;re going to learn very different things if the product your company sells is a high-touch (e.g. selling to government) product with a long sales cycle, or a low-touch SaaS product (e.g. selling a $10/month API based product to software developers).\nIndustry Expertise has both technical and non-technical components. To better illustrate the difference, here are two examples.\nTo test pricing and the effect of product placement on store shelves, grocery stores and product manufacturers (e.g. a company making laundry detergents) run experiments in simulated and actual stores. Analytics people have the technical skills to analyze the data from these experiments. But to run the actual experiment you need people who understand the grocery business and the methodologies and processes of running these tests in the actual physical world.\nWhen you first start working in this type of job, you might have the technical (software + statistics) skills to process and analyze the data in a general sense, but you won\u0026rsquo;t have the industry domain knowledge about the processes generating that data. The more time you spend in this domain, the more exposure you\u0026rsquo;ll have to the non technical aspects. You\u0026rsquo;ll better understand what the non-technical people do, what their pain points are, and what they need to do their job more effectively. This exposure will allow you to better adjust your technical work so what you create is consumable and useful to the business people.\nFor the second example, I\u0026rsquo;ll take my own experience. I\u0026rsquo;ve worked a lot with teams who operate and service industrial and commercial machines. From a technical standpoint, I\u0026rsquo;ve learned a lot about machine and service data and the types of methods and models used to analyze this data. 
From an industry perspective, I\u0026rsquo;ve learned a lot about the KPIs and insights business users are looking for.\nThese insights and KPIs also vary between service teams. Some machines run in remote areas or underground, and there\u0026rsquo;s no real way to perform physical maintenance on them. So the goal is to avoid a catastrophic failure because it\u0026rsquo;s very difficult to replace a dead machine. The models used and KPIs for this scenario might be very different from machines which run in a chemical processing facility, where it\u0026rsquo;s easier to access the machine. In this latter case, using predictive maintenance to maximize machine runtime is more of a concern than a catastrophic loss.\nSomething I\u0026rsquo;ve found interesting is it\u0026rsquo;s entirely possible to have an expert level understanding of important metrics while not really knowing how to recreate those metrics from raw data. The opposite is also true. There are people who are experts on the raw data and transforming it into KPIs, but they don\u0026rsquo;t know how to provide the right context to those KPIs to make a business decision. I\u0026rsquo;ve been in the latter position many times, and I think Data Engineers are generally the same way. I mostly write this paragraph so people realize industry \u0026ldquo;expertise\u0026rdquo; isn\u0026rsquo;t a binary category (e.g. you are an expert or you aren\u0026rsquo;t), but it\u0026rsquo;s sometimes a seemingly contradictory sliding scale. You can be considered an expert at something without knowing every level of it.\nFor Data Scientists and other Data roles, one challenge is how much industry knowledge you learn is highly dependent on how much exposure you have to the business side and customers. It\u0026rsquo;s entirely possible to be buried so far down the org chart you don\u0026rsquo;t understand your industry very well at all. I\u0026rsquo;ve seen this happen more frequently to people who are in an Engineering (e.g. Machine Learning Engineers) org, where they are managed by a user story and subtasks, as opposed to a role where your projects are primarily led by people on the business side.\nInternal Company Knowledge # After you\u0026rsquo;ve been in a job for a while, you\u0026rsquo;ll start acquiring internal knowledge. It\u0026rsquo;s not always easy to see this, but internal knowledge is a superpower. There\u0026rsquo;s this concept of a 10X engineer, and I think it\u0026rsquo;s easy to assume a 10X person has to be a genius like Ramanujan, Mozart, Carmack, etc, therefore you can\u0026rsquo;t be a 10X person. But many 10X people become that way after mastering internal knowledge and systems, or having a superior understanding about the way different parts of internal systems overlap. If your specialized knowledge of 3 internal systems means you can identify an issue in an hour, whereas it would take most other people days, you\u0026rsquo;re a 10X person.\nIn my view, internal company knowledge breaks down into four broad categories a) knowing people b) knowing where to find things c) knowing your company\u0026rsquo;s data and technical systems, and d) knowing what drives the business and how to measure it.\nKnowing People # Knowing people is really about knowing the right person to ask when you need help with something. 
That \u0026ldquo;something\u0026rdquo; could be data access, help with understanding how some business goal is measured, technical expertise, or even knowing the person who seems to always know the right person/team to ask.\nThe longer you are at a job, the more the \u0026ldquo;help\u0026rdquo; you need from others shifts from a concrete thing (e.g. who can give me access to this) to soft-skill topics like \u0026ldquo;How can I motivate you to prioritize this fix?\u0026rdquo; Knowing the right people to talk to is really important to doing your job effectively.\nKnowing Where to Find Things # \u0026ldquo;Things\u0026rdquo; in companies are documentation, wikis, SharePoint sites, customer service tickets, manuals, release notes, etc. LLMs and techniques like RAG will make this easier, but internal information is usually spread out all over the place. And every org in every company does things differently, so it takes time to get a good mental model of where answers might be when you have a question.\nKnowing the right people is part of this because the right person can quickly tell you where to find something. I have a very distinct memory from one of my jobs where I was fumbling around trying to figure out how to integrate two pieces of software, and somebody I was working with said \u0026ldquo;Well I know at some point they announced our product supports that integration, so it\u0026rsquo;s probably in this manual\u0026rdquo; and sure enough, it was fully documented in that exact manual. My prognosis on what we were trying to do went from impossible to easy in about 2 minutes.\nKnowing Your Data Systems # Knowing your company\u0026rsquo;s data is a really broad topic, and it\u0026rsquo;s going to differ from company to company, and even from org to org (e.g. the data systems used by the Sales team are going to be very different from what the Engineering team might use). Examples of knowing your data include:\nKnowing the right database and tables to use for the questions you are trying to answer\nKnowing the right columns to use in an individual table\nKnowing the right tables to join together, and the right way to join them\nKnowing how to join tables across databases so they align properly\nKnowing the right data tools to use for whatever task you are trying to do\nKnowing the recommended way to connect together all the data tools you are trying to use. This assumes there are internal best practices around this. There are many places where the data infrastructure is a hodgepodge of tools and people just pick the ones they prefer. It can lead to a very brittle data stack.\nKnowing how to deal with all the brittle and weak links in the company\u0026rsquo;s data stack. For example, you might learn a certain database is extremely slow during working hours, so it\u0026rsquo;s better to export the data in the middle of the night and then use the exports during the day.\nWhen you start any data role, it can be quite unsettling and overwhelming to open a database and see hundreds of tables and columns with completely undecipherable names. This compounds if you have multiple databases with different schemas, and you have no idea how any of this can possibly fit together. Over time you\u0026rsquo;ll figure all this out and one day you\u0026rsquo;ll be the expert people go to when they are trying to find the right data to answer a question.\nOne quirk with data expertise comes from how common it is nowadays for people to change jobs, even if it\u0026rsquo;s an internal change. 
It\u0026rsquo;s very possible that one year after you start a job, you are already perceived as an \u0026ldquo;expert\u0026rdquo; for a particular set of data, even if you still feel like a novice. It\u0026rsquo;s important to recognize when you are an expert compared to your peers and embrace that. You need to be able to understand your value.\nIn a technical data role, knowing your data usually means you know about your technical infrastructure as well. This is because the technical work you do (e.g. write data pipelines in PySpark) is usually intertwined with the technical infrastructure you use to perform that work. So unless your job is 100% writing SQL in a web browser, learning about your data systems means you learn about the surrounding technical infrastructure as well.\nKnowing the Business # The last category is knowing the business. This is a very fuzzy concept because \u0026ldquo;business\u0026rdquo; knowledge overlaps with industry knowledge, and internal knowledge is knowledge about the particular business you work for. For example, the KPIs and practices used by a particular business function (e.g. Manufacturing) are typically based on industry standard KPIs and practices.\nKnowing the business is really about understanding what motivates the business people you work with, what they are looking for, and what drives the business. To understand that well, it\u0026rsquo;s very helpful to understand business processes. But before we get into business processes, let\u0026rsquo;s first talk about how technical and data knowledge are relevant here.\nWhen business people ask a Data Scientist or another Data person for help, there\u0026rsquo;s usually some metric or KPI they are trying to affect or improve. When you work with a business person to understand how that metric/KPI can be moved, you\u0026rsquo;ll discover what business activities move the metric and how these activities move it. And as you start understanding those activities, you\u0026rsquo;ll understand the business you are working in.\nThis business understanding is really important because it allows you to discover what business activities you can influence. For example, maybe a KPI is based on the combination of four separate, but intertwined, business activities. Your business partner says one of those activities cannot be changed, and you discover you don\u0026rsquo;t have the data to really measure or influence one of the other activities. This leaves you with the other two, and you now know where to focus your efforts.\nTo better illustrate the idea of only focusing on business activities where you can actually make an impact, here are two examples. First, let\u0026rsquo;s look at the commercial service industry. This is where some equipment breaks and somebody goes out to fix it. In theory, the various activities (e.g. speaking with a customer about an issue, assigning somebody to fix it, ordering parts, somebody actually going out to fix it) are common across the service industry. But individual companies will make decisions about how they go about these activities based on their own constraints. For example, your service department might only accept service calls with a minimum of 4 hours of work. This could mean you bundle together a lot of small (e.g. 30 minute fixes) projects into one customer visit, but you won\u0026rsquo;t go to a customer site for a single small project. If you\u0026rsquo;re developing models, you might realize it\u0026rsquo;s not worth the effort to develop certain types of models (e.g. 
predicting a small fix) because the service department will refuse to act on them. If you know this before you build the model and you focus your work accordingly, that\u0026rsquo;s an example of using business knowledge to make better decisions.\nAnother example is marketing activities. Marketing, in a general sense, has certain activities performed by all companies. However, different marketing departments can choose to outsource certain parts of what they are doing, and this will influence how much analytics built by internal teams can really influence and drive decisions. I used to work for a large company with many brands. Each brand had its own business unit, and the one I worked for had around 5% of the total sales of the company. This directly impacted the size of our marketing department, and what activities could be done by our team versus asking for support from another brand or giving it to an external vendor. One of these activities was Marketing Mix Modeling (MMM). In theory, our small Data Science team could have tried to do it. But it was more pragmatic to give this to an external vendor so we could focus on topics with a shorter time to impact.\nLet\u0026rsquo;s tie this idea of business activities back to KPIs and metrics. In the service example, the 4 hour minimum is not a completely arbitrary decision. There is probably a metric to calculate the profitability of service visits, and service visits of less than 4 hours drive that metric below a target. This profitability metric may be in contrast to another customer satisfaction metric, which is negatively affected by this 4 hour minimum requirement, but not enough to change 4 to another number. As a Data Scientist, you should understand the activities which drive these metrics. The business people you support are evaluated based on these metrics, so it\u0026rsquo;s your job to work with them to identify how your work can improve what they do.\nIn the marketing mix example, I said it was more pragmatic to give it to an external vendor. Let me explain why. When people make expensive purchasing decisions (e.g. a car, kitchen appliances, an expensive purse), brand perception and recall matter. Even if you have a great product, people will be hesitant to buy it if they don\u0026rsquo;t know or trust your brand. When they are cross-shopping and thinking of what brands to check, people may not even remember your brand or product exists.\nThere are industry standard KPIs for brand perception. (Google search for Brand KPIs.) Marketing departments strive to improve their brand\u0026rsquo;s ratings on these KPIs, but marketing alone is not enough. Brand perception is built over time, and it requires reliable products with the perception of high quality. Customer service is important as well. If your brand has no reputation or a poor one, it takes years of executing well on all fronts to improve the brand KPIs. Brand perception has a direct effect on Marketing Mix Modeling. If nobody trusts your brand, throwing money at certain advertising channels isn\u0026rsquo;t going to meaningfully improve sales or brand KPIs. You can\u0026rsquo;t optimize your way out of a mediocre reputation.\nA small Marketing org with a small Data Science team is unlikely to be able to move those brand KPIs. It\u0026rsquo;s a long term effort by the entire company. 
Even though MMM is very interesting and it\u0026rsquo;s frustrating to give it to an external team, there are other marketing KPIs with a shorter time horizon where you can demonstrate improvement. It\u0026rsquo;s only by understanding the company\u0026rsquo;s marketing business that you, as a Data Scientist, are able to realize you can have more of an impact by focusing on other problems, rather than trying to optimize advertising spend.\nEvery company and organizational unit is going to have different business goals, KPIs, and metrics. Sometimes the metrics between orgs in the same company are contradictory! As a data person, knowing what drives these measures gives you a unique insight into what the business teams can and cannot do. This, in turn, gives you the small superpower of being able to quickly diagnose what business people really want (what needle are they trying to move?) out of a use case, and determine the feasibility.\nBusiness Process Knowledge # Any sort of business activity (e.g. marketing, sales, engineering, manufacturing) will be composed of many individual steps. Established companies are going to have business processes for performing these steps, and for coordinating between different teams and organizations. As a Data Scientist, you\u0026rsquo;re going to have to follow these processes. Depending on how your Data Science function operates with other teams, you\u0026rsquo;ll probably also be involved in improving existing processes or creating new processes. Having a process helps set expectations and timelines, it protects the organization from focusing on low value work, and it reduces the friction between the Data Science team and the other teams involved in projects.\nIn my view, there are two broad categories of business processes. The first type is based on the organizational structure of the company, and it\u0026rsquo;s largely internal knowledge. The second type is organized around the industry forces the process is operating in. Many processes are a hybrid of both.\nA common example of the first type is getting approval for a new laptop or monitor. The complexity of this simple sounding action is going to vary a lot based on company guidelines and processes. Another example is choosing between project management frameworks such as Scrum, Kanban, Scaled Agile, etc. In theory people pick the one best suited for the product they are developing. In practice, internal organizational structures and boundaries define which one is used, dictating a lot of how you work.\nAn example of the second type is when a company is making products for the medical industry. Typically, there are regulatory requirements influencing the process for releasing models to production. The closer the model is to influencing decisions made about patients, the more the regulatory process is going to drive the model testing, validation, and release process. Considerations of legal consequences will influence processes as well.\nThe type of customers being sold to also influences business processes. If you sell a software product to other businesses, your process around testing and deploying models is going to be different than if the company you work for has a \u0026ldquo;free\u0026rdquo; software product primarily funded by advertising. Processes will be different between a hardware and software product as well, especially if you are comparing hardware that should never get things wrong (e.g. 
controls on an airplane) versus a basic physical fitness tracker.\nSpeaking more specifically about Data Science, a concrete example of a business process is how you vet and triage requests for predictive models from the business. If you are a small data team and you have a lot of requests for predictive models from different teams, it\u0026rsquo;s useful to have an established process for how the business teams can propose projects and provide a financial justification for their requests, and also define what they need to provide in terms of support (e.g. they need to provide a business domain expert). Having this type of process ensures everybody, inside and outside the Data Science team, is willing to invest their time and money into a project. It helps to keep things fair.\nOnce a request passes this approval process and a Data Scientist starts working on it, it\u0026rsquo;s useful to have some process to continuously validate the feasibility of the project and have criteria to pause it. This is so you don\u0026rsquo;t waste time trying to get something to production, when it was clear 3 weeks in you didn\u0026rsquo;t have the data to support it.\nIn addition to processes about how you work with the business, there are technical processes and best practices as well. These will dictate how you develop, test, deploy, and release your code. This includes things like how to write documentation, who has to sign off on documentation, etc.\nLet\u0026rsquo;s wrap up this section by looking at processes from a day-to-day job perspective. At a high level, there are three types of activities you will do. You will create business processes, you will enforce business processes, or you will follow business processes. Since many Data Science teams are part of supporting a larger business activity, they are always doing the last one. In smaller teams, you should be doing the other two activities as well. I\u0026rsquo;ll talk more about this later in the general career advice section.\nThe Impact on Your Career # The Intersection of Technical Skills and Industry Knowledge # I don\u0026rsquo;t think I need to explain why having technical skills is important to being successful in a Data Science role, or Data roles in general. So instead of having a separate section just on technical skills, let\u0026rsquo;s talk about the intersection with industry knowledge.\nOne thing I\u0026rsquo;ve seen a lot is people refusing to learn statistical or ML techniques specific to the industry they\u0026rsquo;re working in. For example, people working on forecasting problems who refuse to learn time series models (e.g. ARIMA) because they can get \u0026ldquo;acceptable\u0026rdquo; results from a more general purpose ML or Deep Learning model.\nI do understand that in many Data Science jobs, nobody cares how you did it, they just want you to do it. So it\u0026rsquo;s easy to feel like you have no time or motivation to use a model which requires more manual tuning. But I would really encourage you to learn about the industry specific techniques, because it\u0026rsquo;ll teach you a lot about how these problems are formulated and how people think about them. It\u0026rsquo;ll also give you a better vocabulary to use when communicating to non-Data Scientists about topics like risk and how much you can trust your models.\nI\u0026rsquo;d also encourage you to think this way about industry specific technology products that you have to interact with, but not necessarily use directly. An example of this is products from SAP. 
SAP products are used a lot in manufacturing and other service oriented environments. As a Data Scientist, you might never query an SAP system directly, but might be ingesting SAP data into your project from a data warehouse. This is true of many other proprietary and open-source products as well. They are part of the software stack somewhere that generates data you use, but you never interact with them.\nYou don\u0026rsquo;t have to learn to use or administer those products, or do any formal training. But I\u0026rsquo;d really recommend that if you have the opportunity, you have somebody walk you through the process of how that product is used as part of their jobs. This will give you a lot of context in understanding how people work and the business model they are working in. Technology solutions are a representation of the business activities people are performing.\nSince I already mentioned SAP, I\u0026rsquo;ll use it as an example. I\u0026rsquo;ve worked with a lot of service data. Like most data, it has a lot of quirks, but I was able to work with this data for years without really understanding how it was generated. Then, during one particularly tough project, I reached out to someone and asked if they could show me the SAP ERP UI they used every day. They did me one better, and actually walked me through exactly what happens when a customer calls in an issue, when a ticket is created, and showed me the information for an existing ticket.\nIt was absolutely eye opening. Not only was I now able to understand where all the quirks I saw were coming from, I better understood the whole process of what the service folks did in their job, and how the data reflected that. I also got a lot of insight into how technicians looked at data, and how what our team gave them (e.g. predictions and documentation) could be better put into their language, and not just ours.\nAnother major benefit was allowing me to get a better idea of the service industry in general. Industry specific software systems strive to be an approximation of what people in that industry are doing. Understanding the user flow in these systems provides insight into what people really do in their jobs. Sometimes it\u0026rsquo;s a very poor approximation, but purely from an educational standpoint, it\u0026rsquo;s better than nothing.\nEvery industry or function has technology like this. Salesforce, Workday, Servicenow, Healthcare EMR systems, JIRA, physical control systems, etc, are all examples. Regardless of if people love or hate those systems, I encourage you to try and find the interest and mental justification to understand how people use them.\nHow Useful is Internal Knowledge? # Internal knowledge exists in two simultaneous contradictory states. On one hand, strong internal knowledge makes you an expert, and you are perceived as valuable. At the same time internal knowledge can feel fairly useless when you have to wrestle with seemingly arbitrary things on a daily basis. I used to work at a place with a lot of different tables across different databases, and the column names were not always informative. Documentation (if it existed at all) was in different places, and sometimes outdated. Anybody using this data had to invest a large amount of time and mental space to understand it.\nThere were frustrating days when I really questioned how much of this knowledge was useful to my career. If I changed companies, or even just changed my role into a different org, a significant portion of this knowledge would be of no value. 
At the same time, one day my manager messaged me saying something had gone wrong in production and the cause was unclear. Could I help fix it? And sure enough, my knowledge of the system meant we were able to quickly find the interaction between the data and the code causing the issue.\nWhen you\u0026rsquo;re trying to understand the value of internal knowledge, it\u0026rsquo;s important to abstract up from the particular thing you are dealing with and figure out what the larger, more general, skill is. Instead of seeing your data infrastructure as a huge mess, and perceiving your expertise in that mess as useless knowledge, understand that data infrastructure in general is a huge mess. And the more established the company, the bigger the mess is. Learning to have the patience and curiosity to navigate and untangle that mess is a valuable and transferable skill. And to learn the more general transferable skills, you need the specific experiences you have had untangling your particular data mess.\nThis is true for non-technical internal knowledge as well. Being able to recognize when it\u0026rsquo;s better to reach out for help instead of continuing to try to figure something out yourself is a skill. When you look at types of people, some people are always willing to help. Others will only help if they perceive a problem as worth their time. Some people need a managerial intervention before they even acknowledge you exist. Every company has a different culture. But unless you know the people and places to reach out to, and how and when they can help, you\u0026rsquo;ll never build the internal compass to know when to do what. Allow yourself to recognize the value of non-technical internal knowledge, and remember it takes time to learn it.\nLearn as Much Business Domain Knowledge as Possible # I hope I was able to clearly illustrate the value of learning about the business in the dedicated section under Internal Company Knowledge. Instead of reiterating this point, let me talk about one of my regrets from a career growth perspective.\nWhen I was in consulting, I met with lots of people in the business. But because I was normally heads down and under time constraints, I usually only asked enough about the business side of things to get my work done. It\u0026rsquo;s like I had access to an entire encyclopedia, but I was only focused on the pages related to my work.\nAs you progress in your career, knowing the business becomes increasingly important. If you look at technical and non-technical job ladders, interacting with people and making sure your team is pointing in the right direction becomes more important than solving a very specific technical problem. You need to be able to understand the business to know when you are pointing in the right direction.\nIf you don\u0026rsquo;t understand the business you are working in, search engines and LLMs can help you find an astonishing amount of information about your industry, and even your specific business activities and operations. I just typed in \u0026ldquo;How is an oil well drilled?\u0026rdquo; and \u0026ldquo;What data is collected when an oil well is drilled?\u0026rdquo; into an LLM. The response was fantastic, and gave me a detailed overview of drilling and drilling data. I\u0026rsquo;m not saying you should trust these answers, but it gives you a great starting point to have a conversation with a business person without having to start from zero. 
Use what you find to bootstrap your conversations and learn about the particular company and people you support.\nSurviving Business Processes # A project manager I liked working with once interrupted a meeting by saying \u0026ldquo;I just wanted to say this format wasn\u0026rsquo;t my idea, and I proposed something else. But it\u0026rsquo;s what we have to work with.\u0026rdquo; That sums up a lot of the business processes you\u0026rsquo;ll see as a Data Scientist. Learn them, ask questions if needed, and try to follow the process without getting too frustrated.\nThat being said, don\u0026rsquo;t automatically become a victim of business processes. The world of Data Science is very new, and many people don\u0026rsquo;t really have the knowledge to realize when some part of a process just isn\u0026rsquo;t working well for a Data Science project. This is because they are applying incompatible ideas from other domains to data processes.\nYou may not like to do this, but sometimes to fix a part of the process, you have to understand all the parts of the process and where they came from. This allows you to get the right context, and then figure out the best way to explain how broken it is and demonstrate possible ways to fix it. Don\u0026rsquo;t just keep complaining about how things make no sense. Asking questions and constructively pointing out problems will help good people realize the process no longer serves the intended purpose.\nIt\u0026rsquo;s astonishing how many times the answer to \u0026ldquo;Why do things this way?\u0026rdquo; turns out to be that nobody seems to know why, and the person who came up with this process left years ago. Those answers mean there is a possibility to change things because nobody has a strong reason to block you outside of \u0026ldquo;We\u0026rsquo;ve always done it this way.\u0026rdquo; It also means if you have to make a change first and ask for forgiveness later, nobody really has a strong reason to push back.\nIn the previous section about this topic, I mentioned that in addition to following processes, you would be creating and enforcing processes as well. From what I\u0026rsquo;ve seen, a Data Science team is unlikely to create a process dictating how cross functional teams work. This is usually reserved for someone with a managerial title (e.g. project/program/director). What you can and should do is modify or create sub-processes to help your data science work. Create a process, template, or rubric to validate new and ongoing projects, and for evaluating when a model is good enough to go into production. The process should have steps for your business partners so it forces them to do something and have some accountability.\nWhile these processes can slow you down, they also allow you to enforce them when you need to. There is no shortage of project ideas, and there are people who will constantly push for you to work on their pet project, even if you can\u0026rsquo;t validate the business value. Having a process means you can say \u0026ldquo;Based on this process, we can\u0026rsquo;t move forward\u0026rdquo; and point to your process as a more objective measure for saying no.\nLike many others, I find many business processes frustrating and sometimes nonsensical. This is especially true for processes created outside of your team. I do, however, realize they exist to try and prevent bigger problems. 
I think it\u0026rsquo;s worth the effort to come to terms with why business processes are needed, and to develop a proper understanding of the processes directly affecting your job. People will take you more seriously if you actually understand the processes you are a part of.\nInternal Team Dynamics - Management Issues or Process Issues? # Something I\u0026rsquo;ve purposefully not talked much about is internal processes created by the Data Science team for the Data Science team, not other teams. At least in my view, these aren\u0026rsquo;t rigid processes as much as they are best practices and checklists. If you\u0026rsquo;ve got a healthy team that works well together, you trust each other to do the right thing and learn from mistakes. If a checklist or best practice needs to be updated, you discuss it and change it without needing input from outside the team. However, if you\u0026rsquo;ve got team issues or a problematic teammate, creating rigid processes can become a dysfunctional way to deal with it, since it forces people to do things in a certain way. It also allows you to point the finger if someone doesn\u0026rsquo;t follow the \u0026ldquo;process.\u0026rdquo; Personally, I feel like these things are people management issues, not really process issues. I know I\u0026rsquo;m being optimistic here. Sometimes you\u0026rsquo;re forced to create these types of processes because politics prevents you from doing what you really need to do. Still, I don\u0026rsquo;t think these issues fit under the idea of a business process.\nNot Trying To Cover It All # I\u0026rsquo;ve written this article from the point of view of a Data Scientist. Data Science is really broad, so it\u0026rsquo;s really from the point of view of a specific type of Data Scientist. If you feel like I\u0026rsquo;ve missed the categories and domains you\u0026rsquo;re familiar with, you\u0026rsquo;re right. There\u0026rsquo;s an entire universe of work, functions, and job roles I\u0026rsquo;m not familiar with. I haven\u0026rsquo;t tried to cover it all.\n","date":"15 May 2025","externalUrl":null,"permalink":"/posts/what-is-business-domain-experience/","section":"Posts","summary":"As a Data Scientist, one popular piece of career advice is that you should have business domain experience (or expertise) and not just technical skills.  But what is ‘domain experience’ exactly, and why does it help differentiate one person from another?  In this post I’ll talk about the different kinds of domain expertise I’ve seen at all the different companies where I’ve worked.","title":"What is Business Domain Experience?","type":"posts"},{"content":" Establishing Context # When you develop solutions for predictive maintenance, one of the challenges is being caught between the people who design/build the machines, and the people who service machines. Whatever you build has to provide business value without stepping on the feet of either of those parties. In this article I\u0026rsquo;m going to try and provide insight into what it means to walk this line.\nBefore I jump into the main topic, I want to make sure readers have the right context to understand where this article is coming from. 
Let\u0026rsquo;s define the word \u0026ldquo;Engineering\u0026rdquo; as it\u0026rsquo;s used in this article, and also discuss the type of predictive maintenance use cases I\u0026rsquo;m referring to.\nFirst, let\u0026rsquo;s address the word \u0026ldquo;Engineering.\u0026rdquo; Typically when people discuss predictive maintenance, they\u0026rsquo;re referencing a piece of hardware. This can be a piece of manufacturing equipment, a piece of medical equipment, etc. It consists of both hardware and software components. The Engineering team is responsible for designing both the hardware and software and ensuring those components work together smoothly.\nIt\u0026rsquo;s important to realize these machines need to work predictably and reliably in commercial or industrial settings. Designing them takes time, and you cannot release a product to a customer with the promise of basic features coming in the future. Any software or hardware updates to these machines need to be done with a lot of testing and rigor, and come with comprehensive training. Imagine if you operated an MRI machine at a hospital, and one day the user interface suddenly changed (due to a software update) in a way you did not understand, and you had to cancel all patient scans. The impact of any update to these types of machines can be very severe, so they have to be implemented thoughtfully.\nSome Engineering organizations don\u0026rsquo;t have these types of real-world constraints. After an overnight update, a mapping app on your phone might have a new interface. While this is annoying, you\u0026rsquo;ll probably take the time to figure it out, or you\u0026rsquo;ll use another mapping app until you figure out this new interface. In this article, I\u0026rsquo;m not referring to the Engineering orgs building this type of software product. I\u0026rsquo;m referring to Engineering teams who build commercial devices where the real-world impacts (e.g. losing millions of dollars because you have to stop a factory) are well beyond alienating your users and reducing your revenue stream.\nPeople reading this article might find what I\u0026rsquo;ve just said to be obvious. But I\u0026rsquo;m stating it explicitly to help people understand how Data Scientists interact with engineering organizations which build machines. If the Data Science team requests something (e.g. more sensor data logging), there are more considerations to be accounted for compared to a software only product. There is also a distinct line between what should be \u0026ldquo;owned\u0026rdquo; by Engineering teams versus Data Science teams because the ramifications of that ownership can be quite serious. I\u0026rsquo;ll discuss this in more detail later in the article.\nIt\u0026rsquo;s also important to understand how the Data Science projects I\u0026rsquo;m referencing in this article are conceptualized and how they show business value. In my experience, there are two types of Data Science projects. One type generates value by creating new businesses and processes, and the other generates value by optimizing existing processes.\nPredictive Maintenance typically falls into the latter type. The machines already exist, and the organizations servicing those machines already exist. This means the fundamental business around operating and servicing also already exists, so the goal is to uncover opportunities to improve existing processes.\nA contrast to this is Data Science powering a new business venture. A good example of this was Stack Overflow jobs. 
Stack Overflow is a popular website where programmers can ask and answer programming questions. At some point Stack Overflow management saw value in also offering a job site, and they created one which used machine learning to power some features. I\u0026rsquo;m not associated with Stack Overflow and have zero insight into what their decision process was. But I can only imagine product management saw they had a large userbase, the mindshare of software people, and lots of data about tech skills, so they were in a good position to create a job matching platform.\nRegardless of the type of Data Science you are doing, the ultimate constraint is always financial reality, or whether the project(s) will result in a financial benefit for the company. Compared to creating a new business, optimizing existing processes can have more constraints because you have to work within the boundaries of those processes. This is how, in predictive maintenance, the universe of problems Data Science teams can work on becomes constrained by Engineering on one side, and Service on the other.\nThe Constraints # Engineering # If we continue this idea of constraints limiting what types of use cases Data Science teams can focus on, there are two main constraints from Engineering. The first is trying to clearly understand what should be owned by Engineering, and what should be owned by the Data Science team. The second is about getting support for what data is acquired and reported by the machine.\nWho Should Own It? # When we think about ownership, it\u0026rsquo;s useful to imagine ownership as a spectrum, with the Engineering org on one end, and the Data Science team on the other. On one end is building and supporting the machine hardware and software, which is clearly Engineering. As you move along the spectrum to topics where people are trying to do things with the data being generated by machines, we see use cases where it starts to make sense for the Data Science team to own them. Keep in mind there is no clear line delineating ownership, rather a fuzzy area.\nPart of the reason this area is fuzzy is that you need to know who is responsible when something goes wrong. The person using the machine, or the customer, needs a clear escalation path when they are having an issue. Sometimes \u0026ldquo;Who is willing to take the blame\u0026rdquo; is a powerful factor in deciding ownership.\nProviding a rigorous exploration of this fuzzy area is also nearly impossible without including specific details about the industry we are talking about. Data Science teams can also be in the Engineering org, outside the Engineering org but closely aligned (e.g. like Product Management), or very independent from the Engineering org. This variation in structure only serves to increase the fuzziness. So instead of writing a thesis on it, let\u0026rsquo;s just explore it with a few examples.\nTake, for example, a diagnostic computer vision/imaging system used in a medical device like a CT Scanner. For this type of machine, the computer vision system absolutely needs to be owned by Engineering because the imaging is part of the core functionality. A major reason for this is that the ramifications to patients can be severe if the vision system is not working as expected. The CT scanner company and their Engineering organization need to stand behind the quality of the product.\nHowever, with that same machine, it\u0026rsquo;s possible a model predicting a component failure with the imaging system could be owned by a Data Science team. 
Imagine some component in the imaging system gets very hot during usage, so there is a cooling system with a fluid pump. If the pump malfunctions, the imaging system stops working, and the fix is to install a new pump. When the pump fails, the entire machine is stopped.\nThe impact of this pump failure is very different from the vision system not working as expected. If the pump fails, the machine goes down and no patients can be scanned. Compare this to the liability if a vision system malfunction leads to an incorrect diagnosis. If the model predicting a pump failure is incorrect, the cost is mostly the wasted time and materials of servicing that part. A misdiagnosis is a far worse outcome. So it makes sense for the Data Science team to own this pump failure use case instead of the Engineering team.\nAnother example can be seen with operating limits. Many machines have recommended operating limits (e.g. maximum recommended speed). Ideally, there are software or hardware controls preventing those limits from being exceeded. It\u0026rsquo;s very possible these limits are perceived as unnecessarily small, and a Data Science team is asked if they can determine a greater limit. If the Data Science team works for the same company that built the machine, I think this situation is one where they should involve Engineering because the ramifications of something going wrong can be large.\nIt\u0026rsquo;s better if the Data Science team feeds their findings back to Engineering. I have been involved in a project where we looked at historical data and provided evidence to the Engineering team about a limit being too low, resulting in too many warnings/errors occurring. It then fell on Engineering to investigate the issue and decide if they wanted to change the recommended limit. It was also their responsibility to communicate this change of machine behavior to customers.\nOne use case where this line is less clear is maintenance schedules. Many machines come with recommended service intervals, or how often you need to service various parts of the machine. This is similar to the mileage/time intervals for changing the oil in a car. One strategy companies are interested in is \u0026ldquo;Predictive Service\u0026rdquo;, where you use a data-driven approach to determine when you should perform maintenance, as opposed to just following some guidelines.\nPredictive Service is one area where the ownership is going to depend on how it\u0026rsquo;s applied. If the company that sells the machine is also selling predictive service, Engineering should own it from a customer perspective, even if the Data Science team builds and provides the actual analytics and maintenance recommendations. If a machine user (e.g. a customer) wants to optimize their maintenance intervals based on some production metric (e.g. quality of the product being manufactured), it makes sense for a team outside Engineering to own it.\nOwnership should also consider political implications. I was asked to develop a model to detect when a particular part started behaving abnormally. The part had issues quite often, and there were probably 20+ of these identical parts in each machine, with 1000+ machines globally. The impact of this abnormal behavior was quite high, and despite lots of conversations between us and Engineering, there was no near-term prioritization for a part redesign. 
So it made sense for our team to build a model.\nWhen the model detected abnormal behavior, the service technicians went out and replaced the part before it impacted the user/customer. A few months after this model went into production and became known internally, Engineering let us know they had already redesigned the part and, starting in three months, all faulty parts would be replaced by the more reliable version. What my model really did was shine a light on an issue which was already a sore spot. The model I built was taken out of production soon after. There\u0026rsquo;s more to this story, but let\u0026rsquo;s leave it there.\nIssues that should be recalls or service bulletins, issues where a model shines a light where it shouldn\u0026rsquo;t shine, issues where a model is used as a workaround for some other flaw in the device, etc, are all examples of where Data Science teams should be mindful of considering all the factors when it comes to which team should own it.\nGetting Engineering Support # To be able to successfully execute a use case, you also need support from Engineering to understand the hardware, software, and process generating the machine data you are trying to build a model on.\nI\u0026rsquo;ve written about machine data in another article, but one thing to remember is machine data typically makes no sense on its own. Looking at machine data without context is no different than if I gave you a big file of numbers and told you to \u0026ldquo;extract business value\u0026rdquo; out of it. To understand the data, you need to understand the process generating it. You need to be able to walk through what the machine is doing and how those actions are reflected in the data.\nIt\u0026rsquo;s possible for experts outside of Engineering to provide you with information about how the machine works. One example is field technicians, who can be a fantastic source, as they have been trained to service the machine and have experience using data to diagnose and fix issues. People with similar expertise, such as technical support and machine operators, can also tell you a lot about the machine.\nEven though these people can help, it\u0026rsquo;s important to remember the people in the Engineering org are the only ones who really know how the machine works. Engineering teams provide the source material used to create training and documentation, and they are the ones who answer questions when things are not publicly documented. So it\u0026rsquo;s important for them to be in the loop when you have questions about the machine.\nAnother form of support you need from Engineering is when you need new machine data that isn\u0026rsquo;t currently collected and/or exposed. This requires Engineering to add new functionality to the data collection software, and they are the only ones who can tell you if they are willing to add it and provide a timeline. Keep in mind if there are hardware or software limitations, they may not be able to add those features even if they wanted to, so things can be out of their hands as well.\nService # If Engineering is responsible for designing the machine, the Service organization is responsible for the machines actually working in the field. The Service personnel are the ones who know how the machines actually work in the real world, and how to fix problems when they occur. 
Service orgs are constrained by very tangible and non-negotiable things like the availability of spare parts and the number of people available at any given time to fix an issue.\nThese practical constraints have a direct effect on predictive maintenance use cases, because they mean models have to be pragmatically useful, and it\u0026rsquo;s not enough to create models that look good on metrics but are practically infeasible. Here are a few examples of these constraints.\nDue to personnel and logistical limitations, Service orgs might need a defined minimum (e.g. 7 days) amount of advance scheduling/lead-time to be able to respond to non-urgent issues. This means predictive models need to detect an issue and notify the service personnel with at least that minimum lead time. You might develop a high accuracy model, but if it only provides 2 days of advance warning of a part breaking, it\u0026rsquo;s not useful to the service organization because they cannot act on it.\nAnother constraint comes from the varying diagnostic skill levels of the people (service technicians) who fix the machines. Some technicians are very good at figuring out the root cause of a complex issue, and others are better at following a diagnostic rubric and adhering to clear instructions. This has a direct implication on predictive models, because you cannot expect every technician to be able to determine the cause of a message like \u0026ldquo;Model says something abnormal is occurring with system X.\u0026rdquo; This limits the use cases to issues where it\u0026rsquo;s possible to create a highly prescriptive diagnostic rubric with clear repair instructions, so the technician knows exactly what to do with a prediction. It\u0026rsquo;s possible many types of models, or even the use of machine learning in general, become impractical because of this.\nAnother issue can arise from supply chain constraints. Parts availability issues can mean certain parts are hard to get, so nobody will change them unless they are replacing a broken part that is already impacting a customer. What this means is even if a model predicting a part will break is of theoretically high business value, nobody wants to take the risk and possibly answer a question like \u0026ldquo;Why did we change this part for customer Z when it wasn\u0026rsquo;t even broken, and now this other important customer is screaming because we don\u0026rsquo;t have a part for them.\u0026rdquo; It\u0026rsquo;s challenging for people to stick their necks out and say they\u0026rsquo;ll support a predictive model in this situation.\nService Personnel Can Provide Expert Knowledge # In my experience, you need a subject matter expert from service to be able to talk to you about the machine/parts, and to walk you through what the machine is doing and how this is reflected in the actual data. These types of conversations are what allows a Data Scientist to build an appropriate data pipeline (e.g. feature engineering) and the right model evaluation metrics. To be able to have these conversations, people need to be available. Many times, the service people with the skills to help just don\u0026rsquo;t have the time to support the Data Science team because they have to do their primary job of ensuring machines are running.\nOrganizations can try to deal with this people availability issue in different ways. 
One possibility is to ask people from the Engineering org to support these discussions because Engineering staff know how the machine works, and they typically travel a lot less than service personnel, giving them more flexibility in their schedules. The challenge with this is sometimes Engineering has limited exposure to machines outside of a testing environment, and they have limited experience in crawling through data to diagnose and repair a broken machine. It\u0026rsquo;s a lot like talking with a teacher versus a practitioner. You need both, but each of them helps you to solve different types of problems.\nWhat this boils down to is certain use cases are never going to be practical because nobody from the Service org can find the time to support them. This can be extremely frustrating for Data Science teams because you simply cannot pursue some opportunities when nobody has the time to keep digging through things with you until you come up with a working solution.\nService Data # Service orgs tend to have their own data, separate from the data generated by the machine. For predictive maintenance, this includes information about every service event, such as a ticket number, customer call tracking, diagnostic and resolution steps, parts changed, repair time, what the issue was, etc. If you are building a machine learning model, these service events are the labels or \u0026ldquo;ground truth\u0026rdquo; you are trying to predict.\nUnlike machine data, service data is human generated and it\u0026rsquo;s messy. You need people from service to help you interpret it properly. This is especially important if the data is global because different regions can have different standards of how the data is supposed to be entered. I know of at least one project where someone came up with a seemingly great performing model, only to see the model performance collapse after they learned they had misinterpreted the labels in the service event data. The service experts are the only ones who can point out and explain the nuances of their service data.\nGetting Caught In the Middle # A Data Science team working in this environment has to navigate a narrow path to find use cases where they can deliver actual value to the business. And they have to do this while getting the support of Engineering and Service, even if that means the scope of what they can do is greatly reduced. Sometimes the path is so narrow, you\u0026rsquo;ll have to admit advanced analytics and machine learning isn\u0026rsquo;t even feasible.\nThis may sound frustrating, but one major benefit is it forces you to be grounded in projects with actual impact. Working with Service provides you the input to understand what really matters to them and customers, and working with Engineering provides you with the foundational knowledge needed to build a successful solution. With input from both sides, you can avoid working on science projects with zero impact.\nFor most of this article, there is the tacit assumption of the Data Science team having access to Engineering and Service. In many cases, this isn\u0026rsquo;t true. There are companies building machines, but they don\u0026rsquo;t operate or service them. And there are service organizations (think about your car mechanic) who only know as much about the actual engineering of the machine as a repair/service manual tells them. 
If you work for one of these companies, you might assume it\u0026rsquo;s easier to be a Data Scientist because you only have a constraint on one side.\nIt\u0026rsquo;s possible having a single source of feedback actually makes things harder. I once did some work for a machine builder (Engineering) who wanted an IoT-powered Analytics add-on, and they wanted some input on what data they could use to support this type of service. I asked them if we could talk about what problems their customers saw with their machines, and their answer was an honest \u0026ldquo;We don\u0026rsquo;t use the machines, we only build them. So we don\u0026rsquo;t know.\u0026rdquo; I\u0026rsquo;m sure this answer was a bit exaggerated, since they must have had some form of product management to understand customer needs and major issues. But at the end of the day, they didn\u0026rsquo;t understand the day-to-day operational challenges their customers faced well enough to provide useful input for Analytics features people would be willing to pay for. It left me in the position of politely saying I wasn\u0026rsquo;t sure what I could do for them.\nOn the other side, if the only input you have is from service, it\u0026rsquo;s entirely possible to be in a situation where you cannot understand the data available to you. And without Engineering to provide insight, sometimes it\u0026rsquo;s not even clear if the data accurately represents what is happening in the machine. One possible end result is the only feasible use cases are of low business value, and a simple rules-based approach is the only option. There is no real need for a Data Scientist in this situation.\nEverybody Faces Constraints # To wrap things up, let\u0026rsquo;s talk a bit more about constraints. I\u0026rsquo;ve talked about Data Scientists walking between Engineering and Service, but these are just another way of saying the theoretical and the practical. Data Scientists working on other use cases for machines, and in other industries, walk this same line when they try to figure out if they can build a useful model based purely on data with no understanding of the process that generated the data, or whether they need to incorporate some of the process knowledge to improve the model. When you need both, it\u0026rsquo;s always going to limit the number of possible use cases. But it has the benefit of focusing your work, so you can find things with an actual financial impact.\n","date":"1 December 2024","externalUrl":null,"permalink":"/posts/threading-the-needle-between-engineering-and-service/","section":"Posts","summary":"When you develop solutions for predictive maintenance, one of the challenges is being caught between the people who design/build the machines, and the people who service machines.  Whatever you build has to provide business value without stepping on the feet of either of those parties.  In this article I’m going to try and provide insight into what it means to walk this line.","title":"Threading the Needle between Engineering and Service","type":"posts"},{"content":" Defining Downtime # Predictive Maintenance is fundamentally about two goals - keeping your machines running, and making them run in the best way possible. In other words, focus on reducing downtime and continuously optimizing how your machines run. The concept of downtime is easy to describe and understand, but calculating it can be much more complex than people realize. 
In this post I\u0026rsquo;ll walk through calculating downtime for a factory machine, and how the complexity of the calculation reveals why using predictive models for predictive maintenance is so challenging.\nBefore discussing anything else, I\u0026rsquo;d like to make my goal for writing this post explicit. I\u0026rsquo;m not really trying to teach anybody how to calculate downtime. What I\u0026rsquo;m really trying to do is illustrate that there is a lot of complexity and uncertainty in calculating downtime, a fundamental metric for understanding how well a machine runs. And if something so simple sounding can be complex, it hints at how complex other KPIs or metrics can be. This has a direct impact on how difficult it can be to create a dataset to train a useful predictive model to address a predictive maintenance use case.\nTo provide some context, let\u0026rsquo;s first describe a concrete scenario. Imagine we have a factory which manufactures something common and tangible. Examples of this include plastic bottles, napkins, an electrical switch, etc. Feel free to pick whatever you like. In a broad sense, there are three major activities at the factory.\nReceiving, inspecting, and testing raw materials Using machines and people to turn the raw materials into the product we are manufacturing Final Inspections, packaging, and shipping of the manufactured product to customers We are mostly going to focus on activity 2.\nLet\u0026rsquo;s start by broadly defining downtime, and we will evolve this definition later. Downtime is the calculation of how much time a machine is not running. The opposite of downtime is uptime, and you can usually calculate one from the other. For example, if a machine is stopped for 4 hours in a 12 hour period, the downtime is 4 hours and the uptime is 8 hours.\nTo calculate downtime for one of our machines, what data would we need? We can start by looking at the machine log data (this is collected by the machine as it runs), which hopefully contains all the times it started and stopped. In the real world, many machines do not log this information, or they only log selective start/stop events. But for the sake of this story, let\u0026rsquo;s assume our machine collects and logs these events correctly.\nWe can calculate downtime from start and stop events by using logic such as the following.\nIf we see a start event at 11AM and a stop at 12PM, the machine ran for 60 minutes. If we see a stop event at 12PM and a start event at 1PM, the machine was stopped for 60 minutes. By simply adding up these time spans, you can determine how much time the machine was not running over the span of a day, week, month, etc. The sum total of these numbers gives you a very rudimentary downtime calculation.\nNow let\u0026rsquo;s add the first wrinkle in this calculation. The machine logs show the machine stopped at 4:30PM on Friday, and started at 8:00AM on Monday. Machine logs won\u0026rsquo;t provide a reason for this, but we can guess it. To properly calculate downtime, we need to incorporate the company calendar to account for weekends and holidays. If we are calculating downtime for different sites and geographical locations, adding in shift information may be useful as well.\nAdding this information changes our conceptualization of downtime. What we really need to know is when the machine was not running when it was supposed to be running, and not just assume a machine not running equals downtime. 
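To make the rudimentary calculation above concrete, here is a minimal sketch in Python. It assumes a clean, strictly alternating list of start/stop events with hypothetical timestamps; real machine logs are rarely this tidy.

```python
from datetime import datetime, timedelta

# Hypothetical event log: (timestamp, event) tuples, already sorted and strictly
# alternating between "start" and "stop". Real machine logs are rarely this clean.
events = [
    (datetime(2024, 6, 7, 11, 0), "start"),
    (datetime(2024, 6, 7, 12, 0), "stop"),
    (datetime(2024, 6, 7, 13, 0), "start"),
    (datetime(2024, 6, 7, 16, 30), "stop"),   # Friday 4:30 PM
    (datetime(2024, 6, 10, 8, 0), "start"),   # Monday 8:00 AM
    (datetime(2024, 6, 10, 16, 30), "stop"),
]

uptime = timedelta()
downtime = timedelta()

# Pair each event with the next one and attribute the span in between.
for (t0, state), (t1, _) in zip(events, events[1:]):
    span = t1 - t0
    if state == "start":   # machine was running until the next event
        uptime += span
    else:                  # machine was stopped until the next event
        downtime += span

print("Uptime:  ", uptime)    # 13:00:00
print("Downtime:", downtime)  # 2 days, 16:30:00 -- the whole weekend counts as downtime
```

Notice how the Friday-to-Monday gap is counted entirely as downtime, which is exactly the problem the rest of this section deals with.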
What we need to do is split our downtime calculation into planned and unplanned downtime.\nPlanned and Unplanned Downtime # Planned downtime is when a machine is stopped and the machine wasn\u0026rsquo;t supposed to be running. If we were performing a weekly downtime calculation and had included the two day weekend as downtime, we would conclude the machine was down for 28% of the time, even if it ran perfectly during the five working days. Including planned downtime in our overall downtime calculation will overstate the downtime, so we shouldn\u0026rsquo;t include it.\nUnplanned downtime is when the machine was not running when it should have been running. In my experience, this is what companies really want to know. The goal is to reduce unplanned downtime as much as possible.\nPlanned Downtime # Beyond weekends and holidays, planned downtime includes things like quarterly scheduled maintenance, rebuilds, and all the regular maintenance activities occurring on a known cadence.\nPlanned downtime also includes events that are a little more squishy, when it\u0026rsquo;s harder to track exactly when the machine started and stopped. This includes stoppages when materials need to be replenished, when the person operating the machine changes, some quick maintenance (e.g. adding some fluid), and something called changeover time. Changeover time occurs when you do something like changing a die/mold in a machine. If these events aren\u0026rsquo;t included in any data sources, assumptions need to be made about whether these stops are planned or unplanned. I\u0026rsquo;ll discuss this more later.\nElaborating our Definition of Downtime # We can update our definition of downtime now. Downtime is a calculation of how much time a machine is not running when it should be running. In more formal terms, Downtime is how much unplanned time the machine has not been running. At least in theory, Downtime can also be calculated by taking the total time the machine was not running and subtracting the planned downtime.\nBased on what we\u0026rsquo;ve stated already, the data sources needed to calculate Downtime are:\nMachine Data Logs for Start/Stop times Company Calendars (Oracle/SAP/Spreadsheets/etc. Hopefully not paper) Shift Schedules (ERP and other systems. Hopefully not paper) A source containing \u0026ldquo;Squishy\u0026rdquo; stops. For example, the Machine Data Logs may record when the machine operator changed (e.g. a new operator logs into the machine), the machine was stopped for a die change, etc. This might be in a different log file than the start/stop log. Some reconciliation needs to be done between the different log files to correctly calculate the stop times. Using the data listed above, we can figure out the total downtime (using the machine data logs) and the planned downtime (using the other data sources). We then get the total unplanned downtime by subtracting the latter from the former. 
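Continuing the earlier sketch, the subtraction itself can be as simple as measuring how much of each stop interval overlaps a planned window. The intervals, the planned windows, and the assumption that planned windows do not overlap each other are all simplifications for illustration; the reconciliation work described next is where the real effort goes.

```python
from datetime import datetime, timedelta

# Hypothetical stop intervals (from the machine logs) and planned windows
# (from calendars and shift schedules). Assumes the planned windows do not
# overlap each other, otherwise the shared time would be double counted.
stop_intervals = [
    (datetime(2024, 6, 7, 12, 0), datetime(2024, 6, 7, 13, 0)),    # one hour midday stop
    (datetime(2024, 6, 7, 16, 30), datetime(2024, 6, 10, 8, 0)),   # Friday evening to Monday morning
]
planned_windows = [
    (datetime(2024, 6, 8, 0, 0), datetime(2024, 6, 10, 0, 0)),     # weekend
    (datetime(2024, 6, 10, 7, 0), datetime(2024, 6, 10, 8, 0)),    # scheduled maintenance
]

def overlap(a, b):
    """Duration shared by two (start, end) intervals, or zero if they do not overlap."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return max(end - start, timedelta())

total_stopped = sum((end - start for start, end in stop_intervals), timedelta())
planned = sum((overlap(s, w) for s in stop_intervals for w in planned_windows), timedelta())
unplanned = total_stopped - planned

print("Total stopped:", total_stopped)  # 2 days, 16:30:00
print("Planned:      ", planned)        # 2 days, 1:00:00
print("Unplanned:    ", unplanned)      # 15:30:00
```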
Before moving on, let\u0026rsquo;s pause here to talk about the realities of calculating planned downtime. All this data exists in different sources and needs to be reconciled. If all your machines exist in different time zones and geographies with different working laws, standards, and holidays, your data reconciliation needs to account for all of this.\nReconciling the squishy parts is challenging since you can\u0026rsquo;t always directly track what event really happened. For example, if your machine records a stop and then 10 minutes later a new operator logs in, was this a shift change? Why did it take 10 minutes? Was there a coffee break? Or was there an issue with the machine the first operator could not figure out, so they called a more experienced person to try and diagnose it? What if the operator/personnel changes aren\u0026rsquo;t recorded in the machine logs at all, but according to the shift schedule the operator should have changed? Why did it take 10 minutes? Was it a planned or unplanned stop? Is 3 minutes a better time for an operator change? Maybe 15?\nIn many cases, the only way to handle this is to just set a time limit for when something should be included in unplanned downtime. For example, any stoppage for less than 5 minutes is considered planned downtime if the cause is not known. 5 minutes or more is considered unplanned downtime if we cannot clearly identify it as planned downtime. Somebody needs to make a decision about the limit, and these types of squishy calculations always end up influencing downtime calculations. I\u0026rsquo;ll discuss this in more detail later.\nUnplanned Downtime # Now that we\u0026rsquo;ve talked about the stoppages we want to exclude from our downtime calculations, let\u0026rsquo;s discuss the stoppages we want to include (unplanned downtime). From a business and operations perspective, these are the stoppages we want to reduce or eliminate so our factory runs more efficiently.\nLet\u0026rsquo;s extend our story. We\u0026rsquo;ve taken our machine data, determined the stoppages, and removed everything overlapping with weekends, holidays, recurring maintenance periods, and all other known planned activities. The time remaining is unplanned downtime. Now the question is, is this good enough? Do we take this number and calculate it as a fraction of the total target runtime (e.g. 8 AM - 4 PM every day) and build our downtime KPI?\nTypically KPI building is not just an exercise in quantifying something. In a manufacturing setting, one of the reasons to have a KPI is so you can track and improve it. And to improve a KPI, you need to understand what is driving it. For downtime, you need to be able to identify the root cause of it so you know where to focus your remediation efforts. For example, if 70% of your downtime is caused by a single part breaking over and over again, you should focus on figuring out how to reduce or prevent it.\nDetermining these root causes forces us to both broaden and deepen the data we are using. To broaden our understanding of the environment in which this downtime is occurring, we need to look at more data sources. This enables us to dig deeper into our data to see what specifically is going on during a downtime. To make this point more clear, let\u0026rsquo;s look at some examples of the root causes of unplanned downtime.\nMachines Do Not Exist In Isolation # Machines can stop because of issues with the machines around them. If there are 10 machines in a row and the first machine feeds the second, the second feeds the third, etc., any issue with one of these machines will result in the other machines stopping. For example, if the second machine has an issue, all the machines after it will be \u0026ldquo;starved\u0026rdquo; because they are not getting the supply of what they need from the previous machine. This works the opposite way as well. 
If machine nine is having issues and cannot \u0026ldquo;consume\u0026rdquo; what is coming out of machine eight, every machine before nine has to stop.\nIn this scenario it\u0026rsquo;s important to realize you have to classify these unplanned downtimes differently. Out of those ten machines, only one machine is really down. The other machines are not really down; they are available and ready to run, but they are simply stopped. The machine with issues is not Available to run, but the other nine machines are Available to run. This distinction is important because you want to understand why your machine stopped, not just how long it stopped.\nThis idea of Availability is defined as part of a metric called OEE (Overall Equipment Effectiveness). OEE tries to provide a more robust metric of your manufacturing productivity in comparison to just looking at downtime or output. I\u0026rsquo;ll discuss OEE in more detail soon.\nIf we are calculating a downtime KPI for the entire factory, saying the factory is having unplanned downtime because a single machine has blocked all production is an accurate statement. But at a machine level, saying all ten machines are having unplanned downtime is misleading. It presents an inaccurate picture of what is really going on. Correctly attributing what is going on with each machine allows us to correctly identify the root cause.\nDrilling Down into Subsystems # Until this point, we\u0026rsquo;ve mostly focused on downtime at a machine level, or looking at the downtime for an entire machine. From a root cause perspective, this is an overly simplistic view. Some machines can be very complex and are better envisioned as multiple machines working together as one big machine. An analogy to this is a car. A car is a single machine, but it\u0026rsquo;s really a bunch of complex machines (e.g. the engine, the transmission, the heating/cooling system) working together. A modern car has multiple computers controlling each system, and the computers talk to each other, ensuring the car functions transparently as a whole. Complex machines in manufacturing work the same way.\nWhen we establish the root cause of downtime, our goal is to identify the specific part of the machine causing the downtime. In a complex machine, we\u0026rsquo;re really asking, \u0026ldquo;what part of which subsystem caused the downtime?\u0026rdquo;\nThis type of root cause attribution is typically done in two ways. First, we can look at the machine data for clues. Machine event logs should (hopefully!) contain events occurring before and after the stops. Some machines track a lot of events, and they may log a \u0026ldquo;reason\u0026rdquo; for the machine stoppage. Note this process is less trustworthy than it sounds. Sometimes the reason logged for the machine stop is not the cause of the issue.\nTo provide a historical example of an incorrect cause and effect, we can look at the time when smart phones started to have built-in GPS and people started using their phone navigation systems. Typically, the phone was placed on a phone holder on/above the dashboard, and was directly exposed to the sun. At the time, using the GPS system on the phone caused the phone to heat up considerably. On a hot sunny day, the combination of GPS usage plus direct sun (through the windshield) would result in the phone overheating and shutting down. 
If you were to look at the sensor and event logs of the phone, you\u0026rsquo;d probably see the GPS being used, the temperature steadily climbing to a limit, and then a shutdown event. Just looking at the data, the \u0026ldquo;cause\u0026rdquo; is the GPS. In reality, it\u0026rsquo;s really the direct summer sunlight and the design of the GPS system.\nTo add to the challenge of determining which subsystem is having issues, sometimes the data does not clearly identify when a single subsystem stopped. What you might have to do is infer the stop based on other machine activities. For example, if a routing system (e.g. a conveyor belt) log shows your machine stopped sending widgets to a particular subsystem, you have a sign something is going on. But is the subsystem down? Is there something broken with the routing system? Is a downstream subsystem having issues, blocking this subsystem? These are all possibilities, adding to the complexity of attributing downtime correctly.\nThroughput Issues # The last issue I want to discuss is throughput or reduction in production output. If your machine is able to produce 100 widgets per hour, but there is an issue reducing this to 50 an hour, there is clearly a reduction in performance. Performance is a metric used in OEE.\nIf the machine output is lower than expected, this isn\u0026rsquo;t considered downtime for the overall machine. However, it\u0026rsquo;s possible one of the subsystems stopped working and is the reason for the production loss. That subsystem should be counted as having downtime.\nOne scenario where a subsystem can be down, but the machine is still running, occurs when a machine has redundant systems doing identical work. Let\u0026rsquo;s say a machine has four identical subsystems to clean bottles. When a bottle arrives to be cleaned, it\u0026rsquo;s routed to the subsystem which is available to receive it. When one of the subsystems stops working, there are three others working. Redundant systems offer the dual benefit of increasing throughput by doing work in parallel, and they also help ensure your manufacturing line isn\u0026rsquo;t completely stopped because you only have one subsystem for a particular task.\nThroughput loss is just another example of the complexity you see in the real-world, making downtime calculations less straightforward than it initially seems.\nRedefining Downtime # With all this new information, we should revisit our downtime definition.\nDowntime is a calculation of how much time a machine is not running when it should be running. In more formal terms, Downtime is how much unplanned time the machine has not been running. At least in theory Downtime can also be calculated by taking the total time the machine was not running and subtracting the planned downtime. It\u0026rsquo;s also important to correctly identify when a machine was down because it was having an issue, or if it was available but stopped.\nThe Data Sources needed to calculate Downtime are:\nMachine Data Logs for Start/Stop times for all machines so we can identify which machines are truly down, and which are available but blocked due to other factors. Other Machine Data sources, including operator logs, subsystem logs, and sensor data, allowing for downtime tracking on subsystems, and also for root cause identification. Company Calendars (Oracle/SAP/Spreadsheets/etc. Hopefully not Paper) Shift Schedules (ERP and Other Systems. Hopefully not paper) A source containing \u0026ldquo;Squishy\u0026rdquo; stops. 
For example, the Machine Data Logs may record when the machine operator changed (e.g. a new operator logs into the machine), the machine was stopped for a die change, etc. This might be in a different log file than the start/stop log. Some reconciliation needs to be done between the different log files to correctly calculate the stop times. Service Data to see what repair/maintenance work was done. This data can help identify the root cause of issues. Squishy Stops # I\u0026rsquo;ve talked a bit about \u0026ldquo;squishy\u0026rdquo; stops but haven\u0026rsquo;t really provided many details. Realistically, you can\u0026rsquo;t clearly identify the cause of every stop. The reason I called it \u0026ldquo;squishy\u0026rdquo; is because it\u0026rsquo;s not clear if a stop falls into the planned or unplanned bucket and there\u0026rsquo;s limited data to strongly support either case. What happens is you end up in a situation where you have short stops of various lengths (e.g. 1 minute, 3 minutes, 7 minutes, 10 minutes). You might feel like ignoring or discarding these, but they can add up to a lot of time, which can greatly change your downtime KPI. This forces you to include them in your KPI calculations.\nA typical way to deal with squishy stops is to have a cutoff based on the expertise of the manufacturing and operations people. For example, they might know most of the quick maintenance activities (e.g. refilling a fluid) can be done in under 5 minutes and propose that as a cutoff. Anything shorter than 5 minutes is considered planned (and should not be included in the downtime KPI), and anything longer than 5 minutes should be considered unplanned downtime unless it can be clearly attributed otherwise.\nOne challenge with using a cutoff is moving this value by a seemingly insignificant amount (e.g. 5 minutes to 6 minutes) can result in a meaningful change in the downtime KPI. This change could be large enough to move a conversation from \u0026ldquo;KPI looks okay but could be better\u0026rdquo; to \u0026ldquo;This is a significant problem and we need to escalate.\u0026rdquo; When the KPI is built, it\u0026rsquo;s important you trust that the experts picked the cutoff value for the right reasons, and not to game the KPI.\nOEE (Overall Equipment Effectiveness) # You might be reading this article and asking yourself how manufacturers deal with all of this and come up with useful KPIs. Note the challenges of tracking downtime, and figuring out what is downtime and what is not, are well recognized in the industry. There\u0026rsquo;s an industry standard metric called OEE defining this formally and standardizing it.\nOEE is a combination of three KPIs: Availability, Performance, and Quality. We\u0026rsquo;ve already talked a bit about Availability and Performance, but let me summarize. Availability measures the percentage of time your machine is available when it should be available. Availability is directly related to downtime. Performance is not about downtime, but really about throughput. It\u0026rsquo;s a measure of how much you are producing in relation to some theoretical maximum.\nQuality measures what percentage of what you produce is good (i.e. not defective). This metric isn\u0026rsquo;t directly related to downtime or this article, but it does relate to predictive maintenance. 
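To show how the three components combine, here is a small sketch with made-up numbers. The standard top-level formula (Availability x Performance x Quality) is well established, but every input below is a hypothetical value chosen only for illustration.

```python
# A rough sketch of how the three OEE components combine, using made-up numbers.
# The exact definition of each input varies by site and by what data is available.
planned_production_time = 8.0   # hours the machine was supposed to run
run_time = 7.0                  # hours the machine actually ran
ideal_rate = 100                # widgets per hour at the theoretical maximum
total_count = 630               # widgets actually produced
good_count = 600                # widgets that passed inspection

availability = run_time / planned_production_time    # 0.875
performance = total_count / (ideal_rate * run_time)  # 0.90
quality = good_count / total_count                    # ~0.95
oee = availability * performance * quality            # ~0.75

print(f"Availability={availability:.2f} Performance={performance:.2f} "
      f"Quality={quality:.2f} OEE={oee:.2f}")
```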
Something I like about OEE is there is flexibility in how the calculations for the underlying KPIs are defined. When calculating OEE, it\u0026rsquo;s understood different machines and manufacturing facilities produce different data, and people can only define things like Availability and Performance based on the data they have. So there\u0026rsquo;s room to deal with the squishiness of the data behind these KPIs.\nhttps://www.oee.com/ has some great information on OEE, including examples of how OEE is calculated, and also a scenario where OEE is used to identify the root causes of low OEE and how to address it.\nData Quality and Predictive Maintenance # Calculating downtime to a reasonable degree of accuracy requires us to reconcile all the data sources described in this article. This reconciliation is much harder than many people realize. Here are a few examples of why.\nDifferent machines are built by different machine builders (vendors), and the data collected from each machine can be very different from other machines. Different machines record different \u0026ldquo;levels\u0026rdquo; of stops depending on how the machine was programmed. One machine might only record hard stops, such as safety stops and other serious stops. Some machines log 10 different kinds of idle modes and multiple types of stop signals, so a lot of logic needs to be coded to determine when the machine actually stopped running. The amount of data recorded from each machine is dependent on the technology available when the machine was built. When a particular machine was designed in 2018, it was safe to assume customers\u0026rsquo; factories would have high speed networking and they would not be surprised if a machine generated a gigabyte (uncompressed) of data per week. If the machine was designed in 2005, engineers would have a very different idea of customers\u0026rsquo; data networks and capabilities, meaning machines would collect and report much less data. Matching different datasets at a minute to minute level can be extremely difficult, and sometimes impossible if different datasets are reporting data at different frequencies. Operator Data, Service Data, Shift Schedules, etc. are all in different systems and created using different technologies and interface elements. Old touchscreens (which people hate using), modern tablets (which are inconvenient for typing large amounts of text), laptops, drop downs, free text fields, UIs in a browser, a 15 year old application running on Windows 7, etc. Paper is also more ubiquitous than people realize, and it usually ends up being manually transcribed into Excel. Companies have very different tolerances for how much data to record and store. This has greater implications for predictive models than calculating downtime, but it means sometimes data simply isn\u0026rsquo;t saved in a way allowing people to use it for analytics. I\u0026rsquo;ve got two anecdotes to add some concreteness to the above.\nI was asked by a manufacturing operations team to help \u0026ldquo;predict downtime\u0026rdquo; which they didn\u0026rsquo;t clearly define. When they sent me data extracts from the machine, there were large gaps. The machine log would show a stop event, followed by a large gap of time with no data, and then there was data showing the machine was up and running without any signs of starting back up. The opposite happened as well. The machine would be running, there would be a gap in the data, and then it would be running again, only with signs there was a stop and restart. 
It wasn\u0026rsquo;t clear to anybody if this missing data was not being recorded, if there was a bug in the logging, if the network was failing and data was getting lost, or what was really going on. The only stops clearly being recorded were safety or sudden manual stops performed by a person. The only way they could fix this would be to call an industrial engineer to check and possibly update the software on the machine, which they weren\u0026rsquo;t willing to do because the machine was very old. Nobody wanted to take a chance of breaking the machine software.\nIn another case, there was a complex machine (with many subsystems) which generated lots of different logs at different levels of event granularity. The most detailed log generated about 3 Terabytes of data a month. This was for a single machine, and they had 500+ of them globally. Using this detailed log would give them a much more robust way to understand downtime, but they simply didn\u0026rsquo;t see the business need to be so precise in their KPI. That data might have enabled other predictive maintenance use cases, but the business justification just wasn\u0026rsquo;t there to collect and store it.\nWrap Up # Let me go back to what I said at the start of this article. Predictive Maintenance is fundamentally about reducing downtime and optimizing how a machine works. A model may be predicting the failure of a particular part, detecting abnormal operations, or trying to optimize the quality of something produced. At the end of the day, downtime and/or a machine running sub-optimally means you produce less, it costs you more to produce whatever you are producing, which reduces your profits. Every predictive maintenance model is simply a small way of tackling one of these two problems.\nStatistics, Machine Learning, and Deep Learning (AI) exist because many signals or trends cannot be detected by simple rules or looking at charts. We use these algorithms when we know the data is noisy and the logic to separate the signal from the noise is complex. To build a good model, you need good data. And to have good data for the model to learn from, a person needs to clearly identify when an issue has occurred or not occurred so an algorithmic model can learn to detect or predict this.\nWhat I\u0026rsquo;ve tried to show in this article is how challenging it can be to even calculate something as simple sounding as downtime. If something so basic and essential has a \u0026ldquo;squishy\u0026rdquo; component, try to imagine how complex it is to build a well labeled dataset for a predictive model. And there are \u0026ldquo;squishy\u0026rdquo; components in building these training datasets as well, which means if you change something as simple as a cutoff, the effect on the model can be very dramatic. I talked about this in detail in my last article about the effects of label noise on model performance.\nI realize using the challenges in calculating downtime as evidence to support the challenges with building a model training dataset is quite a leap in logic. But I hope this article gives people a glimpse into why so many predictive maintenance models flounder and struggle due to data issues, and why data and data quality investments are worth it.\n","date":"31 July 2024","externalUrl":null,"permalink":"/posts/calculating-downtime-is-harder-than-it-sounds/","section":"Posts","summary":"Predictive Maintenance is fundamentally about two goals - keeping your machines running, and making them run in the best way possible.  
In other words, focus on reducing downtime and continuously optimizing how your machines run.  The concept of downtime is easy to describe and understand, but calculating it can be much more complex than people realize.  In this post I’ll walk through calculating downtime for a factory machine, and how the complexity of the calculation reveals why using predictive models for predictive maintenance is so challenging.","title":"Calculating Downtime is Harder Than it Sounds","type":"posts"},{"content":"Introduction # People often underestimate how false alarms from predictive models can erase all the business value you set out to create from those models. In this post we will explore how label noise can drive up your service costs instead of decreasing them, and how a small fraction of incorrect predictions of impending machine issues can erase all the benefit created by correct predictions.\nIn my experience, many predictive maintenance projects follow this pattern:\n1. A financial analysis is performed to understand the current cost of servicing a machine or a fleet of machines. People look at different ways to reduce these costs, and will estimate how much proactive service (e.g. fixing issues before they happen, better scheduling of predicted service issues, etc) will reduce costs. If the estimated cost savings of proactive service are high enough, a predictive maintenance program will be proposed.\n2. An internal and/or external team of Data people will be created to build the infrastructure and models (or rules) to predict future machine issues.\n3. Models will be built and deployed for a year or two before the issues and cracks in value realization start to appear. This is assuming some external event (e.g. a recession) doesn\u0026rsquo;t force an early analysis of what is going on.\n4. The competent data people will start to recognize the theoretical financial benefit of each model is not being achieved. The honest ones will start openly grumbling about it, maybe not in the most productive ways.\n5. The lack of cost savings becomes apparent to service management. How quickly they realize (or acknowledge) it will be highly dependent on many factors. For example, if there are a lot of false Alarms causing a significant increase in costs, this will be recognized faster. Sometimes the growing effect of false alarms is more insidious and it takes longer for people to realize what is going on.\n6. At this point people will conduct a more rigorous analysis and realize the predictive models are not reducing costs. One benign scenario is the models generating zero or a very low number of alarms, so there is little or no proactive service occurring. A more damaging situation is if there are a lot of incorrect alarms and people are doing work that doesn\u0026rsquo;t need to be done, actually driving up service costs instead of reducing them. We\u0026rsquo;ll talk more about this later in this article.\n7. To help fix the situation, one way forward is for the service and data people to work together to come up with a more accurate way of evaluating the real financial benefit of each model. They then rework or scrap existing models. In a worst-case scenario the entire predictive maintenance program is scrapped, people are laid off, and nobody really understands the root cause, dooming this whole process to be repeated in the future.\nThis article is about the latter case in point 6. 
My goal is to provide insight into how bad Alarms, or false positives, from predictive models can drive up service costs, even in the case where a model is providing a lot of good/accurate Alarms. This will hopefully help shorten the time between points 4 and 7 above. In an ideal situation, these types of issues will be identified in step 4, before a model is even put into production, so the service people never see a growing divergence between the theoretical and actual cost savings provided by a model.\nThe term Alarm is used throughout this article, so defining it up front will help clear up any questions about what it means. An Alarm means a prediction from a predictive model indicating some event or failure is going to occur. Predictive Models can also predict something is not going to occur, but this is not considered an Alarm in this article. Not every prediction is an Alarm; only the predictions of some undesirable event happening are considered an Alarm.\nWhat is Label Noise? # Imagine you have a machine with a water hose, and one of the ways this hose fails is by cracking and leaking. As a Data Scientist, you are asked to predict this issue before it occurs. To do this, you go into the machine service data to find historical examples where this issue has occurred so you can see what the machine data looked like right before the hose cracked. You can then compare this to machine data where this crack did not occur.\nThe service data is the historical record of issues customers have had with their machines, and what work was done to fix these issues. For this particular issue, when a service technician goes onsite to fix it, they enter the cause of the issue in the service ticket, and what they should be entering is some variation of \u0026ldquo;cracked hose.\u0026rdquo; However, during at least one instance where this issue occurred, the technician entered \u0026ldquo;fluid leak\u0026rdquo; instead, making it challenging to clearly identify if it was the hose or something else that was leaking.\nWhen you train a model on this data, what you are doing is providing the model with examples of when the hose cracked and when it did not, and tasking the computer to learn the difference between the machine conditions when cracks occur and when they do not. The more examples you have, the better the model is able to determine the difference.\nThe examples and data you provide to train the model are typically created (e.g. Feature Engineering) by an Analyst or Data Scientist, and they are based on the information in the historical service data. However, when the service data has errors or lacks clarity, you can end up in a situation where the examples are incorrect. This results in the model learning the wrong things from the examples, which leads to incorrect predictions. With more incorrect examples, you\u0026rsquo;ll get more incorrect predictions.\nEach example used to train this type of predictive maintenance model has a Label indicating if an issue occurred or not. And this Label is how the model\u0026rsquo;s algorithm can tell the difference between machine data indicating a crack will occur, or if it probably won\u0026rsquo;t occur. In an ideal world, the labels would be 100% correct. But as you can see from the example of the service technician incorrectly entering the cause of the issue, it\u0026rsquo;s extremely difficult to ensure all the labels are correct. This results in Label Noise in our data.
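To make this concrete, here is a small, hypothetical sketch of how service tickets often get turned into training labels. The ticket text and the keyword rule are invented for illustration; real labeling logic is usually more involved.

```python
import pandas as pd

# Invented service tickets for the cracked-hose example.
tickets = pd.DataFrame({
    "machine_id": [101, 102, 103, 104],
    "ticket_text": [
        "replaced cracked hose on coolant line",
        "fluid leak observed, tightened fittings",   # really a cracked hose, logged vaguely
        "quarterly preventive maintenance",
        "cracked hose found during inspection",
    ],
})

# A simple keyword rule a Data Scientist might use to build training labels.
tickets["label_cracked_hose"] = (
    tickets["ticket_text"].str.contains("cracked hose", case=False).astype(int)
)

print(tickets)
# Machine 102 ends up with label 0 even though the real cause was a cracked hose.
# That single wrong row is label noise, and the model will happily learn from it.
```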
You might ask, \u0026ldquo;Can\u0026rsquo;t a Data person manually check the labels to make sure they are accurate?\u0026rdquo; It might be possible to do this with a very simple machine, but complex machines can have a lot of machine and sensor data, making it impossible for a person to manually check. There could also be an entire fleet of machines, all working in very different environments (e.g. different geographies and climates) making the data more complicated. So there\u0026rsquo;s no realistic way for a person to manually validate every label.\nLabel Noise Sources # For people who are curious, here are some possible sources of label noise or incorrect entries in service data.\nMisdiagnosed problems (e.g. thinking a hose is leaking due to a crack, when it\u0026rsquo;s actually incorrectly connected to the machine).\nProblems being diagnosed by trial and error, resulting in confusing and contradictory information in the service ticket.\nUsing ambiguous words to describe the work that was done. An example is using the word \u0026ldquo;fix\u0026rdquo; when a part is replaced. Yes, the issue was fixed, but the fix was to \u0026ldquo;replace\u0026rdquo; something, not perform some repair on that part.\nLimitations in the software used to enter the service information. Example: a technician has to fix multiple issues, but their ticketing system only allows them to enter one issue per ticket, and they don\u0026rsquo;t want to open multiple tickets. A technician using a pre-populated drop down and being forced to pick the best option, not the right one. Accidentally picking the wrong item in a drop down because they are using a tablet and their fingers.\nIssues arising from multiple languages being present in the data, and words having different meanings in different languages.\nThe above is not a comprehensive list. Going into the details of where this label noise comes from is beyond the scope of this article, and is sometimes very specific to the type of machine and process you are working with.\nThe Financial Benefits of Predictive Maintenance # Before I get into the impact of label noise, let\u0026rsquo;s talk about the theoretical financial benefits of predictive maintenance.\nIt\u0026rsquo;s important to recognize that for commercial machines, the baseline maintenance cost is not $0. Instead, you actually start from a negative. Imagine you have a machine requiring, at a minimum, a quarterly service which costs $2500 for parts and labor. That machine costs you $10K a year to run, so it needs to generate at least $10K in yearly value for you to break even.\nLet\u0026rsquo;s now say the company building that machine says they have redesigned one of the parts, and now you only need to replace it once every six months instead of replacing it every quarter. Replacing this part twice a year instead of once a quarter will reduce your costs by $1k. So now, instead of spending $10K/year, you are spending $9K/year.\nIn the real world, machines are not only serviced during regular service intervals, but also when things break or are not running correctly. This is what is known as reactive service. Basically, when there is a machine issue, somebody reacts by fixing it themselves, calling somebody else to fix it, etc. Reactive Service happens as a reaction to some issue with the machine.\nThis combination of regular and reactive service is your baseline for service costs.
If you find you have $10K in regular planned service and $5k in reactive service, your total cost is $15K. If that is your cost per machine and you have 10 machines, your annual service costs are $150K. So when you say you want to reduce your costs, $150K is the baseline you want to improve upon.\nReactive service disrupts normal operations and drives up costs. It\u0026rsquo;s no different than your home cooling system breaking in the middle of summer during a heatwave, as compared to it breaking during a mild spring. In the former case, you will struggle to find somebody to fix it, parts may be expensive or not available, and all the while you are living in your uncooled house. You\u0026rsquo;ll essentially be forced to pay the highest price required to get it fixed quickly. If the weather is mild, waiting a few days for the fix in order to save a lot of money is not an issue.\nThis is the benefit of proactive and predictive maintenance. Instead of reacting to issues, you get ahead of issues. Having this advance information gives you the ability to plan and schedule things in a way that allows you to reduce your costs and the impact on your operations.\nNow that we\u0026rsquo;ve got the background covered, let\u0026rsquo;s talk about false predictions.\nThe Cost of Incorrect Predictions # False Negatives Might be Okay # Imagine you have a model continuously ingesting all your machine data and predicting if an issue will occur. This model runs twenty-four hours a day, seven days a week, and generates hundreds, if not thousands, of predictions a month. This model is also terrible, and never predicts any issues are going to occur. In other words, it generates zero Alarms. What is the financial impact on your service costs?\nThe answer is the impact is basically zero. Having a model that never predicts an issue is going to occur is the same as not having a model at all. Before you had a model, all issues were dealt with reactively and your baseline service costs were based on reactive service. If a model always predicts nothing is going to happen, you\u0026rsquo;re only going to fix things reactively after the issue occurs. So your service costs with this terrible model are the same as before you had it.\nWhen a model falsely says an issue will not occur but it does occur, this is considered a False Negative. We\u0026rsquo;re not going to talk about false negatives any more in this article, but it\u0026rsquo;s good to understand what it is so we can mentally contrast it with a false positive.\nFalse Positives are the Real Problem # A False Positive is when your model predicts an issue is going to occur, but it wasn\u0026rsquo;t actually going to occur. If you perform service activities based on that prediction, you are basically doing work that doesn\u0026rsquo;t need to be done and replacing parts that do not need to be replaced. A false positive is much worse than a false negative because unlike the latter, which has minimal impact on your overall service costs, a false positive increases your costs. A lot of false positives can drive up your costs significantly.\nLet\u0026rsquo;s do some armchair math to understand how much of an impact this can have. Let\u0026rsquo;s say you have an issue which occurs 100 times a year. Dealing with this issue reactively costs you $500 per occurrence, or $50K a year. You look into the benefit of dealing with this proactively, and you conclude that instead of $500, a correct prediction results in your service cost being $250 (parts and labor). So you save $250 compared to the reactive service cost. However, every false positive will cost you $500 because you are doing unnecessary work, and the parts and labor costs are the same as if you had done this reactively.
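Here is a quick sketch of that arithmetic so you can plug in your own counts; the dollar figures are the ones from the example above.

```python
# Cost model from the example above: each correct Alarm saves $250 versus a
# reactive fix, and each false positive costs an extra $500 of unnecessary work.
SAVINGS_PER_TRUE_POSITIVE = 250
COST_PER_FALSE_POSITIVE = 500

def net_benefit(true_positives: int, false_positives: int) -> int:
    return (true_positives * SAVINGS_PER_TRUE_POSITIVE
            - false_positives * COST_PER_FALSE_POSITIVE)

for tp, fp in [(100, 0), (95, 5), (90, 10), (80, 20), (70, 30)]:
    print(f"{tp} good Alarms, {fp} false Alarms -> net benefit ${net_benefit(tp, fp):,}")
```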
Based on those numbers we have the following calculations:\nNumber of Alarms | True Positives | False Positives | Value of True Positives | False Positive Costs | Total Financial Benefit of the Model\n100 | 100 | 0 | $25000 | $0 | $25000\n100 | 95 | 5 | $23750 | $2500 | $21250\n100 | 90 | 10 | $22500 | $5000 | $17500\n100 | 80 | 20 | $20000 | $10000 | $10000\n100 | 70 | 30 | $17500 | $15000 | $2500\nIt\u0026rsquo;s important to notice how quickly the financial benefit of the model gets cut in half. You might think eighty correct predictions and twenty incorrect predictions is great and the model is doing really well, but it\u0026rsquo;s really not. The value of the model quickly goes to zero, or even negative, as the number of false positives goes up.\nHow Label Noise Increases False Positives # Now that we have discussed the cost of a false positive, let\u0026rsquo;s go back to the idea of how label noise results in false predictions. I\u0026rsquo;ve created a simulation to understand how label noise affects the predictions. The simulation first creates machine data with labels. As part of the data generation process, the amount of label noise can be specified. For example, if 10% label noise is specified, 10% of the \u0026ldquo;Issue Happened\u0026rdquo; labels are randomly flipped to \u0026ldquo;Issue Didn\u0026rsquo;t Happen.\u0026rdquo;\nAfter generating the data, the simulation trains a model on a training dataset, and makes a lot of predictions on a test dataset. The goal is to generate Alarms and then determine the number of true and false positives. I then normalize the Alarms to 100 Alarms to make it easier to mentally calculate percentages when you look at the results. If you\u0026rsquo;d like to see the code, refer to the Jupyter Notebook.\nLet\u0026rsquo;s start with a model that predicts a single issue. This model either predicts a zero (no issue is going to occur) or a one (the issue is going to occur). In the table below, the \u0026ldquo;unnecessary work\u0026rdquo; column is the false positives. I say unnecessary work because a false positive asks us to fix something that doesn\u0026rsquo;t need to be fixed.\nThese cost numbers are repeated here so you don\u0026rsquo;t have to scroll up.\nCost of a False Positive: $500\nCost of a True Positive: $250\nMaximum Cost Savings with 100% True Positives: $25000\nNumber of Alarms | Percentage Label Noise | Unnecessary Work | Financial Cost of False Alarms | Final Cost Savings\n100 | 0 | 1 | $500 | $24500\n100 | 5 | 8 | $4000 | $21000\n100 | 10 | 9 | $4500 | $20500\n100 | 20 | 13 | $6500 | $18500\nYou can see the trend here. As the label noise increases, so do the false positives. And as the false positives increase, your actual cost savings go down. How \u0026ldquo;bad\u0026rdquo; this is for your business case depends a lot on what your expectations are. If your predictive maintenance program only makes financial sense if you fully realize that $25K in cost savings, this table says your program isn\u0026rsquo;t going to work. Even in the case of 0% label noise, you are going to have false positives. Nobody has perfect data and you are not going to save $25K (per 100 Alarms) with proactive service.\nHowever, if your program makes sense at $18K in cost savings, your program might be viable. You just have to ensure your data quality is decent and you can keep your label noise under control.
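The Jupyter Notebook has the full code; the following is only a stripped-down sketch of the same idea (synthetic data, a fraction of the positive training labels flipped, false positives counted against the clean test labels), so the exact numbers will not match the tables here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def savings_per_100_alarms(label_noise: float, seed: int = 0) -> float:
    """Train on noisy labels, then score the Alarms against the clean test labels."""
    X, y = make_classification(n_samples=20_000, n_features=20, n_informative=8,
                               weights=[0.9, 0.1], random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed, stratify=y)

    # Flip a fraction of the "Issue Happened" labels in the training data only.
    rng = np.random.default_rng(seed)
    positives = np.flatnonzero(y_train == 1)
    flipped = rng.choice(positives, size=int(label_noise * len(positives)), replace=False)
    y_noisy = y_train.copy()
    y_noisy[flipped] = 0

    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train, y_noisy)

    alarms = model.predict(X_test) == 1
    true_pos = np.sum(alarms & (y_test == 1))
    false_pos = np.sum(alarms & (y_test == 0))

    # Normalize to 100 Alarms and apply the $250 savings / $500 cost model.
    scale = 100 / max(alarms.sum(), 1)
    return 250 * true_pos * scale - 500 * false_pos * scale

for noise in [0.0, 0.05, 0.10, 0.20]:
    print(f"{noise:.0%} label noise -> savings per 100 Alarms: ${savings_per_100_alarms(noise):,.0f}")
```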
Unfortunately, the scenario where a machine only has one possible issue is very optimistic and isn\u0026rsquo;t realistic. Machines can fail in multiple ways, so let\u0026rsquo;s extend this scenario. Instead of our machine only having one issue we want to predict, we want to predict two issues, issues A and B. To make the math easier, let\u0026rsquo;s say the costs of issue B are the same as our previous costs for A, which is $500 for a reactive fix and $250 in savings for a proactive fix. The model now has three possible predictions: no issue predicted, issue A predicted, or issue B predicted.\nAdding this issue B changes our conceptualization of a false positive in a seemingly minor way, but with strong financial implications. Remember, in the previous case, the cost of a false positive was $500. Let\u0026rsquo;s say our model predicts B is going to occur on a particular machine, but this is incorrect. Actually, issue A was going to occur. We now have spent $500 for the unnecessary work done to fix the issue B that was not going to occur and $500 to reactively fix issue A when it occurs in the future. So the cost of a false positive is $1000 in this case, not just $500.\nWe can look at the impact of this by performing a similar simulation to the one done before. One of the benefits of doing this via simulation (as opposed to using real service data) is being able to see which of the false positives were unnecessary work (we predicted an issue when no issue was going to occur) and which were incorrect work (we predicted the wrong issue was going to occur). This allows us to split out which false positives cost $500 and which cost $1000, so we can better see the cost impact.\nAs a reminder, the maximum potential cost savings if all the Alarms are correct is $25K.\nNumber of Alarms | Percentage Label Noise | Unnecessary Work | Incorrect Work | Financial Cost of False Alarms | Final Cost Savings\n100 | 0 | 5 | 8 | $10500 | $14500\n100 | 5 | 5 | 8 | $10500 | $14500\n100 | 10 | 5 | 14 | $16500 | $8500\n100 | 20 | 6 | 15 | $18000 | $7000\nIn this scenario, you can see over 50% of the potential cost savings is gone with just 10% label noise. Also note the \u0026ldquo;unnecessary work\u0026rdquo; is fairly stable. Most of the losses are from the \u0026ldquo;incorrect work\u0026rdquo;, which is when the model predicts the wrong issue is going to occur, resulting in double the work.\nThe other interesting observation is how much value is already lost at the no-noise baseline compared to the previous model that only predicted a single issue. When predicting two issues, $10K of value is erased even when the labels are all correct. If you had this information before you built a model and were deciding if it was worth the effort to develop the model, would you do it?\nKeep in mind what you are seeing in this table is partly an artifact of the model and methodology I used. It\u0026rsquo;s possible to use different models and tune them so the effect of label noise is different (e.g. reducing the number of Alarms and increasing the accuracy of each Alarm, but at the cost of more missed issues/more false negatives). The trends you are seeing in this table are illustrative of a particular methodology, and are not the gold standard for what you will see with your models. You should validate this with your own data.
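Here is a quick sketch of the cost accounting for the two-issue case, following the same per-100-Alarm framing as the table above.

```python
# Two-issue cost model from the example above. The accounting mirrors the table:
# start from the maximum savings for 100 correct Alarms and subtract false alarm costs.
COST_UNNECESSARY_WORK = 500    # predicted an issue, but nothing was going to happen
COST_INCORRECT_WORK = 1000     # predicted issue B, but issue A happens and is fixed reactively
MAX_SAVINGS_PER_100_ALARMS = 100 * 250

def final_cost_savings(unnecessary: int, incorrect: int) -> int:
    false_alarm_cost = (unnecessary * COST_UNNECESSARY_WORK
                        + incorrect * COST_INCORRECT_WORK)
    return MAX_SAVINGS_PER_100_ALARMS - false_alarm_cost

# (unnecessary work, incorrect work) pairs from the rows of the table above.
for unnecessary, incorrect in [(5, 8), (5, 14), (6, 15)]:
    print(f"{unnecessary} + {incorrect} false alarms -> savings ${final_cost_savings(unnecessary, incorrect):,}")
```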
Answers to Possible Questions # What\u0026rsquo;s a realistic percentage for how much label noise I will see in my data?\nI\u0026rsquo;ve seen so much variance in data quality that it\u0026rsquo;s very difficult to estimate this. It\u0026rsquo;s possible for a single service dataset to have very different noise percentages for different issues or labels, so it\u0026rsquo;s challenging to provide a single number.\nIn my experience, 10% label noise is usually very good and very rare. It\u0026rsquo;s usually much worse.\nIf label noise is so pervasive, how can any predictive maintenance projects succeed?\nThe goal of this article was not to dissuade your efforts around predictive maintenance, but to demonstrate how you need to think about goals as a tradeoff, not an absolute. Instead of deciding that proactive maintenance is going to hit a particular cost savings target and building as many models as possible to hopefully hit that target, think about doing a rigorous analysis of your data and service costs to better understand how much you can really save. Figure out how many good alarms you need and if that is achievable with the data you have. Try to determine the best and worst case for how many bad alarms you might get, and if the financial impact of those will still allow your program to make sense financially. If your numbers still look good after doing that analysis, do a quick pilot project or run some experiments to see if you can validate those cost savings in the real world.\nI\u0026rsquo;ve talked about this in more detail in another article where I wrote about creating success criteria for predictive maintenance projects.\nDoes a proactive service visit really cost the same as a reactive service visit?\nOne of the assumptions I made was that a false positive costs the same as a reactive service visit. I made this assumption to make the math easier, but I don\u0026rsquo;t agree with it. I think the cost of a false alarm should be higher than the cost of doing the exact same service reactively because of the negative effect it has on your service operations.\nConsider the logistics of a service organization. At least in my experience, service technicians and other service personnel aren\u0026rsquo;t sitting around waiting for things to do. There is an opportunity cost to having them do unnecessary work. If a company has limited personnel and they are being assigned to a false positive, that probably means somebody else with a genuine issue is waiting for service technicians to become available. The technician could also be using up a limited stock of parts during an unnecessary service, resulting in a customer with a genuine issue waiting for a fix due to a lack of parts. This opportunity cost needs to be accounted for somehow, and the cost of a false positive needs to be more than the cost of the same reactive service. I leave it to you to decide how much more this should be based on the service challenges you see in your company.\nInstead of using one model to predict two issues, can I use two models, each of which predicts a single issue? Will that reduce the number of false positives?\nThe answer to the first question is yes, you can build models that only predict a single issue. Many organizations do exactly this, as it can be easier to define the use case and also evaluate and track the quality of the Alarms this way.\nThe answer to the second question is less clear.
Something to keep in mind is that even if your model is built to predict only one issue, your machine and service data may not clearly separate out the issue you are predicting from other issues occurring at the same time. To better understand this, let\u0026rsquo;s go back to the example I gave about a cracked and leaking water hose.\nOne of the ways to detect a water leak is to track water pressures. However, these measurements are not likely to measure only the water pressure of a single hose, but the water pressure throughout your system. So if other issues are affecting the water pressure, those will influence the water pressure data being collected. This, in turn, will have an effect on your predictions since you can\u0026rsquo;t cleanly isolate other issues from the issue you are trying to predict. So it\u0026rsquo;s possible the financial costs of the false alarms from two models predicting separate issues will be the same as using a single model. You just have to try it and see which works better.\nTake Aways # If nothing else, there are two major things you should take away from this. The first is that false positives, or incorrect Alarms, can quickly eliminate all the business value of your predictive maintenance efforts. Don\u0026rsquo;t let statements like \u0026ldquo;Our model is 85% accurate\u0026rdquo; trick you into thinking that 85% is giving you five times more value than the 15% is taking away. Do an honest assessment to see if you are really getting the value you need out of a predictive model.\nThe second take away is about the value of improving data quality. Data People frequently complain about underinvestment in data quality and how other people don\u0026rsquo;t really understand the problems caused by poor data quality. I think part of the reason this happens is because most people don\u0026rsquo;t look at raw data, they look at summary KPIs. They then unconsciously assume that if they can drive major business decisions with their KPIs, their data quality is good enough for everything else.\nAn example of this can be seen in the way people sometimes evaluate how machine issues are driving overall service costs. One way to approach this is using the Pareto principle and focusing on the issues driving the most costs. You can take one year of data and add up the number of times each issue occurred, and multiply that by the financial cost of each issue. In this way, you\u0026rsquo;ll see which issues are costing you the most annually. Even if 20% of the issue labels are incorrect, the ranking of the individual items in the Pareto might be a bit different, but the big cost drivers will probably still be the big cost drivers, and the small issues are less of a cost concern when compared to the big issues. So even with 20% label noise, your Pareto breakdown is still essentially correct, and you might assume the data is good enough for anything you want to do with it.
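To see why the Pareto view can look fine even with noisy labels, here is a small, invented example: corrupt 20% of the issue labels and the biggest cost drivers usually keep their spots. All counts, costs, and issue names are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Invented annual issue counts and per-occurrence costs, for illustration only.
issues = pd.DataFrame({
    "issue": ["cracked hose", "pump failure", "sensor fault", "loose fitting", "software hang"],
    "occurrences": [120, 40, 300, 80, 15],
    "cost_per_occurrence": [500, 2000, 100, 250, 900],
})

def annual_cost_ranking(occurrence_counts: np.ndarray) -> pd.Series:
    """Annual cost per issue, sorted from biggest cost driver to smallest."""
    cost = occurrence_counts * issues["cost_per_occurrence"].to_numpy()
    return pd.Series(cost, index=issues["issue"]).sort_values(ascending=False)

# Simulate 20% label noise: each occurrence has a 20% chance of being logged
# under a randomly chosen (possibly wrong) issue instead of the true one.
clean_counts = issues["occurrences"].to_numpy()
noisy_counts = np.zeros_like(clean_counts)
for true_issue, count in enumerate(clean_counts):
    for _ in range(count):
        recorded = rng.integers(0, len(issues)) if rng.random() < 0.2 else true_issue
        noisy_counts[recorded] += 1

print("Clean ranking:\n", annual_cost_ranking(clean_counts), sep="")
print("Noisy ranking:\n", annual_cost_ranking(noisy_counts), sep="")
# The biggest cost drivers usually keep their spots, which is why a KPI-level
# Pareto can look fine while the labels are still too noisy for model training.
```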
As we\u0026rsquo;ve seen, that 20% matters a lot when it comes to getting the business value you expected from predictive models. In many cases, fixing those data quality issues isn\u0026rsquo;t just about hiring a person to focus on cleaning up data, but about revamping the entire process and software interface used to collect data.\nAs a manager, it can be tiresome to listen to Data Scientists complain about data quality, especially when the quality seems fine for everybody else. Just remember, what they might be trying to tell you is that with your current data, your goal of saving X million dollars using predictive maintenance is just an unattainable dream. If you aren\u0026rsquo;t prepared to invest in improving your data quality, it\u0026rsquo;s better if you know that today instead of waiting years for the program to fall flat on its face.\n","date":"20 May 2024","externalUrl":null,"permalink":"/posts/false-alarms-can-erase-the-value-of-your-predictive-maintenance-efforts/","section":"Posts","summary":"Introduction # People often underestimate how false alarms from predictive models can erase all the business value you set out to create from those models. In this post we will explore how label noise can drive up your service costs instead of decreasing them, and how a small fraction of incorrect predictions of impending machine issues can erase all the benefit created by correct predictions.","title":"False Alarms can Erase the Value of your Predictive Maintenance Efforts","type":"posts"},{"content":" Assumptions # When conceptualizing and implementing a predictive maintenance project, it can be hard to grasp the entire chain of people and technology needed for success. 
In this post I\u0026rsquo;ll try to break down all the people and teams needed for this type of project, and also provide insight into how these people work together.\nAn alternate title I had for this post was \u0026ldquo;How much is my predictive maintenance program really going to cost?\u0026rdquo; The goal is not to give a dollar amount, but really to understand how all the different people and pieces fit together so you can fully conceptualize the scope of one of these programs.\nLet\u0026rsquo;s start with the assumptions. This post is focused on projects that actually make it to \u0026ldquo;production\u0026rdquo;, which means they generate some real financial benefit. R\u0026amp;D work, proofs of concept, exploratory activities, etc. are all valuable, but they have working patterns that can vary quite a bit from what I\u0026rsquo;m describing here.\nThis post focuses on predictive maintenance projects addressing specific use cases, not general purpose solutions. A general purpose solution is something like an anomaly detection SaaS trained on a variety of machine data, but it needs to be customized for each use case. Developing and using this type of SaaS is different from what this post is addressing. I\u0026rsquo;ll provide an example of what I mean by \u0026ldquo;specific use case\u0026rdquo; soon.\nI\u0026rsquo;ve also pointed at a \u0026ldquo;Data Scientist\u0026rdquo; as the person who implements these projects, but more realistically it could be any person with data and software skills. Job titles mean different things at different companies and there\u0026rsquo;s nothing stopping a Data Analyst or Data Engineer from doing these things. So don\u0026rsquo;t take my using the words \u0026ldquo;Data Scientist\u0026rdquo; as some sort of gold standard for who needs to do this work.\nFor the sake of simplicity, let\u0026rsquo;s split these projects into two types: low and high complexity projects. \u0026ldquo;Complexity\u0026rdquo; refers to the number of people and the computing environment used to develop and deploy a model, and also to evaluate and act on Alarms (the model inference/predictions). Complexity is not referring to the complexity of the actual model.\nA low complexity project consists of a smaller group of people working on a focused use case, with a more manual process around evaluating and acting on alarms. A high complexity project has a lot more people and technology in the development of use cases and evaluating and acting on alarms. Again, it\u0026rsquo;s not about the actual model, but everything around the model.\nA Low Complexity Project # Let\u0026rsquo;s start with an example. A Subject Matter Expert (SME) identifies a machine issue (a failure mode) they\u0026rsquo;d like to predict, and proposes rules that could be used to predict this. They work with a Data Scientist to codify (e.g. using Python) and test those rules, and both of them work together to evaluate and adjust those rules to improve predictive performance. Once they are both happy with the results, the Data Scientist deploys the \u0026ldquo;production\u0026rdquo; script into a system that runs it once a day. After the script runs, it emails a list (also known as alarms) of all systems that violated those rules to the SME. When the SME gets the email, they evaluate the alarms by investigating the machine sensor data. For any systems with genuine issues, they manually create a service ticket for the relevant system so somebody can intervene.
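A minimal sketch of what such a production script often looks like, assuming an illustrative database, rule threshold, and email addresses (the real rule would come from the SME):

```python
import smtplib
from email.message import EmailMessage

import pandas as pd
import sqlalchemy

# Illustrative connection string, table, and rule threshold; in a real project
# these come from the SME and whatever database holds the machine data.
ENGINE = sqlalchemy.create_engine("postgresql://readonly@datahost/machines")
RECIPIENT = "sme@example.com"

def find_alarms() -> pd.DataFrame:
    """Apply the SME's rule: flag machines whose average motor temperature
    over the last 24 hours exceeds 80 degrees C."""
    query = """
        SELECT machine_id, AVG(motor_temp_c) AS avg_temp
        FROM sensor_readings
        WHERE reading_time > NOW() - INTERVAL '24 hours'
        GROUP BY machine_id
        HAVING AVG(motor_temp_c) > 80
    """
    return pd.read_sql(query, ENGINE)

def email_alarms(alarms: pd.DataFrame) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"Daily predictive maintenance alarms: {len(alarms)} machines flagged"
    msg["From"] = "alarm-script@example.com"
    msg["To"] = RECIPIENT
    msg.set_content(alarms.to_string(index=False))
    with smtplib.SMTP("mailhost.example.com") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    # Typically scheduled once a day with cron or the OS task scheduler.
    alarms = find_alarms()
    if not alarms.empty:
        email_alarms(alarms)
```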
For any systems with genuine issues, they manually create a service ticket for the relevant system so somebody can intervene.\nThere are a few points I\u0026rsquo;d like to highlight.\nThe \u0026ldquo;deployment\u0026rdquo; system could be as simple as a laptop running the script once a day using the operating system scheduler. Granted, with cloud services everywhere, a laptop is less likely than the cloud equivalent (e.g. a small cloud VM). The \u0026ldquo;production\u0026rdquo; script basically fetches some data (from a database, from a file), does something with this data, and sends an email. It\u0026rsquo;s not necessarily built with software engineering or DevOps best practices, it may have no logging, and it may not be very tolerant of failure. The impact of the script failing is low. If the laptop turns off and the script doesn\u0026rsquo;t run for a few days, only the SME will notice and care. So there\u0026rsquo;s limited need to harden the script and the deployment environment to prevent these types of issues. The \u0026ldquo;monitoring\u0026rdquo; is done by the SME, and they ensure bad alarms don\u0026rsquo;t reach the service personnel. If the SME goes on holiday for two weeks and nobody looks at the alarms, it doesn\u0026rsquo;t result in an organizational or contractual issue. The complexity of this project is low because the organizational overhead is low. The entire process fits on a single slide, and the entire project is executed by two people. Any support needed from other people is minor (e.g. somebody in an IT role needs to allow the script to send emails).\nScaling This # One major challenge is this entire process being bottlenecked by these two people. A few sick days means alarms are no longer checked or tracked. If either of them has other job responsibilities taking up a lot of mental space, they may not remember, or have the time, to check the alarms or whether the script is still running.\nOnce you get past a handful of use cases, it\u0026rsquo;s difficult to sustain this low complexity model. If nothing else, the Service organization is going to ask why SMEs are suddenly creating service tickets based on a \u0026ldquo;predictive model\u0026rdquo;, and will also ask for a business justification for this new way of working. The SME\u0026rsquo;s management structure will need to step in. When two different parts of an organization need to work together, a process will need to be defined.\nUnless your goal is to stay small, trying to scale and resolve these types of issues is what changes this from a low complexity project to a high complexity one. To scale, you\u0026rsquo;ll need to bring in other people and other groups and find a way for them to work together.\nA High Complexity Project/Program # Here\u0026rsquo;s an example reflecting a more complex situation. A team of SMEs come up with failure modes they\u0026rsquo;d like to predict, and each of these failure modes is given a business value to help prioritize which to work on. A team of Data people work with them to build resilient data pipelines and rules/models. As part of the development process, every project is tested for a fixed period of time, and the quality of alarms is summarized using predefined metrics. Only the models passing a defined target can then be moved to a standardized and supported production system. To move the project to production, there are requirements the code has to meet for logging, input/output, and resilience to upstream/downstream issues.
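As a rough sketch of what those production requirements can translate to in code, the same data fetch might gain logging, retries, and input validation so an upstream hiccup does not silently produce missing or incorrect alarms. The retry policy, logger name, and column names below are illustrative assumptions, not a standard.

```python
import logging
import time

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("alarm_job")

def fetch_daily_summary(path: str, retries: int = 3, wait_s: int = 300) -> pd.DataFrame:
    """Fetch the input data, retrying so a transient upstream outage does not kill the run."""
    for attempt in range(1, retries + 1):
        try:
            data = pd.read_csv(path)
            log.info("Fetched %d rows on attempt %d", len(data), attempt)
            return data
        except OSError as err:
            log.warning("Fetch failed on attempt %d: %s", attempt, err)
            time.sleep(wait_s)
    raise RuntimeError("Upstream data unavailable; no alarms generated for this run")

def validate(data: pd.DataFrame) -> pd.DataFrame:
    """Fail loudly on malformed input instead of quietly producing bad alarms."""
    required = {"system_id", "pressure_drop_bar", "pump_temp_c"}
    missing = required - set(data.columns)
    if missing:
        raise ValueError(f"Input is missing required columns: {missing}")
    return data.dropna(subset=sorted(required))
```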
Once the project is in production, the alarms go directly to another business team who monitor and act on them. The data, SMEs, and business teams are also all involved in monitoring alarm quality over time to catch any changes in behavior.\nFor the sake of clarity, let\u0026rsquo;s define all the teams of people above.\nThe SME Team: The fundamental goal of this team is to come up with ideas or use cases, and to provide support when the Data team builds solutions. This is typically a cross-functional team and they may only do this part time (e.g. 90% of their job is something else). So from an organizational perspective, they report to different managers and have different overall goals for their jobs. They may have mixed backgrounds and motivations and aren\u0026rsquo;t always aligned with each other.\nFor example, one member of this team could be from Customer Service, with a focus on issues directly affecting customers. Another person could be from Engineering, and they are focused on hardware issues causing machine damage, even if the occurrence of these issues is low. A third person might be very finance focused and interested in issues which cost the company the most money, regardless of customer impact. Keep in mind a company can have multiple products, so you might have multiple SME teams competing to get their voices heard and use cases prioritized.\nThe Data Team: This team will have three core functions. 1). Doing data things such as building data pipelines, analyzing data, building models, etc. 2). Building/Maintaining a development and production infrastructure. 3). Ensuring the outputs of the previous two items are useful to other teams. This could include dashboards so alarms can quickly be checked and validated against other data (e.g. machine sensor data), an app to track and handle alarms, documentation, etc.\nPart of this is working with other teams who own the upstream systems (e.g. the raw machine data), and also the downstream systems (e.g. the service ticketing system) so all the production infrastructure can talk to these systems in an efficient and standardized way. The infrastructure also needs to be resilient to failures, so any upstream and downstream issues do not result in the production system crashing and losing alarms, or worse, generating incorrect alarms.\nThe Service and Alarm team: After the alarms are generated, someone has to do something with them. This could be a single specialized team, or an entire global organization with regional teams, different IT systems, and different operational procedures. The people on this team could be call center personnel, onsite technicians, service SMEs (e.g. technical experts), or managers who are looking at the business side of things. This team defines what should be done with those alarms when they get them, and they define what they need from the other teams so they can take effective and prompt action. The Service team also monitors the alarm quality in the short and long term to ensure they are not wasting time and money on bad alarms.\nGoverning Team: This team is a necessary evil to ensure the teams above are aligned and all working towards the same goal. I say necessary evil because a Governing team will slow things down in the short-term and make it harder to take fast decisions. At the same time, a good Governing team will help ensure the project doesn\u0026rsquo;t go in the wrong direction for eighteen months and then get shut down.
With so many cross-functional teams, it\u0026rsquo;s almost inevitable people will focus on their own goals, even if those goals are contradictory to the overall project being successful. This team defines standards and procedures to follow, and they enforce a quality bar ensuring everybody knows what the targets are.\nUnfortunately, you can also have a bad Governing team which creates misery in the short-term and failure in the longer-term. So having a Governing team is not automatically a solution to cross-functional drama, and they can make things worse, resulting in a lot of passive-aggressive behavior and the inevitable failure of the project.\nHidden Teams: There are also other teams involved in this process that aren\u0026rsquo;t always at the forefront, but they are critical to its success. An obvious example is the people who maintain all the IT systems (e.g. Databases, ERP/Ticketing, Dashboard/BI). They are the ones who built these systems in the first place, and they are the ones who fix issues and implement changes. You also have Data Engineering teams, who build pipelines to ingest raw data and make it usable to other teams. Software Vendors also hide behind all of this. A documentation team may also be needed to take what is created by these individual teams and expose it to the wider world. And if you are selling predictive maintenance as a service, don\u0026rsquo;t forget sales and marketing.\nHow These People and Teams Work Together # SMEs and the Data Team: # Typically the people who spend the most time working together are SMEs and Data Scientists. This is because these are the ones who conceptualize and build a solution. At a bare minimum, SMEs have the domain knowledge to identify, support, and justify a predictive maintenance use case. And a Data Scientist has the machine learning and software skills to figure out how, and whether it\u0026rsquo;s possible, to build a solution they can apply to a fleet of machines.\nHere\u0026rsquo;s an example of how they work together. An SME is tasked with reducing unplanned downtime on customer machines. They look at the historical service data and identify the top 30 issues leading to the most downtime. They combine this with their domain expertise (e.g. knowledge of the machine hardware) to filter this list down to issues where it makes sense to try and proactively address them, and which they suspect could be predicted using a model. Based on some further business and financial criteria, these use cases are prioritized.\nThis prioritized set of use cases is typically where the Data Scientist starts. They work directly with the SMEs, who provide all the knowledge about the use case, the machine, and the parts/components and how they are related to the failure to be predicted. This can include information about the software logic controlling the machine and generating the machine data. Working together, they also find examples in the historical data of what is supposed to be predicted. Using this information, the data person tries to build a model that can be applied to every machine in the field. It\u0026rsquo;s also the Data Scientist\u0026rsquo;s responsibility to evaluate the predictive quality of their models, and validate these quality results with the SMEs.\nIn my experience, what I\u0026rsquo;ve described in the last two paragraphs takes up the bulk of the time with any predictive maintenance project. Unfortunately, this is also the most unpredictable from a time perspective.
I\u0026rsquo;ve been involved in projects which went from first conversation to production in 5 weeks, then a similar use case on a different part of the machine went in circles for months. This back-and-forth between the SMEs and Data Scientist is fundamentally applied research, not engineering. Some unpredictability is expected, and I think the best way to deal with it is to set time limits. After a predefined amount of time (e.g. three weeks), you move to a different use case or provide a very solid justification of why you need to keep working on the current one.\nDevelopment Team and Deployment Team # Once the SME and Data person agree they have a good model, it needs to be deployed. There are different ways to do this. At some companies the model development and deployment teams are distinct, and the development team hands it off to the deployment team, who puts it into production. In others, the developers take their development code and make any needed changes so it can be moved to production.\nHowever people choose to do it, there should be a defined process stating guidelines, requirements, and expectations on a software/code and business level (e.g. what needs to be provided to the documentation team, any required approvals, etc.). Moving the code to production should take a lot fewer people hours than the development of the model.\nData Team and the Alarm/Service Team: # This relationship is actually the most important one for the Data team\u0026rsquo;s success. Alarms are useless unless somebody effectively acts on them. Even if a model generates perfect alarms, the value is zero (or negative!) unless somebody does something with them. On the other side, if the predictive quality of the alarms is mediocre, the Service team will lose faith in them and ignore them. While it\u0026rsquo;s easy to think the SME team is the customer because they propose the use cases and approve the models, the customer is actually the Service team because they use what has been built. If your customer is unhappy, they won\u0026rsquo;t act on the alarms. If that happens, the Data team has failed.\nThe working relationship between these two teams is oriented around a defined process of what the Service team needs to successfully act on an alarm. For example, they might say every model/alarm needs to come with a predefined set of diagnostic and action steps a service technician needs to do. This could also include educational training so the service personnel understands the purpose of an alarm and what to do with the predefined action steps. It\u0026rsquo;s then the responsibility of the Data team, SME team, and whomever else, to make sure these deliverables are created before a model goes into production. If the Data team wants to succeed, it\u0026rsquo;s their responsibility to ensure the process steps, deliverables, and information given to the Service team enable them to successfully act on the alarms.\nThe technical and business processes between these two teams should be defined upfront and should be revisited every 3-6 months. This should include a quality review of the alarms from the perspective of the Service team. Waiting longer than 3-6 months means any flaws in this process risk becoming festering wounds. So it\u0026rsquo;s important to bring these issues to light quickly.\nWhen things are going well, there should be far less interaction between the Data and Service team when compared to the Data and SME team.
The \u0026ldquo;normal\u0026rdquo; working interaction should be following the existing process steps as the Data team moves something to production and the Service personnel take over. Divergences from this should typically only occur if there is an issue with the process and the teams are working together to fix it.\nGoverning Team and Everybody Else: # In theory, the Governing team has two broad goals.\nDefining how all these teams are going to work together. Defining guidelines and processes so the overall project can succeed. To perform those two points successfully, there should be success criteria, such as quality metrics for models/alarms and what quality level is acceptable for a production model. They should also define things like the minimum requirements for the financial and business viability of use cases. The Governing team should meet once a quarter to discuss all these topics, and have an open forum so people can voice process concerns. This feedback should be used to fix issues to ensure projects are successful.\nA Governing team should also be able to recognize, or at least acknowledge, the size and importance of a problem and the time and organizational investment required to address it. It\u0026rsquo;s easy to see small problems as big ones due to a loud voice, or big problems as small ones because nobody understands the scope of it. An example of this is the Governing body asking a cross-functional team to define a quality metric to decide when a model is acceptable for production. This seems like a small task, but with different incentives, politics, and different perceptions about data quality across organizational boundaries, this seemingly small task can turn into prolonged war. It\u0026rsquo;s important for the Governing body to be able to step in and recognize this is an important problem they need to help solve, and not just treat it as a small ignorable side-effect of corporate politics.\nIn my view, the Governing team should meet no more than once a month to review the current working process and to allow people to formally raise concerns. If there is an issue requiring further work, a working group (who meets more often in the short-term) can be formed to address it. If the Governance team is meeting very often (e.g. every week) for an extended period, your process is broken. The whole point is for them to come up with a process for the different teams to effectively work together without constant intervention. If the Governance team members need to meet with the other teams every week to deal with process issues, it means your process isn\u0026rsquo;t working and the process for coming up with your governing framework isn\u0026rsquo;t working either. It\u0026rsquo;s okay to have a two week recurring meeting when you first start a predictive maintenance program and you need to make sure the framework is right. But if you are still meeting every two weeks a year later, maybe your governance team needs their own governance team.\nHidden teams and Everybody Else: # In my experience, the interaction between the Hidden and Data teams is very task oriented. For example, if there is an authentication issue on a cloud cluster, a ticket is filed and somebody from the Cloud team supports this. Once that issue is resolved, the interaction on the task stops. The duration of these interactions depends on the size of the issue. 
For example, an issue requiring a longer fix is when the documentation team needs to update the action plan for every single model deployed in the last three years. This interaction will last longer than smaller issues, but it still stops once the task is complete. In the grand scheme of things, the interaction with these Hidden teams is a small fraction of the total interaction with the other teams.\nPricing # Let\u0026rsquo;s now go back to how I started this post. To accurately estimate how much a predictive maintenance program costs, you should take the following things into account.\nThe full or fractional salaries of SMEs (maybe teams of SMEs) Data People Service Team Other teams needed to support the effort To calculate the fractional salaries, you\u0026rsquo;ll need to account for time spent in Meetings internal to each team Meetings between teams on short-term topics (e.g. developing a use case) Alignment meetings and Governing body meetings for longer-term topics Time spent following processes, completing templates, and other annoying things of this nature Support Costs, which include Fixing issues with old models Updating existing models due to data changes Updating infrastructure code to deal with upstream and downstream changes Responding to support tickets (e.g. a broken dashboard or alarm that seems weird) Infrastructure Costs, which is the compute/storage and/or hardware/software costs of running all of this Just to be clear, when I say \u0026ldquo;estimate the cost\u0026rdquo;, I don\u0026rsquo;t literally mean you sit down with a spreadsheet and insert items like \u0026ldquo;Person X = hourly salary * 2 hours of meetings + \u0026hellip;\u0026rdquo; This is simply an order of magnitude analysis to see if all of this makes sense. If you have a low complexity project setup, your team is two people, and your infrastructure costs are $10/month, then having a use case that saves $500k a year makes sense. For a high complexity project, saving $1 million a year doesn\u0026rsquo;t make sense. When you have a program with lots of teams and people, even saving $5 million a year might not be worth it.\nBad Success Criteria will Kill Your Program # For any predictive maintenance projects to succeed, there needs to be a well defined success criteria so everybody understands the goal. If your success criteria is not clear or is something vague and long term (e.g. the total target cost savings over two years is X million dollars), individual teams are going to interpret this in a way favoring their own metrics, and a lot time is going to be wasted in constant realignment.\nA concrete success criteria will be based on the actual service and alarm data. This will enable you to properly estimate your costs and cost savings for each alarm. It should have points like this. Before Developing a Model\nTo be approved for development, each use case needs to clearly show at least $75k in savings This needs to be shown from a time and materials perspective What is the dollar value of a good alarm (true positive). In other words, when the model correctly predicts an future issue, what is the cost savings of proactively addressing the issue? What is the dollar value of a bad alarm (false positive). In other words, when the model predicted an issue would occur but it did not, what would it cost us (e.g. the Service team) to perform an unnecessary service? What estimated threshold of good to bad alarms is acceptable? 
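To make those three questions concrete, here is a small sketch of the arithmetic they imply. The dollar values are hypothetical and would come from the Service team\u0026rsquo;s own time and materials data.

```python
# Hypothetical values: a good alarm (true positive) avoids $4,000 of downtime and
# repair costs, while a bad alarm (false positive) burns $1,500 of unnecessary work.
good_alarm_value = 4_000
bad_alarm_cost = 1_500

def net_value(n_alarms: int, good_share: float) -> float:
    """Net savings for a batch of alarms at a given share of good alarms."""
    good = n_alarms * good_share
    bad = n_alarms * (1 - good_share)
    return good * good_alarm_value - bad * bad_alarm_cost

print(net_value(100, 0.50))  # 100 alarms, half of them good: 125,000 net savings

# The share of good alarms at which this use case only breaks even.
break_even_share = bad_alarm_cost / (good_alarm_value + bad_alarm_cost)
print(round(break_even_share, 2))  # ~0.27: below 27% good alarms, the model loses money
```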
For example, if the only way for a use case to financially work is to have 97% good alarms and only 3% bad alarms, I can almost guarantee you will not hit it. Service data tends to be messy, and it\u0026rsquo;s extremely difficult to reach that without extremely clean or manipulated data. Do all the teams agree these estimates are accurate? After Developing a Model (Proposing it for Production) Back testing over six months of historical data, add up the cost savings from the good alarms and the costs from the bad alarms. Are you seeing the expected savings? Do all the teams agree these estimates are accurate? Notice I\u0026rsquo;ve included \u0026ldquo;Do all the teams agree these estimates are accurate?\u0026rdquo; both before and after the model is developed. This is because different teams can have wildly different interpretations of value, even when they are looking at the same data. I\u0026rsquo;ve seen cases where a Service team has looked at a use case with a seemingly good cost savings and said \u0026ldquo;Due to the time required to fix this, we don\u0026rsquo;t have the personnel to proactively fix this issue. We will only fix this during quarterly maintenance when we take the machine apart anyway.\u0026rdquo; Insights like this can completely change the value of a use case. It\u0026rsquo;s important for the teams to actually agree on the value before anything is built.\nA bad governance structure can be a big hinderance in this step. In my experience, Governance team members tend to be managers or people with managerial like roles. This means their jobs force them to focus on KPIs and direction, and less about details of those KPIs. They then delegate the details (e.g. how do we estimate the value of a use case) to sub-teams, and inter-team politics takes over. It\u0026rsquo;s really important for the Governance team to ensure all teams come up with a measurable success criteria everybody agrees to, and they step in when that criteria needs to be updated.\nI also think it\u0026rsquo;s really important for an honest data person to be part of defining the success criteria. A data person is the only one who can give you an accurate picture of what can truly be achieved based on what they are really seeing in your data. If you don\u0026rsquo;t have this type of data person, two or three pilot projects might have to be done so somebody can get the experience needed to provide an informed opinion on the success criteria.\nDoing pilot projects to establish a baseline success criteria might not be appealing from a time and costs perspective, but it will lead to more successful projects in the future, better morale (nobody is motivated to work on projects with bad success criteria), and less wasted time in \u0026ldquo;alignment.\u0026rdquo;\nHow Does all of this Apply to a Low Complexity Project? # Since a low complexity project has far fewer people overall and maybe only a single small cross-functional team, a lot of what is stated above isn\u0026rsquo;t relevant. What is important is the success criteria. Even if a project only has two people, it\u0026rsquo;s important for them to agree on a target. This is the only way for both of them to know what they are striving for. If, after meeting four times, the success criteria is \u0026ldquo;Well we could build a model to detect this, or maybe this other thing, and there\u0026rsquo;s a third thing maybe\u0026rdquo; you actually have no success criteria.\nDefining a success criteria doesn\u0026rsquo;t mean you cannot pivot. 
Sometimes you spend two months trying to predict a particular issue and realize you cannot predict that issue, but you learned you can build a model for something else. That\u0026rsquo;s okay and you should pivot. Having a defined success criteria also doesn\u0026rsquo;t mean you can\u0026rsquo;t refine it. If you start with \u0026ldquo;Our predictions need to be 90% accurate\u0026rdquo; but you realize only 80% is achievable and still a financial win, it\u0026rsquo;s okay to shift things.\nIt\u0026rsquo;s also important to recognize that while alignment across teams is not an issue when you have one team, agreement with management is still important. SMEs and Data Scientists don\u0026rsquo;t randomly do projects together just because they can. Typically a manager or somebody else approved the project and reviewed the financial and time costs and goals. So it\u0026rsquo;s important to at least chat with this manager about your progress and success criteria, and get feedback, especially if you pivot. The last thing you want is your success to become a failure because they expected you to do something different in comparison to the work you actually did.\nMiserable Data Teams # This is a tangential topic, but this post is the right place for it. If you talk to people in Data teams (e.g. Data Scientists and Data Analysts), you might meet frustrated people. Online forums are filled with data people who express a lot of frustration at their data jobs, usually tied to misaligned expectations and not getting the support they feel they need. I\u0026rsquo;m not going to try to establish a single cause for this, but I would like to point out something.\nData teams are the ones who see and feel all the dysfunction arising from different teams of people working together when they all have different goals/metrics, and are sometimes incentivized in opposing ways. This is in contrast to people who only see the frustrations in their own org, and are somewhat isolated from the chaos around them.\nAn example here might add clarity to this. At one point I had a job in a customer facing role at a company with a lot of different products. For reasons not worth talking about, one of the SaaS products my team supported was struggling in the market. Our entire team was miserable because we spent our days with product issues, frustrated customers, and basically ping-ponging between different internal groups (e.g. sales, product management, engineering management) trying to convince them to change direction, because we could see that with all the opposing forces and goals of each group, the customer was forgotten.\nOur state of misery was in complete contrast to developers in the Engineering org. We\u0026rsquo;d complain about an API making no sense when used the way they expected customers to use it. They\u0026rsquo;d brag about how the API could handle X million requests per second, how they provided seamless authentication, and how they solved all these interesting technical problems to do this. They had no idea nearly zero customers had been willing to pay for the service after trying it. The developers were buried down at the bottom of a large Engineering org, and to them they were hitting their goals and they were happy.\nMany Data people see problems across organizations. You can be involved in a project spanning three orgs, and realize poor alignment across these orgs means this project is going to fail. But you still have to work on it.
And if you do something to make one org happy, the other two are miserable and you are the face of that. The data person becomes the focal point of all this dysfunction because they are the only person who understands how all the pieces are supposed to fit together.\nThe flipside of this is that Data people can have a unique view of the organization and what is needed to move things in the right direction. Some Data people are exposed to everything from analyzing individual rows of data to looking at organizational level metrics, targets, and motivations. They have what is equivalent to a sociological understanding of what is really going on. They can map how the micro is supposed to feed into the macro, and why things are not working. With the right support, they can help inform leadership about what\u0026rsquo;s broken and needs to be fixed.\nWrap Up # The process different teams use to work together will define the success of your predictive maintenance efforts. And it\u0026rsquo;s only when you have a good success criteria that you can define a good process. The best path to the wrong destination is still the wrong path, and if you can\u0026rsquo;t recognize when you have bad success criteria, you\u0026rsquo;ll never be able to understand why your process keeps leading to failures.\nI\u0026rsquo;ve worked on, and been involved in, a lot of predictive maintenance projects, and very few have failed because of a technical decision or the choice to use a simpler algorithm versus a more complex one. If a lot of your projects are not showing any value after they make it to production, or the majority of your projects never make it to production, I\u0026rsquo;d really recommend you step back and investigate your process and where things are going wrong. Somebody probably knows what the issues are, but what they are trying to say is getting lost in corporate noise.\n","date":"15 April 2024","externalUrl":null,"permalink":"/posts/the-anatomy-of-a-predictive-maintenance-project/","section":"Posts","summary":"When conceptualizing and implementing a predictive maintenance project, it can be hard to grasp the entire chain of people and technology needed for success.  In this post I’ll try to break down all the people and teams needed for this type of project, and also provide insight into how these people work together.","title":"The Anatomy of a Predictive Maintenance Project","type":"posts"},{"content":" An Anecdote # One mistaken assumption I see people make with predictive maintenance programs is that the cost savings of many lower value projects will add up to a big number, so it\u0026rsquo;s worth putting in the effort. But sometimes saving a million dollars isn\u0026rsquo;t worth it if it costs you more than that to implement it. In this post we\u0026rsquo;ll look at four types of programs and understand why some of them aren\u0026rsquo;t worth doing because they have a negative return-on-investment (ROI).\nFirst, an anecdote. At one of my previous jobs at a very large company, the CEO announced a contest for people to come up with business proposals on how the company could save money. The CEO tasked SVPs and their teams to come up with ideas and estimate the cost savings of each.\nOur SVP got us together to discuss, and something he said completely shocked me. \u0026ldquo;If you\u0026rsquo;re going to propose something requiring a business org outside ours, it needs to save at least a million dollars.
If it\u0026rsquo;s less than that I don\u0026rsquo;t want to hear about it because we\u0026rsquo;ll spend more than that to implement it.\u0026rdquo;\nThis completely blew my mind at the time. Was he really saying this company was so inefficient that it would cost the company at least a million dollars of people-herding to try and save money? Was this how large businesses really worked?\nThis post is basically an extension of this anecdote, because anybody with work experience in a large company knows what the SVP said is true. In predictive maintenance, I\u0026rsquo;ve seen many instances where people assume once things scale, their predictive maintenance programs will be worth it. But friction between different parts of the company means the program never scales, and all you have is a lot of impressive PowerPoint slides with nice numbers, but the numbers never add up in the real world.\nPredictive Maintenance and Failure Modes # To make sure we\u0026rsquo;re all thinking about the same type of project, let\u0026rsquo;s define the use case(s) this post is referring to. As stated in the title, this is about predictive maintenance. And by predictive maintenance, we are referring to the idea that we want to predict when a machine fails in a defined way and mitigate the effects of this failure, also known as a failure mode.\nA failure mode is another way of saying a specific type of failure. For example, if a plastic hose cracks and leaks, the hose cracking is a specific type of failure mode. Note we can only predict certain types of failure modes. If somebody accidentally drops their phone into lava and the phone melts, the melting due to high heat is also a failure mode. But that\u0026rsquo;s a very atypical failure mode, and not something we are going to expect. A simple machine (e.g. an automated can opener) might only have a limited number of failure modes. A more complex machine might have thousands of parts and thousands of failure modes.\nThe cost of a failure mode is highly dependent on context. A leaking pipe might leak a bit of water onto the floor and not cause any short term machine issues at all. Or if it leaks onto a circuit board and causes an electrical failure, the entire machine can stop working and that is a very high impact failure mode. This context gives us the information we need to define the value and severity of each failure mode.\nTypically, but not always, predictive maintenance models are focused on predicting or detecting known failure modes. The reason for this is that if a model predicts a known failure mode, a subject matter expert knows what to look at to validate the issue and what to do to fix it. This greatly accelerates the mitigation or resolution of the impact of that failure mode occurring.\nAt least in my experience, the more general purpose use cases providing a generic \u0026ldquo;something is wrong with the machine\u0026rdquo; aren\u0026rsquo;t very useful from a business perspective. They are impressive in slides (AI based anomaly detection!!), but it is difficult to take any sort of concrete action based on that information. Imagine if your car had a \u0026ldquo;something is wrong with your car\u0026rdquo; notification, but no additional information beyond that. Would that be useful to you? Would you pay extra for that \u0026ldquo;Smart AI\u0026rdquo; feature? Probably not!\nThe Predictive Maintenance 4-Box # Now that we\u0026rsquo;ve defined failure mode and its relevance to predictive maintenance, let\u0026rsquo;s classify these failure modes on two axes.
First, a specific failure mode can be high or low business value. Business value is defined as Return On Investment, which means however you measure it (cost savings, increased sales, profits, etc), that is how much more value you generate than it cost to do it. Just for concreteness, let\u0026rsquo;s say a Low Business Value failure mode is $50K US Dollars or less in ROI. A High Business Value failure mode is $150K US Dollars or more in ROI. These numbers may be smaller or bigger depending on the company and use cases, so don\u0026rsquo;t take them as set in stone.\nA business can choose to focus on anything in the range of a handful of high value failure modes to a lot of lower value failure modes, or even a mix of both. Where they focus their efforts will determine how much of an ROI they can realize.\nTo create models for these failure modes, the effort required exists on a spectrum: some need a lot of manual work and customization, while others can be fully automated with little human intervention required. Most projects will probably be in the middle, requiring a mix of both. I\u0026rsquo;m going to talk more about how this impacts ROI soon.\nLet\u0026rsquo;s plot all of this in a 4-box. This chart might seem extremely obvious to many readers, but you\u0026rsquo;d be surprised how many teams and projects end up in the bottom half of this chart while assuming they are in the top half.\n%%{init: {\"themeVariables\": {\"quadrant3Fill\": \"#fe8487\"} }}%% quadrantChart title Is your Predictive Maintenance Program Worth Doing? x-axis Manual Model Development --\u003e Automated Model Development y-axis Low Business Value --\u003e High Business Value quadrant-1 $$$ Profit $$$ quadrant-2 Might be worth it quadrant-3 Don't Do it for the money quadrant-4 Might be worth it It\u0026rsquo;s important to recognize this chart is summarizing an entire predictive maintenance program, and not just a single failure mode. If you have twenty use cases and each of them has an ROI of $75K, the total ROI of your entire program is $1.5 million. On the other hand, if you have one use case with an ROI of $250K and five others, where each turns out to have an ROI of negative $50K, maybe it\u0026rsquo;s not worth it even to do the $250K one because it\u0026rsquo;ll cost you more than that to implement.\nI\u0026rsquo;m going to talk about each quadrant in more detail, but let\u0026rsquo;s take a tangent first. I think the challenges with automating things can be greatly underestimated, and it\u0026rsquo;s one of the major ways I\u0026rsquo;ve seen programs slowly slide into a negative ROI. So it\u0026rsquo;s worth it to clarify what automating model development entails, and where the challenges are.\nHow much Effort is needed to Develop a Model? # Manual and Automated Model Development is really about the people hours needed to build a model. A model might be anything from a simple set of rules to something complex with deep learning. However, model building is not just about the actual model. To build a good model, you need the right data and right context. And once the model is built, you need to integrate with other systems (e.g. a repair ticketing system) so customer facing people (like machine technicians) can use the model predictions/outputs to make decisions and do something.\nTo understand what is meant by manual vs automated model development, it\u0026rsquo;s useful to have a high level overview of the steps to build a useful model and deploy it into a production setting.
In the list below I\u0026rsquo;ve bolded the steps that can resist all forms of automation and be so time consuming there\u0026rsquo;s simply no way to have a positive ROI.\nCorrectly identifying the problem that needs to be solved and defining a solution with a positive ROI Data work Data curation (finding the right data that identifies the problem) Data cleaning and preparation Building a data pipeline Building a model Deploying the model and generating output/predictions Integrating with external systems Training and working with the people who perform actions based on the model output The first point might surprise people, but I\u0026rsquo;ve frequently seen situations where people can describe a high level problem with high ROI potential, but once it\u0026rsquo;s broken down into a tractable problem it loses most of the ROI.\nFor example, let\u0026rsquo;s say the problem is \u0026ldquo;low fluid pressure.\u0026rdquo; When trying to turn this problem into something that can be solved using data and a model, it turns out there are multiple causes and fixes ranging from changing a leaky pipe (easy quick fix) to replacing a major component of the fluid system (a harder fix that takes much more time). What was a single problem with a previously idealized single solution has now turned into multiple solutions. Multiple solutions may require multiple models and those require all the people-hours required to build them. There is also the possibility some of those models don\u0026rsquo;t work out because of poor performance, resulting in a further decrease of the overall ROI, or even pushing the use case into a negative ROI.\nThe second point (about Data) in the list is really about ensuring a clear mapping between the datapoints being collected and the problem. First, you need to be able to clearly identify which datapoints (out of all data being collected) provide the signal needed to detect the problem. Second, you need to be able to clearly identify when the issue is occurring (abnormal operation) and when it\u0026rsquo;s not occurring (normal operation). This data mapping can be very difficult to scale, meaning for every problem you are trying to capture and predict, the data mapping has to be done manually and with a lot of back-and-forth with different SMEs.\nThe last point is all about the human processes downstream of the output of a model. A prediction from an model indicating an issue might occur isn\u0026rsquo;t useful in itself. There has to be a process for that prediction to be turned into a concrete set of actions people do to diagnose and fix (if needed) the machine. That means all the downstream systems need to be able to process the model output, an action plan based on that output needs to be well defined, and the personnel acting on that plan need to be correctly trained to perform it. Any flaw in this process can result in incorrect or unnecessary actions or misdiagnosis, driving up costs instead of reducing them. If every action plan for every model has to go through an iterative process to create and get it right, that creates more manual work which drives up costs.\nAutomation And Scaling # The primary value of automation comes from being able to increase your ROI by scaling what you have built. There are two ways of scaling relevant to this post. The first is having a model with enough flexibility to deploy across a fleet of machines. This type of scaling works well because you can achieve a high ROI when you add up the total cost savings from each machine. 
One of the benefits of this type of scaling is since you are focused on one model/failure mode, it\u0026rsquo;s easier to fail fast and recognize when a model can\u0026rsquo;t be built. For whatever reason, there seems to be less ego in acknowledging failure here since you are only letting go of a single use case.\nThe second way to scale is building models for multiple failure modes. With this, the value comes from being able to add up the cost savings from the solution to each failure mode. Even if the ROI for each individual solution is lower, the total ROI is higher if the models can be built and deployed without a lot of manual work. In my experience this is where, unfortunately, a lot of people misjudge how much they can automate the development of each solution for each failure mode. And since the plan was for multiple failure modes to combine into acceptable ROI, it takes much longer for people to recognize and admit the numbers just don\u0026rsquo;t add up. I\u0026rsquo;ll talk more about this in the next section.\nExploring the Quadrants # Now that we\u0026rsquo;ve defined the axes terms and we have an understanding of scaling and how manual work can drive up costs, let\u0026rsquo;s discuss the different quadrants.\nHigh Business Value and Automatable # In this quadrant a typical use case is a high value fault mode occurring often enough that building an automated solution lets you scale the cost savings across machines. The \u0026ldquo;automated\u0026rdquo; part of this doesn\u0026rsquo;t have to be the development of the initial model, but it\u0026rsquo;s more about building variations of the model (e.g. for machines running in different environmental conditions), model retraining, etc.\nAn example is seen with oil and gas equipment deep underground. Simply accessing the machine underground, let alone repairing or replacing it, can cost a lot. Let\u0026rsquo;s say there is an issue resulting in a broken machine, which costs $150k to repair/replace, which includes lost production costs due to downtime. You have 500 machines worldwide, and in the past you\u0026rsquo;ve seen an average of 1 of these fault modes a year per machine. An intervention can be done to mitigate this, but the complex set of conditions causing the fault need to be accurately detected for the mitigation to work. Since the machines are running in different parts of the world, the environmental (e.g. the ground is mostly sand versus a lot of mud) conditions are different and a different model is required for each set of operating conditions.\nIt doesn\u0026rsquo;t take a lot of math to see even if an early detection model only accurately catches 50% of future machine failures, that still leaves a large margin to invest in building and operating a solution with a high ROI. And if more machines are added, the cost savings automatically scales.\nOne question people could ask about the quadrant is \u0026ldquo;I have ten failure modes that theoretically add up to a high ROI. Would that fit into this quadrant?\u0026rdquo; This question can only be answered after you start implementing the solution(s). If you implement the first two manually, and then use that knowledge to automate the implementation of the next 8, and if your ROI numbers turn out to be accurate, then yes, you are in this quadrant. The risk with 10 different failure modes is that there are many opportunities where your ROI expectations are incorrect or the automation simply doesn\u0026rsquo;t work. 
With ten failure modes it will take you a lot longer to determine if you can build the solution(s), as compared to a single high ROI fault mode where you are likely to figure that out much faster.\nHigh Business Value and Manual Model Development (Resists Automation) # There is a type of problem that is very high value, but developing an automated solution is difficult, resulting in no opportunity to scale. This can happen because you have a very small fleet of specialized machines and there\u0026rsquo;s limited opportunity to multiply the cost savings of each machine. This can also happen if the data is very dirty and requires a lot of human provided context to clean, providing limited ways to automate and share this cleaning process across use cases.\nThis can also occur from a low sample size of your problem. If there is a complex intermix of events causing your whole factory to shut down for a week, but it\u0026rsquo;s happened three times in the last year, it\u0026rsquo;s hard to have enough data to separate the abnormal from normal operations. You recognize solving this problem can save a lot of money, but the cost of solving it is completely unknown.\nYou can also be in this quadrant when the ROI shouldn\u0026rsquo;t be measured purely from a financial perspective. It could be a product safety issue resulting in injuries or death. Or it could be something that destroys the reputation of the company. Some programs in this quadrant are done to appease regulators or because the company got in trouble in the past.\nEven if a highly automated solution cannot be built, a possible alternative is a mix of human decision making being informed by models. A single or set of model(s) can provide insight into where to look, and pre-built data analysis tools (e.g. a highly interactive dashboard) provide the mechanism for how to look. Data Scientists/Analysts and Subject Matter Experts use these to unravel what is going on. That mix of brain and machine is what it takes to catch the problem before it happens.\nThis is just speculation on my part, but when I\u0026rsquo;ve seen businesses pursue this type of problem, the cost savings, or potential cost savings, is well over $1 Million. In other words, the problem is worth enough that you\u0026rsquo;re willing to tolerate a complex mix of pre-built tooling (the automation) and also allocating a team of people to investigate manually.\nLow Business Value and Automatable # In this quadrant, and also the bottom half of the square, you\u0026rsquo;ve identified a number of lower value fault modes which, when added together, have a positive ROI. In other words, a handful of these use cases on their own aren\u0026rsquo;t worth it, but a lot of them together make sense. This quadrant could also be called the big box retailer strategy, which is to make a small profit on each item sold and rely on high sales volume to add those small bits together to make a big profit.\nThe key to succeeding here is automating the model building process so the solution for each failure mode can be built quickly. To do this, there are two key things that have to be in place:\nWell curated data with good labels, which clearly indicates when the failure mode has occurred and when it has not occurred. A fast and repeatable process for translating SME knowledge into a solution (e.g. Data Engineering and Model code). This is essential for mapping the failure mode to the right data that identifies and precedes the failure.
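If both of those are in place, extracting training data for each new failure mode can be reduced to something like the sketch below. It is only illustrative: the table layout, column names, and the 48 hour window are hypothetical assumptions.

```python
import pandas as pd

def build_training_windows(sensor: pd.DataFrame, failures: pd.DataFrame,
                           window: str = "48h") -> pd.DataFrame:
    """Grab the sensor readings leading up to each labeled failure as positive examples.

    Assumes `sensor` has columns [system_id, timestamp, ...signals] with timestamp as a
    datetime, and `failures` has columns [system_id, failure_time, failure_mode].
    """
    examples = []
    for event in failures.itertuples():
        start = event.failure_time - pd.Timedelta(window)
        mask = (
            (sensor["system_id"] == event.system_id)
            & (sensor["timestamp"] >= start)
            & (sensor["timestamp"] < event.failure_time)
        )
        chunk = sensor.loc[mask].copy()
        chunk["label"] = 1
        chunk["failure_mode"] = event.failure_mode
        examples.append(chunk)
    return pd.concat(examples, ignore_index=True)
```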
If your data is well labeled and mapped, you can build ways to automate the data engineering for the training data. This means you can quickly identify the right data for all the failure modes you are building models for, and not get bogged down in a lot of manual data cleaning work along with excessive back and forth with SMEs. If your data is really bad, having the best SMEs in the world isn\u0026rsquo;t going to save you. A clear sign that you are in the next quadrant (you cannot automate development) is if you have spent six months working on solution(s) and you are still bogged down in a lot of manual data work to build the training data for each use case.\nRegarding the second point in the list above, it\u0026rsquo;s also common to be in a situation where once you start working on a use case, it becomes clear there is a lot of additional context needed that nobody initially realized was important. It can take weeks or months of discussions to uncover all this context because it takes that long to discover all the right questions to ask. It\u0026rsquo;s expected that for the first few use cases you implement, the process is going to take more time and have more friction. Initially you have to learn how to streamline the process of defining use cases, the communication strategy with SMEs/stakeholders, and identifying the right data. But if after the tenth use case this process is almost as manual and cumbersome as the first four, you should honestly assess whether the process can really be streamlined or whether you\u0026rsquo;ll never be able to automate it.\nSomething I\u0026rsquo;ve seen happen is that the technology parts of these processes get automated, but the human parts never get streamlined beyond people building more trust with each other and creating some PowerPoint/Excel templates for defining the use case. The end result is the infrastructure to develop and run these models in production and the integration with the service ticketing system is ready to go. But the parts where the use case and \u0026ldquo;win\u0026rdquo; is clearly defined, the data is clearly identified and well labeled, and service personnel are trained well enough to handle the alarms, never get streamlined. What you end up with is a system ready to scale, but it never scales because you cannot automate the hard parts. From there you fall directly into the next quadrant.\nLow Business Value and Manual Model Development (Resists Automation) # This is the quadrant I\u0026rsquo;m going to write the most about because I\u0026rsquo;ve seen many companies end up here and not really recognize (or admit) it, even if some individuals involved in the project realize they are here. This quadrant is also not limited to predictive maintenance or industrial machines. People in other industries will recognize it as well.\nIn this quadrant, you are unable to scale the development of your lower ROI use cases because of human and data issues. You never achieve your projected total ROI because you spend too much time and money developing solutions for each failure mode. In the end, your ROI is actually negative because you spend more money building models and dealing with issues than anything you actually save.\nLet\u0026rsquo;s look at some numbers. Assume each of your use cases generates a projected $50K cost savings. You\u0026rsquo;ve been working on these for a year with a team of two data scientists and one data infrastructure person.
It takes about two months for each use case to move into production.\nAfter the first year, you get ten use cases into production, meaning you recognize $500K in cost savings on some spreadsheet. Keep in mind it will take some time to see the actual cost savings materialize in the field, so this $500K is really just a forward looking estimate. This cost savings isn\u0026rsquo;t even going to pay the salaries/benefits of the three team members, let alone all the other infrastructure and support people needed to realize that cost savings.\nWhen you started the project you expected for the first two years the ROI will be negative, but after that the math will work out. In year two you complete ten more use cases. In year three, ten more. Now you are saving $1.5M, which looks a lot better. Plus, over three years, the number of machines out in the field is projected to increase, so this savings will be even higher.\nHowever, there are different reasons these cost savings numbers can be misleading. Let\u0026rsquo;s look at a few.\nLet\u0026rsquo;s talk about true and false positives. A true positive is when your model correctly predicts a failure mode is going to occur. You realize a cost savings when somebody acts on that prediction and reduces or removes the impact of that failure mode. A false positive is when this prediction is incorrect and that failure mode is not actually going to happen. Every false positive means a person spends time and money diagnosing or trying to fix a problem that doesn\u0026rsquo;t exist. The number of false positives scales with the number of machines in the field. More machines means more false positives, which means more costs. Every false positive costs money, and if the ROI wasn\u0026rsquo;t very high to begin with, the ROI becomes negative very quickly.\nAnother issue with false positives is at a certain point, dealing with them makes people distrust the entire system. If people don\u0026rsquo;t trust the Alarms, they won\u0026rsquo;t even trust or act on the true positives. If this happens, all the potential savings goes to zero.\nHow bad the \u0026ldquo;rate\u0026rdquo; of false positives feels is also affected by perception. If you have one false alarm out of ten alarms, it doesn\u0026rsquo;t feel bad. If you have ten false alarms out of hundred, or hundred out of a thousand, the percentage of false alarms is the same in all cases. But if you are the person acting on the alarms and dealing with the pain of checking ten or a hundred bad Alarms, your perception of the model is bad. So it\u0026rsquo;s not just about the percentage of bad alarms, but really how much impact these are having on your field personnel.\nAnother factor that can decrease your ROI is predictive accuracy degrading over time with any hardware (e.g. a frequently failing part is reengineered) or software changes with machines. The more models you have, the more effort is required to monitor them and make sure they are performing well. Perhaps your Data Science team targets to release ten new models a year and you do this the first year. But in the second year, your development slows down because you need to ensure those ten models continue to run well. The effort to review and maintain older models only increases over time as newly released models inevitably become old models. All of this will either slow down the rate of which you can address new use cases or you have to grow the team, which will increase the costs.\nThere is also an opportunity cost to a predictive service program. 
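A rough back-of-the-envelope sketch, using the hypothetical numbers above plus a few more assumptions (fully loaded salaries, a guessed false alarm rate, and ignoring when the savings actually materialize), shows how quickly these effects eat into the projected number.

```python
# All figures are assumptions for illustration, not measured values.
use_cases_per_year = 10
savings_per_use_case = 50_000     # projected annual savings per use case
fully_loaded_salary = 160_000     # per core team member, per year
team_size = 3                     # two data scientists + one infrastructure person
false_alarms_per_use_case = 8     # per year, grows with the size of the fleet
false_alarm_cost = 1_500          # unnecessary service work per false alarm

years = 3
cumulative_use_cases = [use_cases_per_year * y for y in range(1, years + 1)]  # 10, 20, 30

total_savings = sum(n * savings_per_use_case for n in cumulative_use_cases)
total_team_cost = fully_loaded_salary * team_size * years
total_false_alarm_cost = sum(n * false_alarms_per_use_case * false_alarm_cost
                             for n in cumulative_use_cases)

print(f"{total_savings:,}")           # 3,000,000 projected over three years
print(f"{total_team_cost:,}")         # 1,440,000 for the core team alone
print(f"{total_false_alarm_cost:,}")  # 720,000 spent chasing false alarms
```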
Beyond the core development team, there\u0026rsquo;s the time investment of all the other people (e.g. SMEs, Data Engineers, Service Engineers) to support this. They could have spent their time doing something else, so it\u0026rsquo;s important to account for their costs when justifying your program.\nAll of these issues (and many others) mean your original ROI estimates are very optimistic. They might assume a perfect system that identifies every issue without errors, and every issue is dealt with perfectly by the service personnel. That $1.5M estimate for the third year can get cut quite drastically and you might not even get a third of it. If you are a billion dollar business, is it really worth investing in a project providing $500K annual cost savings after three years? You could fire your Data Science team and save that money with much less effort.\nHow do people end up in this quadrant? What I\u0026rsquo;ve seen happen is a predictive service program being conceptualized with some number large (e.g. $10M/year) enough for upper management to take it seriously. Then, as the program is implemented, the actual savings is not well tracked and people refer to the original financial projections when asked. As employees and managers change, teams reorg, and priorities and ownership change, the line from that original number to today becomes even more blurry. After two years, nobody can clearly state where that original number came from and how it ties to the current state of things. But it\u0026rsquo;s still a target, so people stick their head in the sand and try to hit it.\nCompanies can also land in this quadrant when they need a predictive service program for strategic reasons. If your competition claims to do \u0026ldquo;predictive service\u0026rdquo; it\u0026rsquo;s going to be challenging for you to tell customers you don\u0026rsquo;t have it or you don\u0026rsquo;t believe in the value of it. Sensible strategy turns into dysfunction when the only way to justify this type of program is by making claims of a large financial advantage. I\u0026rsquo;ve seen this at larger companies, where the only way to get multiple departments (e.g. machine engineering, data science, customer service) to work together is to get buy-in from upper management and the top of the middle-management layer. A bold financial benefit claim is made and set as a target, resulting in a program which is doomed from the start.\nIf you are in this quadrant, getting out is tough. Simply dissolving or watering down the program and rebooting it a few years later with newer people and technology seems to happen often. As much as I dislike the nastiness that comes with doing this, if something is a big enough mess, then destroying and rebuilding it might be easier than trying to fix it.\nIf you do want to climb out of this quadrant, I\u0026rsquo;d recommend properly identifying the root cause(s) of the struggling program instead of helicoptering in new people to fix it and giving them deadlines not grounded in reality. They\u0026rsquo;re only going to start covering up the symptoms instead of understanding the root cause.\nWhat are the root causes of this type of program going off the rails? They can be quite varied, including\nDifferent stakeholders having different interpretations of how bad the data quality issues are and what the real impact is on reducing the ROI. 
Stakeholders having inconsistent definitions of business value, meaning one stakeholder thinks a problem is worth $100K, whereas another stakeholder feels this problem has negative ROI. Too much of a top-down approach, where the most \u0026ldquo;valuable\u0026rdquo; use cases that have to be implemented (nobody can say no) are defined in a very academic way, without a robust data analysis of actual field data to support those ROI values. Overly focusing on checking as many \u0026ldquo;we got it into production\u0026rdquo; boxes as possible, resulting in high ROI in a spreadsheet, but not in reality. Minimal overlap between use cases or fault modes, making it difficult to share knowledge or share development methodology/code, which reduces opportunities for automation and scaling. Every solution then becomes a unique snowflake. Notice all the points above are really about alignment between the different people involved in the program. Even the fourth point is really about making sure people understand use cases and the associated context need overlap to be scalable. A Data Scientist might understand, but people who are not data professionals might not realize the extent to which overlaps are needed to support automation in a data and software process. If different stakeholders aren\u0026rsquo;t aligned, even declared successes by one group can be perceived as failures by a different group.\nI\u0026rsquo;ll finish up this post by talking about what gave me the idea to write it. I\u0026rsquo;ve seen too many instances where a program failed and people pointed their finger at the symptoms instead of the cause. It\u0026rsquo;s easy to scapegoat the data science team, but they\u0026rsquo;re usually only the face of deeper organizational problems. I hope this post provides insight into some of the broader factors that drive success and failure in predictive maintenance programs, and you use them to better triage and diagnose any issues you are having. If nothing else, you know more about what types of projects to avoid.\n","date":"26 February 2024","externalUrl":null,"permalink":"/posts/is-your-predictive-maintenance-data-science-program-worth-it/","section":"Posts","summary":"One mistaken assumption I see people make with predictive maintenance programs is that the cost savings of many lower value projects will add up to a big number, so it’s worth putting in the effort.  But sometimes saving a million dollars isn’t worth it if it costs you more than that to implement it.  In this post we’ll look at four types of programs and understand why some of them aren’t worth doing because they have a negative return-on-investment (ROI)","title":"Is Your Predictive Maintenance Data Science Program Worth it?","type":"posts"},{"content":" Introduction # In my previous post, we looked at how stakeholders and organizational structures influence decisions about what data is collected and exposed for use. If you haven\u0026rsquo;t seen machine data before, that post might have felt a little abstract and some things might have felt contradictory (e.g. how does one look at the events before a failure without any sensor data?). For people who prefer concrete details and real examples, this post is for you.\nWhen I was involved in my first predictive maintenance project, I had a fairly naive view of machines and the business of deriving value from machine data. In my mind there was a machine that generated a lot of data and you analyzed that data. 
I didn\u0026rsquo;t really recognize or think about topics like\nWhat\u0026rsquo;s the bare minimum data actually needed to run this machine? What data is not needed, strictly speaking, to run the machine, but is necessary for better diagnostics? What additional data is required for predictive maintenance? What is the cost and effort required to get more data from the machine? Does the cost of a predictive model even make sense? Will the model cost more to build and operate than it saves the company? A good starting place is to understand how a machine works and what data can be expected from it. If we start from the basics, we can gradually build an understanding of how people define the requirements for what data should be collected, and how those requirements change as needs become more complex. This then connects to the previous post, where we see how different stakeholders influence this process.\nA Basic Machine and the Data it Generates # The Data Needed to Run a Basic Machine # To explore what data is needed to run a basic machine, let\u0026rsquo;s start with a very simple machine that cleans dirty beakers. The Clean Beakers machine is basically just a conveyor belt that holds beakers and moves them under some cleaning nozzles that spray cleaning fluids. Let\u0026rsquo;s imagine this machine operates in a factory that mixes chemicals in glass beakers, and the beakers are reused. The overall process of the factory can be seen in the diagram below.\nflowchart TB LB[Load Beakers] --\u003e CB[Clean Beakers] --\u003e MC[Mix Chemical in Beaker] --\u003e EB[Pour Chemical Out of Beakers] --\u003e LB style CB fill:#e69138 A person starts this process by loading beakers onto the conveyor belt. The beakers are first cleaned, and then moved to another machine that mixes some chemicals in these beakers. The mixed chemical is poured out, and then a person carries these dirty beakers back to the starting point. This process could also be automated. So instead of people, a big conveyor belt moves the beakers between the four stations (each box in the diagram is a station) in this loop.\nNow that we understand how this beaker cleaning machine operates in the larger factory process, let\u0026rsquo;s look at how the beakers are actually cleaned.\nflowchart TD BP[Beaker Arrives] --\u003e BS[Beaker Washed With Soap] --\u003e BW[Beaker Washed With Water] --\u003e BC[Beaker Clean] An upside down beaker (the mouth of the beaker is facing the floor) enters the machine on the conveyor belt. The conveyor belt takes the beaker and moves it over the soap nozzle. Once the soap is sprayed, the conveyor moves the beaker over the water nozzle and water is sprayed to rinse out the soap. The beaker is now clean and is moved on to the next machine where the beaker is flipped over and the chemicals are mixed.\nWhen this machine was being built and the requirements for what it should do were gathered, the engineering team(s) made decisions about what data this machine needed to run, and what data would be exposed to the outside world. Typically, the more basic the machine, the more basic the data requirements might be.\nIn the most minimal sense, what information or data would be needed for this cleaning machine to function? Let\u0026rsquo;s first define the way the machine works.\nThe machine consists of a loading area, a single soaping area, and a single water rinse area. There is a simple weight switch to detect if a beaker is present. No weight means no beaker is present, and some weight means a beaker is there.
The switch doesn\u0026rsquo;t actually measure the actual weight, it\u0026rsquo;s just an on/off or yes beaker/no beaker. Below the soap and wash areas, there is a nozzle. Each nozzle is connected to a valve that controls the water or soap being sprayed. The valve is either on or off, there is no in-between settings. There are a total of 3 switches. One switch where the beaker is loaded on the conveyor belt to detect when something is placed on the belt. One to detect when a beaker is above the soap nozzle. One to detect a beaker for the water nozzle. A simple motor moves the conveyor belt one step (e.g. from above the water nozzle to above the soap nozzle) at a time. Each step is of equal length. A very basic industrial computer (e.g. a PLC) that controls the machine based on inputs from the switches. The computer also controls the valves and the motor. Based on that, this is the logic of how the machine functions and how the computer controls it. The square boxes are actions performed by the machine/computer. The ovals and floating text just provide more information on what is going on or the decision.\nflowchart TB BP([Person Places Beaker On Conveyor]) --\u003e LAS[Check Switch in Loading Area] SS[Check Soap Nozzle Switch] SN[Spray Soap] WS[Check Wash Nozzle Switch] WN[Spray Water] LAS --\u003e |No Beaker| DN[Do Nothing] LAS --\u003e |Beaker Present| MC[Move Conveyor One Step] MC --\u003e SS SS --\u003e |No Beaker| DN2[Do Nothing] SS --\u003e |Beaker Present| SN --\u003e MC2[Move Conveyor One Step] MC2 --\u003e WS WS --\u003e |No Beaker| DN3[Do Nothing] WS --\u003e |Beaker Present| WN --\u003e MC3[Move Conveyor One Step] MC3 --\u003e BCF([Beaker Cleaning Finished]) --\u003e |Repeat Process| BP Another way to read this chart is to see what signals the computer receives and the decision made based on that signal. For example, let\u0026rsquo;s think about the \u0026ldquo;Check Switch in Loading Area\u0026rdquo; box. If somebody places a beaker there, the switch gets triggered and the \u0026ldquo;on\u0026rdquo; signal is sent to the computer. The computer then sends a signal to the conveyor motor to move it. Instead, if that switch was \u0026ldquo;off\u0026rdquo; because no beaker was there, the computer would do nothing. So the data sent to and from the computer only consists of different on and off signals.\nIf we were to record every event this machine does in order of what is done, it would look like this. This type of data is called Event data and is typically stored in an Event Log.\nEvent Time (hour:minute.second AM/PM) Switch On: Beaker Placed on Machine 1:00.00 PM Conveyor Moved 1:00.30 PM Switch On: Beaker Over Soap Nozzle 1:01.00 PM Soap Nozzle Activated 1:01.30 PM Conveyor Moved 1:02.30 PM Switch On: Beaker Over Water Nozzle 1:03.00 PM Water Nozzle Activated 1:03.30 PM Conveyor Moved 1:04.30 PM The steps above would be repeated over and over in the Event Log.\nThe computer might have some additional logic to make sure a switch is turned off after it\u0026rsquo;s been turned on and other logic to handle possible errors. But this is the basic logic required for this machine to run, and you can see the data we would get.\nKeep in mind that this data might not be available to anybody to take off the machine. The data you see above could be limited to the computer and no programming or storage was put into place for anybody to actually access this data. 
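To make the control logic above a little more tangible, here is a minimal Python sketch of that loop, producing the same kind of entries shown in the Event Log table. The station names, the simulated clock, and the thirty-second step time are hypothetical stand-ins for illustration, not the actual controller code.

from datetime import datetime, timedelta

# A tiny simulation of the beaker cleaning control loop (all names and timings are hypothetical).
event_log = []                              # list of (timestamp, event) tuples, i.e. the Event Log
clock = datetime(2024, 1, 1, 13, 0, 0)      # simulated clock starting at 1:00 PM

def log(event):
    global clock
    clock += timedelta(seconds=30)          # pretend every action takes 30 seconds
    event_log.append((clock, event))

# station -> is a beaker currently sitting on that switch?
beaker_on_switch = {"loading": True, "soap": False, "water": False}

def move_conveyor():
    # one step shifts whatever is on each switch to the next station
    beaker_on_switch["water"] = beaker_on_switch["soap"]
    beaker_on_switch["soap"] = beaker_on_switch["loading"]
    beaker_on_switch["loading"] = False
    log("Conveyor Moved")

# one pass of the control logic: check each switch and only act if a beaker is present
if beaker_on_switch["loading"]:
    log("Switch On: Beaker Placed on Machine")
    move_conveyor()
if beaker_on_switch["soap"]:
    log("Switch On: Beaker Over Soap Nozzle")
    log("Soap Nozzle Activated")
    move_conveyor()
if beaker_on_switch["water"]:
    log("Switch On: Beaker Over Water Nozzle")
    log("Water Nozzle Activated")
    move_conveyor()

for timestamp, event in event_log:
    print(timestamp.strftime("%I:%M:%S %p"), event)

Note that this sketch only covers the control loop itself. Whether any of these event records are stored anywhere, let alone exposed off the controller, is a separate decision, which is exactly the point made above.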
So you shouldn\u0026rsquo;t assume that if you plug your laptop into the machine\u0026rsquo;s computer you\u0026rsquo;d be able to see this data. Somebody has to write the software code to expose this data to the outside world. But for the sake of making this easier, let\u0026rsquo;s assume this machine stores this data somewhere and people can access it.\nDiagnosing an Issue Using Data # Now that we\u0026rsquo;ve described this basic machine, let\u0026rsquo;s move to the scenario of diagnosing an issue using data. One day somebody notices the beakers are not being cleaned properly. Let\u0026rsquo;s also say that due to the physical design of the machine, it\u0026rsquo;s hard to just look into it to see why the cleaning is not being done properly. So nobody can just watch it running and see the issue.\nIf somebody pulls 10 days of log data (described in the previous section) to investigate, all of it is going to look exactly the same. There\u0026rsquo;s no clear indicator of the machine running differently, when the issue started, or what the cause is. The log data tells us the machine is running and we can calculate things like uptime/downtime and utilization (how much this machine is used in a day), but we have very little information about what\u0026rsquo;s really happening during the various processes the machine performs.\nIf you\u0026rsquo;re only looking at the data, you\u0026rsquo;re not going to be able to diagnose the issue, and you\u0026rsquo;re certainly not going to be able to develop a predictive rule or model about when the machine is going to stop cleaning properly. Despite the fact that the machine is running and providing data, the data isn\u0026rsquo;t very useful.\nIf we wanted to diagnose the cause of the cleaning issue based purely on data, we could identify possible failure modes and the data needed to detect those. Here are some examples.\nFault Mode: Data Required to Diagnose\nNozzle Not Fully Opening: We need a way to track exactly how much the water and soap nozzles open\nWater/Soap Pressure Issue: We can track the water pressure in the pipes that go to the nozzles, maybe at different locations along the pipes\nConveyor Movement Issues: Tracking how much the conveyor actually moves at each step might identify if the beaker is not placed at the right position over the nozzle\nIf we were also collecting this data, maybe we\u0026rsquo;d see that the water pressure exiting the nozzle has gone down over time. If we have multiple pressure readings along the pipes, maybe we could further diagnose it as a supply issue (not enough water is getting to the machine), a blockage issue (the water supply is fine, but one of the pipes in the machine is getting clogged) or a nozzle pressure issue (the nozzle is not spraying at enough pressure). Based on this, we could build a simple model that raises an Alert when the water pressure is trending downwards and is predicted to be too low (to clean the beaker effectively) in the future; a rough sketch of this idea is shown below.\nThis type of data is also known as Sensor data. In contrast to Event data, sensor data gives you information about what was going on during a process, not just after it. Event Data typically tells you that an event has started or occurred, but has limited information about the conditions during the process itself.
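Here is a minimal sketch of the trend-based Alert mentioned above: fit a straight line to recent daily pressure readings and raise an Alert if the extrapolated pressure will drop below a cleaning threshold soon. The threshold, the horizon, and the readings are all made-up illustration values, not real machine data.

import numpy as np

def pressure_trend_alert(pressures, min_pressure=2.0, horizon_days=14):
    """Fit a line to daily pressure readings and Alert if the extrapolated value
    drops below min_pressure within horizon_days. All numbers are hypothetical."""
    days = np.arange(len(pressures))
    slope, intercept = np.polyfit(days, pressures, 1)
    projected = slope * (len(pressures) - 1 + horizon_days) + intercept
    return projected < min_pressure, projected

# ten days of daily average nozzle pressure (bar), slowly trending down
readings = [3.1, 3.0, 3.0, 2.9, 2.9, 2.8, 2.8, 2.7, 2.6, 2.6]
alert, projected = pressure_trend_alert(readings)
print(f"Projected pressure in 14 days: {projected:.2f} bar, Alert: {alert}")

In a real program the threshold would come from engineering knowledge about what pressure is actually needed to clean a beaker, and the model would ultimately be judged by how many true versus false Alerts it generates.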
It\u0026rsquo;s the combination of Event and Sensor data that tells you when a process starts (a Start Event), what is going on during the process (the Sensor Data), and when a process ends (an End Event).\nHow do people decide what data needs to be collected? # Imagine when this beaker cleaning machine was designed, the original requirement was to build something priced to sell to small, low-volume factories with limited money. Engineering and Product management decided the way to reduce costs would be to build the minimal product described above.\nOver time, the machine became more popular and higher volume customers started buying it. They also started complaining about it being difficult to service because it was difficult to diagnose issues. These customers also started to make warranty claims out of frustration, stating the machine was malfunctioning due to product defects. And since diagnosis was difficult, it was not clear if this was a genuine warranty claim, a maintenance issue, or people not operating the machine properly.\nTo fix this situation, the service organization proposed a solution. They would create a subscription service plan and fix any machine issues as part of the subscription cost. To do that, they would need the machine to collect the data described in the previous section. They also proposed a \u0026ldquo;subscription plus\u0026rdquo; plan which offered remote diagnosis of the problem. That way the problem would be identified quickly before anybody even went onsite, and the service technician would already have the parts needed (along with a repair plan) to fix the machine as quickly and efficiently as possible. To implement the remote diagnostic capability, the machine control computer needed to be upgraded (read: more expensive), the software needed to be updated to expose the data, and a software tool needed to be implemented to transmit data to the remote service center.\nFrom a business standpoint, there are a few considerations with this subscription plan idea.\nThe Service Org needs to provide a business plan demonstrating this subscription service is profitable, and it\u0026rsquo;s worth the time, effort, and money for Engineering to update the machine Product Management needs to understand if customers are willing to pay for this updated machine and associated service plan. Are there enough potential customers? Engineering needs to look at their priorities and decide if it makes sense to make these updates to the machine, or to have 2 versions of this machine. One basic version, and one with the extra data capabilities. And based on that, can they can support both machines? The Data Infrastructure org also needs to be made aware that this service is coming and they will need to build a pipeline to ingest the machine data, process it, and expose it in a usable way (e.g. Dashboards) to the personnel in the Service Org. The specifications of the pipeline might influence what data is sent from the machine since the Data Engineers might need something specific that\u0026rsquo;s not in the existing machine data requirements. And whatever this costs to do, it needs to be included in calculations for the service plan. The Marketing and Sales Orgs will probably need to be involved as well since they can help provide insight into the customer demand for these features. Assuming the product updates are made, these orgs will definitely need support with their go to market strategy to market and sell it. Moving to Predictive Analytics # Two years have gone by. 
Product management and Engineering decided on two machines, and the second one not only included the extra data capabilities, but it could clean more beakers per hour. This brought in a lot of new customers who were willing to pay for the plus plan that includes remote diagnosis. Customers really like the new service, and the subscription plan is profitable as well.\nHowever, there are two growing concerns. The first is that the service department only knows about a machine issue when a customer calls and creates a ticket. Basically, the service is 100% reactive and customers are already experiencing an issue by the time they call. Customers are increasingly asking for automatic and proactive identification of machine issues before it affects them.\nThe second is that the remote diagnosis is completely manual. After a customer calls, somebody with diagnostic training opens dashboards, identifies the issue, and then provides a recommended plan to fix it. Since the diagnostic process is manual, the number of technicians has grown in proportion to the number of customers. However, from a budget perspective, this isn\u0026rsquo;t scalable because the number of diagnosticians required is cutting into the profit from the subscription revenue.\nTo address these two problems, the Data org is asked to come in and evaluate the feasibility of automated detection of machine malfunctions before they affect the customer, sending out a notification Alarm based on this, and including a recommended fix plan with every Alarm. The Data team spends some time evaluating this and provides the following recommendations:\nCurrently, the data from each machine is a daily summary of the sensor values. For example, daily maximum water pressure, daily minimum water pressure, and daily average water pressure. Since it is a daily summary, it is only sent once a day. The Data team feels that for good predictive service, they would prefer the raw un-aggregated sensor data in an ideal case, but hourly summaries would be sufficient. It would also need to be sent 6 times a day. (A small sketch of what this aggregation change looks like appears just below.)\nThey need a budget for the infrastructure to develop and deploy these models.\nThe infrastructure will need to integrate with the other systems, such as the Service system that manages customer tickets.\nThey need a dedicated team of at least one Data Engineer and one Data Scientist to be able to support building the models, but also to support any tickets about the data quality, models, and Alarms.\nAll of the above will eat into the revenue pie of the subscription services, so it\u0026rsquo;s important to estimate how much automating these things will save the company compared to just hiring more people to perform the remote diagnoses. In other words, how much cost reduction is realistic if the Data org is able to successfully implement their ideas? How accurate do the Alarms and recommendations have to be to actually realize these cost savings and for it to be considered a success?\nThe Motivations for Data Collection # At this point we can stop building out this story. There are many different directions this could go, all of which would further illustrate the original goal of this post and the previous one: that data is not collected and provided just because that\u0026rsquo;s what one does. Collecting data takes time, effort, and money, and there needs to be a useful reason why it\u0026rsquo;s done.
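As a small concrete illustration of that effort, here is a minimal sketch of the aggregation change the Data team asked for above: going from one daily min/max/mean summary per sensor to hourly summaries. The column name, the sampling rate, and the use of pandas are assumptions made purely for illustration, not a description of any real pipeline.

import numpy as np
import pandas as pd

# Hypothetical raw readings: one water pressure sample per minute for one day.
index = pd.date_range("2024-01-01", periods=24 * 60, freq="min")
raw = pd.DataFrame({"water_pressure": 3.0 + 0.05 * np.random.randn(len(index))}, index=index)

# What the machine sends today: a single daily summary (1 row per day).
daily = raw.resample("D")["water_pressure"].agg(["min", "max", "mean"])

# What the Data team is asking for: hourly summaries (24 rows per day).
hourly = raw.resample("h")["water_pressure"].agg(["min", "max", "mean"])

print(len(daily), "daily summary rows vs", len(hourly), "hourly summary rows per day")

Going from 1 row to 24 rows per sensor per day (or 1,440 rows if the raw minute-level data were sent) is exactly the kind of request that looks trivial in a slide deck but turns into real engineering, transmission, and storage work on the machine side. With that in mind, back to the stakeholders.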
Stakeholders can be very differently oriented in how they think about the value of data and where investments should be made. Engineering is focused on what data is needed to ensure the machine runs. Service wants data that allows them to service the machines as cheaply and quickly as possible while balancing driving down service costs with keeping customers happy.\nIn the predictive analytics section of this story, the Data team is really an extension of Service. If the Data team said they wanted higher frequency data just because they want more data, Engineering would completely deprioritize the request. However, since this is really a request to reduce costs and improve what the Service org provides to the customer, the Engineering group might be pushed to prioritize it.\nSometimes these relationships are adversarial. The Engineering org could disagree with the way the Service org wants to do this, feeling instead that the predictive algorithms can run on the machine computer itself. That way there\u0026rsquo;s no need for the Data org to get involved at all, and the Service department doesn\u0026rsquo;t need to own it. If the code is running on the machine, the slice of the subscription revenue allocated to the predictive service feature will go to Engineering and not the Service Org.\nIf we think back to the previous post where we talked about all of these functions being in the same company versus multiple companies, each one has its positives and negatives. For example, if the company that built this machine was purely a machine builder, they might say they don\u0026rsquo;t want to get into the servicing business. The most they\u0026rsquo;ll do is some minimal work to make it easier for service providers to add data export and transmission capabilities to the machine. When the machine builder and service are different companies, there\u0026rsquo;s a clearer delineation on where each should focus. If the Engineering and Service org are in the same company, there might be more push from upper management for the groups to work together, since any cost reduction will benefit the company as a whole. At the same time, this might result in a worse quality predictive service product because both of them have no choice but to work together and neither wants to hold the other accountable for issues and risk harming existing relationships.\nThe Evolution of Industrial Machines and Technology # In the example in this post, the machine is very simple and only needs a very basic computer running a relatively simple set of instructions to control the machine. Even if the machine was collecting the hypothetical sensor data, it would still only be a few sensors and an easily manageable set of data.\nIn contrast, there are machines so complex they have hundreds or thousands of sensors, multiple computers, and lots of additional microprocessors to control everything. The software running on these machines is also just as complex, and is basically a product in itself. Examples of these complex machines include the lab equipment used to analyze human blood samples and machines used to manufacture microchips.\nWhen you hear about or see these types of machines, it\u0026rsquo;s hard to wrap your head around how they were designed and built. One question I always ask myself is how did people even start or plan it out? Imagine if somebody asked you to build a car. Where would you even start? How do you build something so complex from scratch?\nThe answer is, you don\u0026rsquo;t build it from scratch.
The complex machines are based on the lessons learned from less complex machines, which are based on lessons learned from machines that were less complex than those ones. Over time, people make mistakes and learn how to make things better. They also learn what is needed in the real-world and they evolve their products. Lots of new machines have hardware and software in them that was made and used in an older machines.\nAs an example of this, let\u0026rsquo;s look at the computer that controls these machines. At first, the control systems were basically all hard-wired and it was very difficult to change the logic. Then came the PLC which gave people much more flexibility than hard-wiring the logic to control the machine. Modern PLCs are basically regular computers (with lots of cores) that have all the software and hardware (e.g. connectivity) to operate in an industrial environment. They can also be extended to run docker (e.g. python in a container), a messaging bus, a relational database, you can attach an NPU (for Deep Learning Inference), and do lots of other things.\nWith this machine evolution comes an evolution in what data is collected as well. As machines are designed and used by customers, people realize what additional data needs to be collected in the next version of the software or machine. Over time, as sensors become smaller/cheaper, and programmable microcontrollers become more flexible and powerful, having more data becomes essential to these more complex machines operating properly. Then the discussion changes from \u0026ldquo;why collect data\u0026rdquo; to \u0026ldquo;what\u0026rsquo;s the right data to collect?\u0026rdquo; Here are two examples of this.\nA complex machine already had lots of different event and sensor data being used to control the machine, but due to the high data volume, most of it was not stored and thus never exported. In fact, the software and database on the machine weren\u0026rsquo;t really designed to store and export all that data outside of some temporary debugging modes. To enable any additional data export would require a major software update that didn\u0026rsquo;t cause issues with any existing data exports. So for any additional data exports, there had to be good rationale for what data was really needed to address any use cases. A factory was collecting data from multiple machines that were all part of the manufacture of a single thing. A 6 month sample of this data (thousands of sensors) was multiple terabytes compressed. There was a very expensive issue that was occurring, and they were trying to better understand the circumstances under which this issue occurred, not build an accurate model to predict the issue. After an extended timespan where multiple internal and external teams of specialized Engineers and Data Scientists looked at this, the conclusion was that despite collecting so much data, they were not collecting the right data. Note that data limitations do not arise only from machine data. Certain use cases require the alignment of machine and non-machine data and this is not always feasible. One widget manufacturer was facing the problem that some widgets would crack during the final stages of the manufacturing process, and there was a theory that it was a combination of manufacturing process and some supplier quality issue. There was some issue with the raw material that resulted in a reaction to some condition in the manufacturing process. 
All raw material did undergo a quality review when it was delivered, but nothing stood out. In this particular factory, each widget was created individually, but in later stages many widgets were batched together (e.g. heating lots of widgets together in an oven). So there was no way to match or align the sensor or quality data with the individual widgets, and therefore no way to tie the data to the widgets that cracked.\nWith what I\u0026rsquo;ve just said, I don\u0026rsquo;t mean to imply that machines and data collection have reached this point where every machine is generating a lot of data and the challenges are now around data alignment/cleaning and figuring out which is the right data to collect. Many machines collect minimal data that is useful only for some basic KPIs. It also takes time for Industrial customers to upgrade and update their machines. If a factory was built in 1993, it\u0026rsquo;s almost a given that many of the original machines are still operating 20+ years later. And those machines were probably designed based on technology older than 1993! Unless there was good reason to upgrade, the computer controlling those machines would still be that old. That\u0026rsquo;s assuming the computer controlling the machine isn\u0026rsquo;t a human!\nWrap Up # I hope this post gives some concreteness to the idea of a machine and what data it generates. I realize the topic will still feel a bit vague since I\u0026rsquo;ve given general examples of machine data, but there\u0026rsquo;s no spreadsheet here with data extracts from machine and non-machine data. My intent was to add clarity to the idea that data isn\u0026rsquo;t collected for the sake of being collected, and that there are conscious decisions made about what data is collected and exposed. Also, the calculus around these decisions changes as technology improves, and the justification needed to make a decision today might be different than what was needed a decade ago.\n","date":"23 January 2024","externalUrl":null,"permalink":"/posts/deciding-what-data-to-collect/","section":"Posts","summary":"In my previous post we looked at how stakeholders and organizational structures influence decisions about what data is collected and exposed for use.  If you haven\u0026rsquo;t seen machine data before, that post might have felt a little abstract and some things might have felt contradictory (e.g. how does one look at the events before a failure without any sensor data?).  For people who prefer concrete details and real examples, this post is for you.","title":"Deciding what Data to Collect","type":"posts"},{"content":" Introduction # This post will be the first of two that discusses how organizational dynamics, stakeholder incentives, and the goals of the business drive decisions about what data is collected and why. What is discussed is relevant to companies trying to do predictive maintenance on industrial and commercial machines, but it applies to other industries as well.\nIn my experience, when people first start learning about things you can do with data, there tends to be a focus on the last part of the process, where one is trying to use that data as part of a hypothetical business decision. If you are a Data Scientist, this means you focus on using models to make an accurate prediction. If you are a BI person, your goal is to learn how to make a dashboard with useful charts.
If you are a Data Engineer, you learn how to transform raw data into a more user-friendly form.\nThere isn\u0026rsquo;t much focus on why this data was generated and stored in the first place, which is the start of this data process. Data isn\u0026rsquo;t collected just because we have the technological capabilities to collect it. Data is collected because people made the decision that collecting this data would be useful, and what it meant to be \u0026ldquo;useful\u0026rdquo; was translated into technical requirements for what data needed to be collected.\nWhat people mean by \u0026ldquo;useful\u0026rdquo; is not static. Different stakeholders and organizations have different goals and can have very different ideas, sometimes opposing, of how the same dataset can be useful to them. And as companies change, and technology evolves, this can also change what people find useful. All of these stakeholders and changes feed into what people want to do with data, and what data needs to be collected to satisfy this.\nThis post will be the first of two that discusses the organizational dynamics I\u0026rsquo;ve seen with companies trying to do predictive maintenance. The overall goal is to try and provide some insight into how different stakeholders with different business motivations can influence what data is collected and how it\u0026rsquo;s used. To get there, it\u0026rsquo;s useful to look first at the architecture of how data might flow in a predictive maintenance project and then make this more concrete using examples of what data is generated and collected on a machine.\nTo avoid this post turning into a novella, I will break this into 2 parts.\nThis first post will present some data architectures as a framework to explore how different organizational structures can drive why and what data is collected and how it\u0026rsquo;s used.\nThe second post, Deciding What Data to Collect, will provide more concrete examples of what data people might want to collect from a machine, and their motivations for doing so. It\u0026rsquo;ll be more technical, but I think this will provide a lot of context to the first post regardless of whether you care about technical details or not.\nI don\u0026rsquo;t think there\u0026rsquo;s an absolutely correct way to order these posts, but I thought it made sense to start with architecture first. If you prefer to jump from here to the second post and then come back to read the rest of this post, that shouldn\u0026rsquo;t be an issue.\nArchitecture Overview # In this post we\u0026rsquo;ll look at the data architecture(s) and data flow(s) of a typical use case seen with industrial machines. This post is focused on the situation where there are machines generating data in a remote location (e.g. a factory) and there is a person or group of people who need centralized access (e.g. in a corporate office) to this data. An alternative to this is somebody who is at the same location as the machine and is directly collecting and analyzing data straight from the machine, but this post isn\u0026rsquo;t focused on this latter type of use case.\nOn the surface, this data flow is quite simple. What I am really trying to communicate is the complexity that arises from different organizations owning the data, and different stakeholders having differing incentives and ideas about how to extract value from this data.
So try to avoid taking the architecture diagrams at face value, and instead mentally map them to the organizational abstractions they represent.\nSomebody reading this post might have a lot of questions about the technology choices and data products used in this architecture, but that\u0026rsquo;s not really the point of this post. The architecture is just the framework to help understand how this data is owned by different organizations in any given company. If it helps you to visualize things, pick your tool(s) of choice and fill the boxes with those.\nLet\u0026rsquo;s start with a very simple data architecture of machine data.\nflowchart LR M[Machine] --\u003e C[Raw Data Landing Zone] --\u003e DW[Data Warehouse] --\u003e DS[Data Scientist] What does each of these boxes mean?\nA Machine Generates Data. This data is stored locally in a database and/or files. Machine data might be Event Data, which has data from known events with established event codes (e.g. event code 03040506 means a robot arm picked something up), Sensor Data, or Images/Audio/other non-textual data.\nA software agent on or connected to the Machine uploads this data to a server somewhere in the form of structured (e.g. XML) or unstructured files.\nThese raw files are securely copied to a server we\u0026rsquo;ll call a Raw Data Landing Zone.\nThe raw files are processed and exposed in a Data Warehouse. An example of this is if data from machines around the world are combined into one table that users can query. Typically some type of automated Data Engineering occurs between the Raw Data Landing Zone and the Data Warehouse.\nThe diagram above is a very basic representation of a use case where a machine is generating data and somebody somewhere is using that data. Let\u0026rsquo;s add a bit of complexity by adding non-machine data sources. \u0026ldquo;Non-machine\u0026rdquo; data is data that can be linked to that machine, but is not generated by that machine.\nflowchart LR M[Machine] --\u003e C[Raw Data Landing Zone] --\u003e DW[Data Warehouse] --\u003e DS[Data Scientist] S[Service Data] --\u003e DW Q[Quality Data] --\u003e DW In addition to data from the machines, we also have Service Data. This is data created by service people and technicians who service these machines. This type of data has both structured and unstructured components. Examples of the structured parts of this data might be\nWhen somebody opened a ticket due to a machine issue\nWhen a service technician started and stopped working on the problem\nSome keywords organized in a hierarchical list describing the problem (e.g. Subsystem B -\u0026gt; Fluid Pump -\u0026gt; Leaking Outlet Pipe -\u0026gt; Replacement)\nWhat work was done and how long it took\nWhat parts were changed\nThe unstructured parts might be typed-up notes documenting every single interaction with the customer, and also a way for the service technician to describe their work beyond what is possible to record in the structured data (e.g. entered using a form). An example of this might be \u0026ldquo;Met with machine operator who said this is an intermittent issue, but is happening with increased frequency. No issues reported in self-diagnostics after running it 2 times, but issue occurred as we were observing normal operation.\u0026rdquo; The actual text might not be this clear and there are likely to be multiple languages in the data at a global company.\nQuality Data. Either before, during, or after the machine process, some measurement of quality is performed.
Sometimes this process is done manually by a person, so you only have snapshots of quality. Some machines can also automatically perform quality checks, but these checks may slow down the machine process and they may not be done every single time. It\u0026rsquo;s also possible the quality check results were recorded on paper and then entered into a computer later, which can lead to data quality issues. Note that in the diagram the Service and Quality data are ingested directly into the Data Warehouse. This is because the data is typically entered using a computer and is stored in a local or cloud database. So there may be a way to query the data directly and feed it into the Data Warehouse, with no need to temporarily store it in the Raw Data Landing Zone.\nData Ownership # Instead of looking at the data flow purely in the sense of how the data flows, let\u0026rsquo;s modify this to include the \u0026ldquo;owners\u0026rdquo; of the data. \u0026ldquo;Owners\u0026rdquo; are the business units who define the requirements for what data needs to be collected so they can use it to run their part of the business. These are the ones who say \u0026ldquo;I need this metric/KPI, and this is the data I need to calculate that.\u0026rdquo;\nIn most companies you\u0026rsquo;ll also see another form of data ownership, which is who owns the data and data pipeline from a technical standpoint. For example, if you query a database and the data has an error, who do you contact to fix that? There\u0026rsquo;s typically a technical team that is responsible for those types of issues. This post is not about this latter group of technical owners, but really about the business owners.\nA Single Company Owns Everything # Let\u0026rsquo;s assume all the business and technical owners/organizations are at a single company called Company A that designs, makes, and runs the machine(s). Everything is internal and there are no external vendors who own these functions.\nThe \u0026ldquo;owners\u0026rdquo; are shown in the chart below. To make this easier to read, the boxes with red borders (Engineering, Service, and Quality) are the business owners. The other teams work with the data and build things (e.g. Dashboards) with it, but ultimately the business owners are the ones using this data to reduce costs or increase profits.\nflowchart TD subgraph Engineering M[Machine] end subgraph Service S[Service Data] end subgraph Quality Q[Quality Data] end subgraph IOT[IoT Group] C[Raw Data Landing Zone] end subgraph DT[Data Team] DW[Data Warehouse] end subgraph A[Analytics] DS[Data Scientist] end style Engineering stroke:#ff0000 style Service stroke:#ff0000 style Quality stroke:#ff0000 M --\u003e IOT DW --\u003e A S --\u003e DT Q --\u003e DT C --\u003e DT Important: it\u0026rsquo;s worth recognizing something here about data ownership. Each organization defines their data requirements based on what they need to run their part of the business. This is in contrast to an org like the Analytics Org. People in the Analytics Org aren\u0026rsquo;t defining what data needs to be collected, they are using data that was generated because of the requirements defined by another part of the company. In other words, people in the Analytics org are using data that is a byproduct of some business process; they are not defining why that data needs to be collected in the first place. That being said, it is possible that somebody in the Analytics org will specify additional data that needs to be collected to support a business use case.
And then the business unit that asked for that use case will justify to somebody else (e.g. Engineering) why they should collect that data. But even in this case, the Analytics org doesn\u0026rsquo;t \u0026ldquo;own\u0026rdquo; that requirement. It\u0026rsquo;s the business unit that owns it.\nTo make this concept of ownership more concrete, let\u0026rsquo;s take the Service org as an example. Service is responsible for ensuring the machines are running properly. If the machine is not running properly somebody will contact the service organization to fix the machine. The service org provides these basic functions\nCall center personnel and to interact with the machine operators and other personnel at the customer(s) Remote and onsite troubleshooting Remote (if possible) and onsite fixes Scheduling onsite visits Dispatching technicians to the machine site if an onsite visit is needed Parts inventory and ordering parts From a data requirements perspective, this translates into\nRecording Customer Contact information and machine (e.g. the machine serial number) details Tracking of machine operator/customer interactions with customer support Documenting what repair/service work was done on the machine and who did it What work was performed and date/time when it was done Work time(s) and travel time(s) for each time work was done Parts used or changed Identification of which system or part in the machine was causing the issue Overall tracking of a Service Ticket Open (when the service department was alerted to the issue) and close (when the issue was resolved) date and time. Total time spent on work and travel Total monetary cost of this service ticket to the customer and the service org The list above would be provided to a technical team as the requirements for an application or software service(s) where this data can be collected (e.g. using a web form), viewed, modified, stored, and exposed to people inside and outside the service org.\nNote that the service org might not collect and store Machine Data (e.g. sensor data) as part of their daily operations. However, they may need machine data to understand the behavior of the machines so they can use it to diagnose problems. So instead of storing this data in their own systems, they may get this data from the Data Warehouse.\nOwnership Spread Across Multiple Companies # The nature of ownership and incentives can change quite drastically if all of these functions are at different companies. Let\u0026rsquo;s change the structure now so that different companies or service providers own a different piece of this diagram.\nFirst, let\u0026rsquo;s assume that the Machine is built by a Machine Builder, which is a company that builds machines that somebody else uses. A basic example of this is something like a commercial water heater. The company that makes these water heaters probably sells millions of them per year. However, while they design and manufacture the water heaters, they aren\u0026rsquo;t heavily involved in installing or servicing them. Outside of identifying potential product defects that effect a large percentage of customers, they aren\u0026rsquo;t interested in collecting near real-time data from each heater. Other examples of machines built by Machine Builders include CNC Machines, Electrical Submersible Pumps used in Oil \u0026amp; Gas, Electric Vehicle Battery Testing Equipment, etc.\nIn the diagram below, each company is a different color. 
You can contrast this to the single company chart above, where all the business functions are owned by the same company. To make it easier to understand what each company is doing, I\u0026rsquo;ve modified the text in the boxes to describe what they are doing. The data is a by-product of this (e.g. Servicing machines creates service data).\nflowchart TD subgraph MB [Machine Builder] DSM[Designs and Sells Machine] end subgraph Customer M[Uses Machines] Q[Measures Quality] end subgraph TPS[Third Party Service] S[Services Machines] end subgraph ITSP[IT Service Provider] RDLZ[Collects Raw Data] DT[Performs Data Engineering] DW[CreatesData Warehouse] end subgraph AC[Analytics Company] DS[Builds Analytics] end style MB fill:#9fc5e8 style TPS fill:#ffd966 style Customer fill:#8fd24b style ITSP fill:#e69138 MB -. Sells Machine To .-\u003e Customer Customer --\u003e ITSP S --\u003e ITSP Q --\u003e ITSP ITSP --\u003e AC RDLZ --\u003e DT --\u003e DW AC-. Provides Analytics .-\u003e Customer To understand what\u0026rsquo;s going on in this diagram, let\u0026rsquo;s assume the Machine Builder sells some sort of complex machine for manufacturing. There is a customer who buys many of these machines and runs them in factories across the world. The machines are serviced by different companies depending on location. As part of the customer\u0026rsquo;s contract requirements with the service providers, they ask that their service data be made available. The customer decides they want to provide more insight into their factory operations, so they pay an IT Service provider to collect and process all the data and make it available in a centralized data warehouse. The customer also contracts with an Analytics company to provide some sort of Analytics (e.g. Dashboards and KPIs) they can use to monitor their machines and improve operations.\nThere are lots of variations of the diagram above. Maybe the Machine Builder sells an IoT subscription where each machine automatically uploads data to a cloud service and the customer can access it. In that case, maybe the IT Services Provider is not required and the Analytics company can directly access that data.\nIt could be that the Service company and the Machine Builder are two separate companies under the same international conglomerate and they work together to provide Analytics. When the customer buys the machine they get some basic pre-built dashboards for no additional cost, but there is also a subscription that provides more comprehensive service dashboards and other features (e.g. notifications).\nHow these different companies (boxes) are organized and who offers what service can change the incentives around what data is offered. If the Machine Builder only builds and sells machines, their machines might be designed to expose the bare minimum of data. If the Machine Builder sells an IoT service, they might be motivated to collect and provide more data, but at different data subscription tiers. If the Machine Builder and the Service company are closely aligned, maybe they will have access to data that the Analytics company will not get access to.\nThe customer is not automatically passive in all this. A very large customer might say they need the machine builder to update the machines to expose some data. 
If the machine builder says no, the customer might start considering a different machine builder for their next factory.\nThe Incentives around Collecting Data # Realistically, nobody starts with the idea of collecting and exposing as much data as possible just for the sake of doing it. Collecting and exposing data has costs in terms of the time to build the functionality, the hardware and software needed to implement it in the machine, the people to do that work, and the financial costs of all of it. Somebody has to conclude that it\u0026rsquo;s worth it to expose that data. For a company and the business org under it, that usually means making that data available will increase profits by increasing sales or driving down costs.\nWhether we are talking about all the business functions being under a single company or across multiple companies, it\u0026rsquo;s important to realize that the incentives between different business functions motivate the decisions around what data to expose. Rather than provide generalities, I\u0026rsquo;ll provide examples.\nAn Engineering (Machine Builder) org focused their \u0026ldquo;post-sale\u0026rdquo; troubleshooting efforts on identifying the parts of the machine that were breaking and driving up warranty costs. To identify the root cause of the issue, they identified the steps (using the Error log) that led to the issue. For example, Step A -\u0026gt; Step B -\u0026gt; Step F is more likely to lead to an issue than Step A -\u0026gt; Step B -\u0026gt; Step D. Another name for this is Root Cause Analysis using a Causal Tree. By recreating this in their lab environment, they were able to identify what was going on and if it was more pragmatic to update the software/hardware or eat the warranty cost. However, the Service Org was seeing many other machine problems that could not be diagnosed using this method. They needed sensor data to help determine what was going on. But since the Engineering group felt their troubleshooting methodology was sufficient for their needs, they consistently deprioritized adding new sensor data to the machine data exports, which hindered the Service Org.\nAn Engineering Group wanted to add a lot of telemetry to their product, but there were concerns about legal risks. After classifying their use cases as risky or less-risky, it was not clear there was a good business model for only implementing the less-risky use cases. In other words, there was not much value to the customers or the company to add that telemetry.\nA Service Org wanted to move from reactive (e.g. fixing an issue after it occurs) to proactive (identifying and alarming on possible issues before they occur) service. However, due to budget limits on staffing, they couldn\u0026rsquo;t realistically act on a proactive alarm in less than 96 hours. Based on this, they were not really motivated to make the business case to engineering to increase the data frequency or what data was collected (e.g. instead of just a daily average, have an hourly minimum, maximum, average, etc.). In a more dysfunctional company, this is also something that can be weaponized. For example, the service org can be accused of \u0026ldquo;Not being responsive to proactive issues\u0026rdquo; even if they can\u0026rsquo;t realistically respond to them because of limited personnel. In that case, the service org might actively discourage any improvements in data collection.\nInstead of talking about Machine Data, here is an example from Service Data.
When service personnel went onsite to a customer, they were required to specify (in the service ticket) the part of the machine they worked on and what was broken. However, in many cases, it was not 100% clear what the root cause of the issue was, resulting in multiple parts being changed. So the data that was entered into the service ticket reflected what was changed, but not what the cause of the breakage was. From a Data Analysis and Predictive Model perspective, this added label noise because it was hard to determine the actual reason for the issue. Fixing this would require updating the service software + UI used by the service personnel, and retraining all the service personnel on the new system. Looking at this purely from the perspective of what value cleaner data and better analytics would provide if they made this change, it was not clear that it needed to be done. As a real-world example of this, there was a water pipe that was routed over a circuit board. If that pipe leaked, it would leak directly onto the circuit board and cause a failure. Since the circuit board was much more expensive than the pipe, the former was typically entered as the \u0026ldquo;cause.\u0026rdquo; This was the type of issue where implementing an engineering fix (e.g. moving the pipe) would provide much higher ROI than adding a bunch of data to predict this issue.\nThere are machine issues where fixing something after it breaks is the best course of action. For example, there might be a 50 cent rubber gasket that leaks, but it takes 6 hours of labor to replace that gasket. In that case, it might not be worth it to collect a lot of data to try and predict when it is going to fail. Instead, you replace it every year as part of a yearly scheduled maintenance because it isn\u0026rsquo;t worth it to optimize when that part should be changed.\nAn Engineering and Services Org was struggling to clearly define which parts should be reengineered (e.g. like a recall), which should be covered under warranty, which should be fixed by service, and which should just be left to break and be fixed reactively (customers without a service plan would have to pay for it themselves). In other words, who should pay for the fix? Since they all agreed that the current confusion was untenable, it was much easier for all business owners to make the case (and find budget) for better data so they could better classify how issues should be handled.\nAs you can see in all these examples, the goals of all the business owners define how motivated they are to push for data collection and data quality improvements. And it\u0026rsquo;s not just each business owner themselves, but these motivations are also driven by how collaborative or adversarial the relationships are between these owners. Every situation is different, and can evolve (or devolve!) over time.\nHow does this impact Analytics and Data people? # If you work in a data analytics role, data issues can be really frustrating and you can feel like screaming \u0026ldquo;Can\u0026rsquo;t you see this change to the data is needed? Why is it so difficult for you to understand?
You are the one who asked me for this use case, and this is what needs to be done to be successful.\u0026rdquo; It\u0026rsquo;s not always clear why some data is not collected/exposed or why some major data issue is not taken seriously and addressed.\nOne thing that has helped my sanity in these situations is to defocus from the particular project I\u0026rsquo;m working on and think about the business motivations. Many times, this provides the perspective needed to understand why a change is not going to happen and the real answer is there is no business justification for resolving this. And in many cases, if a fix requires a jump from one ownership box to another, that\u0026rsquo;s one box too far.\nIf you are pushing for a change, instead of beating people over the head with more analysis and trying to convince them you are right, step back and look at motivations and business use cases. People understand \u0026ldquo;The impact of not resolving this data issue is $100K\u0026rdquo; a lot more than 50 slides of data limitations and the issues it causes. In doing so you might realize the fight you\u0026rsquo;re having isn\u0026rsquo;t worth the effort you are putting into it. I\u0026rsquo;ve certainly been in a situation where the expected ROI of a project was very high ($100K+/year), but after some analysis we realized that with the use case defined the way it was, that number was totally unrealistic and that number was revised down quite a bit and the project was put on hold.\nUnfortunately, it might take a long time to be able to see these business motivations, especially as a new employee or with new stakeholders. You also need to form the right relationships with the right people to really understand what is driving decision making. But I think it\u0026rsquo;s worth the suffering for you to build that knowledge and those relationships.\nAt this point I hope you have a basic understanding of how data flows and how different business owners can drive what data is collected and how it\u0026rsquo;s used. In my next post I\u0026rsquo;m going to provide more concrete example of machine data, and also use this to provide more examples of how business requirements drive and define what data to collect\nFinal Thoughts # To wrap things up, let me just reiterate some key takeaways. Data is collected because somebody thought it would be useful to collect, not just for the sake of collecting data. In a commercial context, the people who define what is useful are business owners, who want to use this data to reduce costs or increase sales. The business owners are motivated in different ways, and it\u0026rsquo;s important to keep their motivations and incentives in mind when trying to understand what data is collected and why they find it useful. Also remember that different business owners have to work together, and how aligned their goals are influences how they perceive each other.\nOne last point. Over time, the cost of collecting data goes down, and the need to collect data goes up. A 30 year old machine might not expose any data at all, and the only way to even get data from it would be to physically add sensors and electronics to capture and export that data. In 2023, unless a machine is extremely simple, it\u0026rsquo;s likely a machine will collect some data, even if it doesn\u0026rsquo;t give customers access to that data. Compared to 30 years ago, there\u0026rsquo;s no need to convince anybody that collecting data from a machine is something that needs to be done. 
It\u0026rsquo;s really more a question of what data needs to be collected and how to make money from that data.\n","date":"19 January 2024","externalUrl":null,"permalink":"/posts/businesses-stakeholders-define-why-data-should-be-collected/","section":"Posts","summary":"This post will be the first of two that discusses how organizational dynamics, stakeholder incentives, and the goals of the business drive decisions about what data is collected and why.  What is discussed is relevant to companies trying to do predictive maintenance on industrial and commercial machines, but it applies to other industries as well.","title":"Businesses Stakeholders are the ones who Define why Data Should be Collected","type":"posts"},{"content":"Hi, I\u0026rsquo;m Ujval. I\u0026rsquo;ve worked as a Data Scientist for over a decade now. For eight of those years I worked in customer facing roles, so I was able to see what different companies were trying to do with data and machine learning. As far as industries, I\u0026rsquo;ve worked on use cases in Oil \u0026amp; Gas Extraction and Processing, Medical Equipment, Automotive, Pharmaceutical, and Consumer Packaged Goods. In all my jobs I worked with many different stakeholders and experienced both the business and technical side of things, which made me better at understanding why people wanted to do things, and not just the technical how.\nWhile it wasn\u0026rsquo;t intentional, a large portion of my work was focused on Industrial Equipment and Machines. Consequently, my first blog posts will be about machines, machine data, and the business of being a Data Scientist who is expected to build things which deliver an actual financial benefit to those industries. The goal is to provide others working on similar projects with enough context to ask better questions and make better decisions, so the things they build will actually be used, and not just end up in a long forgotten file repository.\nMy posts tend to be long-form. Part of why I write this way is the squishy nature of the topics I write about, and also because I don\u0026rsquo;t want to make assumptions about what readers already know. So in all my posts I try to explain the concepts and terms used.\nI hope my writings will be informative and provide clarity to somebody else, especially people who are just starting out.\nWith the arrival of LLMs, I feel like I have to add this - the writing on this site is 100% me and 0% LLMs. Writing is one of the few activities I have where I can get into a flow state, because I can just write without interruption, even if what I\u0026rsquo;m writing is a disorganized mess. There are other things I enjoy doing, but it\u0026rsquo;s hard to not get interrupted. For example, I enjoy coding, but it\u0026rsquo;s full of interruptions - looking up things, debugging an error, and trying to refactor so things stay in a logical order. 
I write here because I enjoy it, not because I have to, so there\u0026rsquo;s no value in me using an LLM to write any part of this blog.\n","date":"1 January 2024","externalUrl":null,"permalink":"/about/","section":"Welcome","summary":"","title":"ABOUT","type":"page"},{"content":"Writings on the Intersection of Data Science, Organizational Dynamics, Industrial Machines, Service Operations, and Trying to Understand When Work Is Actually Useful\n","date":"2 June 2019","externalUrl":null,"permalink":"/","section":"Welcome","summary":"","title":"Welcome","type":"page"},{"content":"","externalUrl":null,"permalink":"/authors/","section":"Authors","summary":"","title":"Authors","type":"authors"},{"content":"","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"},{"content":"","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"}]