Just for the hell of it, here is the uncut version of my reply to the question on our major technical challenges for the FastCo article titled "How Do I (Really) Know If My Startup Is Infringing On Trademarks?"
For modelling, the major challenge is the legal domain itself: the corner of trademark law we are modelling is based on very loosely formulated legal standards and a vast body of case law determining what those standards mean in practice. The problem is trying to create a set of generalized rules that match the past decisions, because the cases can be quite inconsistent and are often tied to the particulars of each individual case. Often it really boils down to quite fundamental questions of what we perceive as similar or different, as significant or insignificant. For my academic work I'm happily reading books with titles like "How do words mean", but fortunately our team (myself included) generally prefers coding to philosophizing.
Of course, with trademarks we're always dealing with language, whether it's the trademarks (word marks) themselves or the product descriptions associated with them, and you get all the usual challenges of doing natural language processing computationally. However, there are quite a few additional challenges which are unique to this field. A trademark can in principle be in any language, in many different languages at the same time, or indeed in no particular language at all, so we can only make educated guesses by looking both at the words themselves and at their geographical (jurisdictional) context. The product descriptions are at least in a known natural language, but as text fragments they are extremely short and difficult to categorize based on the trademark data alone.
Computationally, a major challenge is the sheer volume of data we have to deal with. The full analysis requires a comparison of trademark pairs (the query versus each existing trademark) that is computationally quite demanding and time-consuming, so we have to do it in multiple passes, each one reducing the number of marks that could be close enough to warrant the full analysis. And of course we have to constantly optimize and parallelize the analysis to keep the response times acceptable in spite of the growing number of trademarks, registries and other data we cover.
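To make the multi-pass idea a bit more concrete, here is a minimal Python sketch. It is not our actual pipeline: the function names, the thresholds, the length and bigram filters, and the use of `difflib.SequenceMatcher` as a stand-in for the expensive "full analysis" are all assumptions chosen purely for illustration.

```python
from difflib import SequenceMatcher


def bigrams(word):
    """Return the set of character bigrams of a lowercased word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def length_pass(query, marks, max_len_diff=3):
    """Pass 1 (hypothetical): drop marks whose length differs too much."""
    return [m for m in marks if abs(len(m) - len(query)) <= max_len_diff]


def bigram_pass(query, marks, min_overlap=0.3):
    """Pass 2 (hypothetical): keep marks with enough bigram overlap (Jaccard)."""
    q = bigrams(query)
    kept = []
    for m in marks:
        b = bigrams(m)
        union = q | b
        if union and len(q & b) / len(union) >= min_overlap:
            kept.append(m)
    return kept


def full_analysis(query, mark):
    """Stand-in for the expensive pairwise comparison (assumed, not the real one)."""
    return SequenceMatcher(None, query.lower(), mark.lower()).ratio()


def search(query, marks, threshold=0.7):
    """Run the cheap passes first, then score only the surviving candidates."""
    candidates = bigram_pass(query, length_pass(query, marks))
    scored = [(m, full_analysis(query, m)) for m in candidates]
    hits = [(m, s) for m, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: -pair[1])


if __name__ == "__main__":
    registry = ["Zapos", "Zippos", "Microsoft", "Apple", "Zappo"]
    for mark, score in search("Zappos", registry):
        print(f"{mark}: {score:.2f}")
```

The point of the design is that each pass is cheaper than the next one, so the costly comparison only runs on the handful of marks that survive the earlier filters. The passes are also independent per mark, which is what makes this kind of analysis easy to parallelize.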
Still, the most important technical challenge for us is understanding the needs of the customer and trying to meet them. For instance, we could say we only deal with trademark law, but our customers don't really have trademark law problems. The actual issues are naming and brand management, and while trademark law does play a key role, we are not afraid to add other data sources as well, such as dictionaries to get word meanings in hundreds of languages, and industry-specific data sources like names of mobile apps or pharmaceuticals. And then there is the challenge of keeping a very complex process as simple as possible by giving the user the search options they actually need and by presenting the results in an accessible way.