In our special Java Daily edition, we’d like to introduce you to Mani Sarkar. He was kind enough to share his experience by answering 10 Java-related questions.
Mani Sarkar is a passionate developer, mainly in the Java/JVM space, who currently works with small teams and startups as a freelance software, data, and ML engineer, strengthening them and helping them accelerate.
He is a Java Champion, Oracle Groundbreaker Ambassador, JCP Member, OpenJDK contributor, and a thought leader in developer communities, and is involved with F/OSS projects like @graalvm, @wandb, nbQA-dev/nbQA, and others. He writes code not just on the Java/JVM platform but in other programming languages as well, hence he likes to call himself a polyglot developer. He sees himself working in the areas of core Java, HotSpot, GraalVM, Truffle, VMs, Performance Tuning, Data, and AI/ML/DL/NLP.
An advocate of a number of agile and software craftsmanship practices, he is a regular at many talks, conferences, and hands-on workshops, where he speaks, participates, organises, and helps out. He often expresses his thoughts via blog posts (on his own blog site, DZone, Medium, and other third-party sites) and microblogs (tweets). You can read more about him by going here.
Java Daily: Do you think we are close to the point where all commercial fields will integrate ML/AI into their systems?
We may not be able to say with certainty that all commercial fields will integrate it, but we can certainly say that, driven by many different market, social, and environmental forces, a large number of them, if not the majority, will do so. Those that do not may end up using systems that already have these integrations built in.
Java Daily: Data preprocessing and preparation (e.g. exploratory analysis, cleaning, gathering data) or big algorithm pipelines? What percentage would you invest in one direction, and how much in the other?
All the components of the pipeline you mentioned are important, and the split is very subjective: it varies from one project to another and from one class of gathered data to another. So far, experience shows that the data aspects of the pipeline take more time and need more human intervention than the non-data aspects, i.e. the parts that are automated, like building machine learning models (model algorithms and monitoring pipelines) from the processed data and the cycle from there onwards. If I have to give a rough estimate, I would say be ready to spend 60 to 70 percent on getting the data right and understanding it (data preprocessing and preparation), while the rest of the time or resources can be allocated to building models from it and monitoring them. Only after finishing a couple of such iterations will we know how much more time to allocate, to which components of the pipeline, and why.
Java Daily: Continuing the question – what is the perfect dataset?
This is a question about an ideal, and I’m not sure there is a standard answer to it, but I will still try to answer. A perfect dataset may never exist, but if we have to define one, it would be one that does not change with time and that has few or no errors, no missing values, no duplicates, and none of the statistical issues that Data Scientists have to check for and correct. All the features are balanced, both individually and in combination as groups. It is a very good representative sample, if not the complete representation, of the domain it relates to. It is conducive to many, if not all, of the model algorithms out there, such that when a model is built from it, the accuracy (or any other metric) of the model is always very high and consistent, especially when presented with unseen real-world data. That means it must withstand the trials of time, the velocity of data, and the edge cases presented by the real world. After reading all these criteria, many of you (especially my Data Scientist friends) would immediately think that such a dataset is a fantasy come true: is it even possible to have such a dataset?
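A minimal sketch of three of the checks mentioned in this answer (missing values, duplicates, and label balance), in plain Python. The function name `dataset_report`, the sample rows, and the `churned` label column are hypothetical and chosen only for illustration; a real project would more likely lean on a library such as Pandas.

```python
from collections import Counter


def dataset_report(rows, label_key):
    """Summarise missing values, duplicate rows, and label balance."""
    # A value counts as missing when it is None or an empty string.
    missing = sum(
        1 for row in rows for value in row.values()
        if value is None or value == ""
    )
    # A row is a duplicate when all of its key/value pairs match an earlier row.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())
    # Label balance: the share of each label among rows that have one.
    labels = Counter(
        row[label_key] for row in rows if row.get(label_key) is not None
    )
    total = sum(labels.values())
    balance = (
        {label: count / total for label, count in labels.items()}
        if total else {}
    )
    return {"missing": missing, "duplicates": duplicates, "balance": balance}


# Hypothetical sample data, small enough to check by eye.
rows = [
    {"age": 34, "city": "London", "churned": "no"},
    {"age": None, "city": "Leeds", "churned": "yes"},
    {"age": 34, "city": "London", "churned": "no"},
    {"age": 51, "city": "", "churned": "no"},
]
report = dataset_report(rows, "churned")
print(report)
```

A "perfect" dataset in the sense described above would report zero missing values, zero duplicates, and roughly equal label shares; this sample has two missing values, one duplicate row, and a 75/25 label split.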
Java Daily: In short, which development tools are your favorites when developing DL/ML/AI systems, and why? Can you describe different languages, use cases, and tools such as libraries, or even IDEs, if you have something interesting to share?
Python and R are leading in this area when it comes to developing AI/ML/DL systems. Among the top libraries are the likes of PyTorch, TensorFlow, Scikit-learn, fastai, Pandas, NumPy, SciPy, and Matplotlib. PyCharm has been the most popular IDE of all of them, but many use VSCode or a notebook-like environment, i.e. Jupyter Lab or Jupyter Notebook, to make their development ends meet. There are a lot of variants of the latter available in the form of online services or portals where such development can be performed, e.g. Google Colab, Neptune, Kaggle, etc.
The reason this family of languages, tools, libraries, and frameworks dominates the AI field goes back to academia and researchers, where experimenting and prototyping to prove something was encouraged. Languages like Python, R, Julia, and Matlab became popular because it was easy to write and understand them and to develop prototypes and experiments with them. This led to a massive explosion of libraries and packages created and shared by the communities around the various languages, and others came in and built on top of them, as in any other thriving community.
I have personally found PyTorch and fastai to be great tools to work with, as they abstract away many aspects of the AI/ML development process, leaving you with readable and maintainable code. I usually switch between PyCharm and VSCode, and sometimes even Sublime Text and the CLI, to make progress on different ML projects. But I’m equally invested in using Jupyter Lab and Kaggle’s interfaces (very rarely I may use Google Colab). For those who are familiar with REPL-based development, the notebook paradigm offers the best of both worlds (REPL and traditional coding). I spoke about this at one of the jOnConf conferences last year.
Java Daily: Do you think that polyglot developers are the future of this industry? And in the general software development field, what is your opinion on being very proficient in a certain language or framework vs. working with different languages?
This is an interesting question. Even though we may not have realised (or accepted) it, many Java developers, and developers of other programming languages, have knowingly (or unknowingly) always been polyglot developers of some sort. This becomes clearer when we look at the variety of systems and platforms any organisation depends on today. Every framework or technology they use depends on a service, product, or integration, either internal or from a third party. Developers who support these heterogeneous environments and work closely with them automatically tend to become polyglot developers, and are no longer just Java, NodeJS, or Ruby developers. So knowing one language well (being proficient) is always going to help, be recognised, and be in demand, but knowing multiple languages (even if not in depth) has also been in growing demand for some time, because of the increasingly heterogeneous technological environments described above. That breadth enables us to develop and support our own solutions and those created by others, both individually and as part of a team.
Java Daily: Can Java and the AI field become best friends? (Or, more profoundly: with the growing use of ML/AI/DL, do you think Java can manage to create a symbiosis with the field?)
If we consider ML/AI/DL to be programming techniques, and Java a technology or tool that can be used to develop or implement such techniques, then we can say that Java certainly meets that goal. Java is already playing a role in many areas of the AI field: for example, Apache Spark, Apache Zeppelin, grCUDA, and many other such projects on the Java platform are already in the wild. There are many AI/ML libraries out there, like DeepLearning4J, Apache OpenNLP, DeepNetts, and most recently Tribuo (from Oracle Labs); see https://github.com/neomatrix369/awesome-ai-ml-dl/blob/master/details/java-jvm.md#java for details of many such libraries, packages, and frameworks. Java is a tried and tested, stable, and reliable platform, and this leads to building performant, scalable, and robust systems (or applications). Not many other platforms may be able to meet such requirements for building and deploying real-world, performant applications at scale.
Java Daily: As a passionate speaker, what do you think is the future of conferences: back to the classic live gatherings, or more and more virtual ones?
I think a mix of the two is what will work for many of us. But it would certainly be good to meet fellow developers and professionals in the industry in person and have a huddle, like we did in the years before.
Java Daily: Can neural networks learn to predict pragmatic inferences?
Maybe in time, with more research and development in this area and with checks and balances put in place, we could revisit this line of inquiry. It may be a bit early to say with certainty that we can achieve this very soon, but one step at a time.
Java Daily: What is your opinion on the best way to detect fake news?
This question is similar to asking “what is the best way to detect real/genuine news?” Just as we get better at detecting fake news, its creators and sources get better at making it look more and more genuine, so it’s a hard problem to tackle and to eradicate totally. It’s a complex problem to understand, with many moving parts. We do not have control over, or oversight of, all of them, hence no solution may withstand the pressures exerted by the problem for long.
Java Daily: Can you recommend a good book (can be both technical and non-technical)?
I’m still very fascinated by “Working Effectively with Legacy Code” by Michael Feathers as a technical book to read; its techniques and ideas continue to be relevant. A non-technical book which I always recommend to my fellow developer/engineer friends is “Deep Work” by Cal Newport.
Is there anything else you would like to ask Mani Sarkar? What is your opinion on the questions asked? Who would you like to see featured next? Let’s give back to the Java community together!