{"componentChunkName":"component---src-templates-article-js","path":"/projects/pytextree","result":{"data":{"markdownRemark":{"frontmatter":{"title":"PyTexTree - LaTeX Trees!","date":"1.1.2020","cover":{"childImageSharp":{"fluid":{"base64":"data:image/jpeg;base64,/9j/2wBDABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2P/2wBDARESEhgVGC8aGi9jQjhCY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2P/wgARCAAPABQDASIAAhEBAxEB/8QAGQAAAgMBAAAAAAAAAAAAAAAAAAMBAgQF/8QAFAEBAAAAAAAAAAAAAAAAAAAAAP/aAAwDAQACEAMQAAAB58soYx4f/8QAGhAAAwEAAwAAAAAAAAAAAAAAAAECEQMSIf/aAAgBAQABBQKdK9LWVxvtDeD0/8QAFBEBAAAAAAAAAAAAAAAAAAAAEP/aAAgBAwEBPwE//8QAFBEBAAAAAAAAAAAAAAAAAAAAEP/aAAgBAgEBPwE//8QAFhABAQEAAAAAAAAAAAAAAAAAECEx/9oACAEBAAY/Am6//8QAHRAAAgIDAAMAAAAAAAAAAAAAAAERITFBUWFxkf/aAAgBAQABPyFIRzT2I0lcodBeCi9D8jKrKRA6n6f/2gAMAwEAAgADAAAAEPMP/8QAFBEBAAAAAAAAAAAAAAAAAAAAEP/aAAgBAwEBPxA//8QAFBEBAAAAAAAAAAAAAAAAAAAAEP/aAAgBAgEBPxA//8QAHRABAQACAwADAAAAAAAAAAAAAREAITFBUWGxwf/aAAgBAQABPxATi8pcPhwMwggzmfq5MQQWt3lYgil4xYXedmo9/WH1m2ir3n//2Q==","aspectRatio":1.3333333333333333,"src":"/static/30850808830830d188757c15fafa06ac/69755/tree.jpg","srcSet":"/static/30850808830830d188757c15fafa06ac/49b36/tree.jpg 512w,\n/static/30850808830830d188757c15fafa06ac/16310/tree.jpg 1024w,\n/static/30850808830830d188757c15fafa06ac/69755/tree.jpg 2048w,\n/static/30850808830830d188757c15fafa06ac/c02f3/tree.jpg 2059w","sizes":"(max-width: 2048px) 100vw, 2048px"}}}},"html":"<blockquote>\n<p>As explained in the <a href=\"/projects/texmindmapper\">TexMindMapper post</a>, I wanted to visualise my thesis as a mind map, or any other logical structure for that matter. Naturally, to achieve visualisation, one needs a clearly defined structure, which hopefully also is logical and easily accessible. 
In other words, I needed a way to convert the text of my thesis, written in LaTeX, into something a bit more data-analysis friendly.</p>\n</blockquote>\n<h2>There most likely already exists something, right?</h2>\n<p>As I was writing my thesis in LaTeX, I had hoped that some fellow nerd had already solved the problem of parsing the raw text into some nice data structure - preferably into a tree, which would be easy to visualise. This, however, was not exactly the case. I did indeed find several libraries for editing LaTeX files with Python and compiling them, but nothing that I could use for my purposes. The one closest to what I had in mind was <a href=\"https://github.com/alvinwan/tex2py\">tex2py</a>, but unfortunately it felt far too cumbersome and a bit buggy for my taste; I would have spent more time learning it than whipping up my own.</p>\n<p>Therefore I decided to implement a simple library for converting LaTeX files into a tree structure.</p>\n<h2>What did I need, exactly?</h2>\n<p>I was lucky, considering the planning phase, that I had such a clear idea of what I needed as output from the script: a list of the nodes, and the edges between them. So basically more a graph than a tree, but the two are often rather interchangeable, at least for an engineer... Additionally, I hoped to present some informative statistics on the chapters in my mind map, such as word count or the number of images, tables and citations. 
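To make this concrete, here is a minimal sketch of the kind of output I was after (all names here are hypothetical, not the actual pytextree API): nodes carrying some statistics, plus a helper that flattens the tree into the node and edge lists a graph tool expects.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the target structure; the names are illustrative,
# not the actual pytextree API.
@dataclass
class SectionNode:
    name: str
    word_count: int = 0    # statistics to later show in the mind map
    n_figures: int = 0
    n_citations: int = 0
    children: list = field(default_factory=list)

def to_graph(node):
    """Flatten a tree into (nodes, edges) lists, as graph tools expect."""
    nodes, edges = [node], []
    for child in node.children:
        edges.append((node.name, child.name))
        child_nodes, child_edges = to_graph(child)
        nodes.extend(child_nodes)
        edges.extend(child_edges)
    return nodes, edges

# A document with sections S1 and S2, and a subsection S1.S1 under S1:
doc = SectionNode("Document", children=[
    SectionNode("S1", children=[SectionNode("S1.S1")]),
    SectionNode("S2"),
])
nodes, edges = to_graph(doc)
# edges -> [('Document', 'S1'), ('S1', 'S1.S1'), ('Document', 'S2')]
```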
Parsing the text was a natural place to perform such analysis as well.</p>\n<div class=\"gatsby-highlight\" data-language=\"tex\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-tex line-numbers\"><code class=\"language-tex\"><span class=\"token function selector\">\\documentclass</span><span class=\"token punctuation\">{</span><span class=\"token keyword\">article</span><span class=\"token punctuation\">}</span>\n<span class=\"token function selector\">\\begin</span><span class=\"token punctuation\">{</span><span class=\"token keyword\">document</span><span class=\"token punctuation\">}</span>\n   <span class=\"token function selector\">\\section</span><span class=\"token punctuation\">{</span><span class=\"token headline class-name\">S1</span><span class=\"token punctuation\">}</span><span class=\"token function selector\">\\label</span><span class=\"token punctuation\">{</span><span class=\"token keyword\">sec:S1</span><span class=\"token punctuation\">}</span>\n   Lorem ipsum dolor sit amet. Culpa laboris ut tempor deserunt magna fugiat aliqua.\n\n   <span class=\"token function selector\">\\subsection</span><span class=\"token punctuation\">{</span><span class=\"token headline class-name\">S1.S1</span><span class=\"token punctuation\">}</span><span class=\"token function selector\">\\label</span><span class=\"token punctuation\">{</span><span class=\"token keyword\">sec:S1S1</span><span class=\"token punctuation\">}</span>\n   At vero eos et accusamus. Lorem: <span class=\"token function selector\">\\ref</span><span class=\"token punctuation\">{</span><span class=\"token keyword\">sec:S1</span><span class=\"token punctuation\">}</span>. 
Quis minim est fugiat minim.\n\n   <span class=\"token function selector\">\\section</span><span class=\"token punctuation\">{</span><span class=\"token headline class-name\">S2</span><span class=\"token punctuation\">}</span><span class=\"token function selector\">\\label</span><span class=\"token punctuation\">{</span><span class=\"token keyword\">sec:S2</span><span class=\"token punctuation\">}</span>\n   Ipsum consectetur occaecat ullamco duis mollit nostrud cillum.\n\n\n<span class=\"token function selector\">\\end</span><span class=\"token punctuation\">{</span><span class=\"token keyword\">document</span><span class=\"token punctuation\">}</span></code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p><em>Example LaTeX document.</em></p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-text line-numbers\"><code class=\"language-text\">Document\n├── Section: S1\n│   └── Subsection: S1.S1\n└── Section: S2</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span></span></pre></div>\n<p><em>The LaTeX document above, presented in tree form.</em></p>\n<p>The structure of any work of literature can be considered a tree, where the sections work as nodes, subsections as their children, sub-subsections as grandchildren, etc. These nodes can then have properties, such as word count, themes and images. However, it was not immediately obvious to me where to draw the line between the nodes' contents: was the text in a subsection also to be included in the encapsulating section? 
In the end I decided against it, as that would provide a clearer result. Moreover, the content could still be accumulated over an entire chapter to present the node and its children as a whole.</p>\n<h2>Time to make it do what I want it to do</h2>\n<p>I used <a href=\"https://anytree.readthedocs.io/en/latest/\">anytree</a> as a base for the tree structure to avoid reinventing the wheel. This wonderful library saved me a ton of time. Using their <code class=\"language-text\">NodeMixin</code> class as a base for my nodes let me simply add the functionality my implementation needed.</p>\n<p>The LaTeX format was generally simple to work with, but it also provided its own challenges. At first, the vast number of ways to present the same information seemed overwhelming to analyse reliably yet effectively. For example, the preamble options, the multitude of packages and the different styles caused some headaches when writing the regular expressions to extract the needed information. 
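To give a flavour of the kind of pattern involved, here is a simplified, hypothetical example - not pytextree's actual code - that picks out sectioning commands and their optional labels in a single pass over the text:

```python
import re

# Simplified, hypothetical pattern - not pytextree's actual implementation.
# Captures \section / \subsection / ... titles plus an optional trailing \label.
SECTION_RE = re.compile(
    r"\\(?P<level>(?:sub)*section)\{(?P<title>[^}]*)\}"
    r"(?:\s*\\label\{(?P<label>[^}]*)\})?"
)

tex = r"""
\section{S1}\label{sec:S1}
Lorem ipsum dolor sit amet.
\subsection{S1.S1}\label{sec:S1S1}
At vero eos et accusamus.
"""

matches = [(m["level"], m["title"], m["label"]) for m in SECTION_RE.finditer(tex)]
# matches -> [('section', 'S1', 'sec:S1'), ('subsection', 'S1.S1', 'sec:S1S1')]
```

Real documents would of course need more than this (starred variants, chapters, braces in titles), which is exactly where the headaches came from.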
I decided to go with the simplest path and add more features later.</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-text line-numbers\"><code class=\"language-text\">\\newenvironment{&lt;name&gt;}{&lt;begin-code&gt;}{&lt;end-code&gt;}</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span></span></pre></div>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-text line-numbers\"><code class=\"language-text\">\\newenvironment{&lt;name&gt;}[&lt;n&gt;]{&lt;begin-code&gt;}{&lt;end-code&gt;}</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span></span></pre></div>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-text line-numbers\"><code class=\"language-text\">\\newenvironment{&lt;name&gt;}[&lt;n&gt;][&lt;default&gt;]{&lt;begin-code&gt;}{&lt;end-code&gt;}</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span></span></pre></div>\n<p><em>Three different ways to define an environment in LaTeX, with and without optional arguments</em></p>\n<p>Then there was also the part where my meticulous nature got the better of me (I had the time to procrastinate on the other stuff, right?). You see, text documents would generally never be a performance issue. However, I wanted the algorithm for traversing the text to at least look nice. I did not want nasty performance hogs, such as multiple scans of the document or recursion. So I ended up spending a bit more time tinkering with the parsing than I had initially planned.</p>\n<p>The content within the environments and sections would also be analysed in a very simple manner. 
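As a rough illustration (an assumption-laden sketch, not the library's actual implementation), the per-node analysis boils down to pulling out the commands and comments and counting whatever words remain:

```python
import re

# Illustrative sketch only - not pytextree's actual code. Extract commands
# and comments from a node's text, then count the remaining words.
COMMAND_RE = re.compile(r"\\[a-zA-Z]+(?:\{[^}]*\})?")
COMMENT_RE = re.compile(r"%.*")

def analyse(text):
    commands = COMMAND_RE.findall(text)
    comments = COMMENT_RE.findall(text)
    # Remove comments first so commands inside comments are not counted twice.
    stripped = COMMAND_RE.sub("", COMMENT_RE.sub("", text))
    return {
        "commands": commands,
        "comments": comments,
        "word_count": len(stripped.split()),
    }

info = analyse(r"See \ref{fig:my_fig} for details. % A comment")
# info -> {'commands': ['\\ref{fig:my_fig}'], 'comments': ['% A comment'], 'word_count': 3}
```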
Each of the following was extracted for each generated node (environment, section or similar) if it existed:</p>\n<table>\n<thead>\n<tr>\n<th>Description</th>\n<th>Example</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>List of found LaTeX commands</td>\n<td>[<code class=\"language-text\">\\textbf</code>, <code class=\"language-text\">\\ref{fig:my_fig}</code>]</td>\n</tr>\n<tr>\n<td>List of comments</td>\n<td>[\"% A comment\"]</td>\n</tr>\n<tr>\n<td>Number of words, excluding comments and commands</td>\n<td>527</td>\n</tr>\n<tr>\n<td>LaTeX label of the node if one exists</td>\n<td>\"fig:my_graph\"</td>\n</tr>\n<tr>\n<td>Cited labels</td>\n<td>[\"Lamport1984\", \"Rossum1991\"]</td>\n</tr>\n<tr>\n<td>Text contents</td>\n<td>[\"This is a paragraph.\", \"And another.\"]</td>\n</tr>\n</tbody>\n</table>\n<h2>What came of it?</h2>\n<p>But it worked! I managed to cram in the needed functionality too! Now I could simply import the library in the normal Python manner and get analysing. Here are some of the most interesting features:</p>\n<ol>\n<li>Parse multiple files (nice for large LaTeX projects)</li>\n<li>Traverse the document like a tree</li>\n<li>Figures, tables, lists and other elements as their own separate nodes</li>\n<li>\n<p>Information on the nodes based on their content</p>\n<ol>\n<li>Texts stored in the nodes</li>\n<li>Citations as labels</li>\n<li>Comments also stored within the nodes</li>\n<li>Commands as well! 
(Makes references available)</li>\n<li>Word count within the element</li>\n</ol>\n</li>\n<li>Pretty print function to make the devs' life better ;)</li>\n<li>\n<p>Export into a graph</p>\n<ul>\n<li>Turns nodes into, well, nodes</li>\n<li>Turns child relations into edges</li>\n<li>Includes sensible information from the node (not full texts, comments or other large text content)</li>\n<li>Option to export into a <code class=\"language-text\">.csv</code> file</li>\n</ul>\n</li>\n</ol>\n<p>I had worked with <a href=\"https://gephi.org/\"><em>Gephi</em></a> before to analyse graphs and had found it very useful. I therefore also added a simple export script for generating <code class=\"language-text\">.csv</code> files that can be imported directly into <em>Gephi</em>, so you can analyse your graph with relative ease.</p>\n<h2>Share the goodies for others as well, kid</h2>\n<p>By the time I had finished a version of this script, I was rather happy with it - both quality- and usefulness-wise. I thought that perhaps by sharing it in the ever-expanding space of <a href=\"https://pypi.org/\">PyPI</a> I could hopefully reduce the effort of some poor fellow battling similar problems in the future (and also install the package conveniently should I need it later in other projects). Now anyone can install the package simply by using pip:</p>\n<div class=\"gatsby-highlight\" data-language=\"bash\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-bash line-numbers\"><code class=\"language-bash\">pip <span class=\"token function\">install</span> pytextree</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span></span></pre></div>\n<p>Excellent! I had just developed a nice, coherent module for a bigger project, and managed to share it with the world - hopefully bringing a little sunshine to someone's cloudy day! 
If you are interested in doing something similar, there are some <a href=\"https://packaging.python.org/tutorials/packaging-projects/\">excellent instructions</a> available!</p>\n<hr>"}},"pageContext":{"slug":"pytextree","navContext":{"next":null,"prev":{"path":"/projects/exifextractor","title":"ExifExtractor - Find your edits","slug":"exifextractor","links":["https://www.github.com/PebbleBonk/ExifAnnotator"]}},"links":["https://www.github.com/PebbleBonk/pytextree","https://pypi.org/project/pytextree/"]}}}