Previous work
In recent years, the field of machine learning for ancient languages has gained remarkable momentum, driven by increased digitization efforts (creating standardized datasets of texts, metadata and images of ancient written evidence), by advances in machine learning architectures (for example, the Transformer31) and by increased computational power. This progress, spanning numerous languages, scripts and tasks, has been extensively documented in works such as refs. 11,32 as well as in task-specific studies13,33,34.
Work on restoration (including the tasks of reassembling fragments, restoring text and enhancing quality) encompasses a number of machine learning methods, modalities and ancient written evidence, including inscriptions in cuneiform17,35,36, ancient Greek14,15, Linear B37, Hebrew38, old Chinese39, Indus40,41,42, old Cham43 and Oracle Bone16,44; as well as papyri in Coptic45, Hebrew46 and ancient Greek47,48; and manuscripts in old Korean49, ancient Shui50 and Tamil51. One of the efforts most closely related to Aeneas is work on multimodal old Chinese ideograph restoration52. However, replicating this approach for Latin inscriptions is limited by the quality and consistency of annotation in existing datasets. Moreover, this method is confined to the reconstruction of single ideographs, and does not extend to broader epigraphic tasks. In terms of evaluating human performance on the task of restoration, Assael et al.15 was the first to establish a measure of joint human–AI performance in a real-world setting, an evaluation framework that has since been adopted by subsequent studies53.
As for the challenge of restoring text of unknown length, this has been approached primarily by Shen et al.19, but their application to ancient languages uses a benchmark in which the restoration length is known.
Work on ancient text attribution, both geographical and chronological, is less common, and to our knowledge, only Assael et al.15 has attempted to tackle together the three tasks of restoring, dating and placing ancient Greek inscriptions. Other notable efforts on dating include those on Kannada inscriptions54, on Arabic manuscripts55, on Coptic papyri56, on old Chinese manuscripts57, on cuneiform tablets58, on Oracle Bone inscriptions59, on Korean Hanja60, and on Greek papyri61. The only other work on geographical attribution is on Greek literary texts62. Although the findspot of an inscription often indicates its place of writing, geographical attribution becomes important in cases of objects that have been moved around during the ancient or medieval periods63, or in light of early modern collecting habits, as well as the illicit trade in antiquities.
With regard to Latin, recent efforts have focussed on Latin literary evidence to tackle a range of tasks, including intertextuality64, part-of-speech tagging65, translation66, authorship attribution67,68 and literary text restoration69. But despite the existence of large-scale Latin epigraphic datasets, many of which use the EpiDoc XML encoding gold-standard70,71,72,73 and include images of inscriptions, very little work has attempted to apply machine learning techniques to Latin epigraphy—although quantitative approaches to Latin epigraphy using statistical techniques are continuously breaking new ground74,75,76,77. Early efforts include work on the Vindolanda stylus tablets21,22, attempting to develop an image processing and pattern recognition pipeline for character recognition. More recently24, a classifier has been developed to automate the identification and labelling of types of Latin inscriptions from the poorly standardized EDCS_ETL dataset using patterns learnt from the more richly annotated EDH dataset; and text detection methods have been applied to segment characters and analyse letters across a large dataset of Latin inscription images in order to isolate letter-cutting workshops23.
Latin Epigraphic Dataset
Dataset generation
To create the LED, we processed the EDR, EDH and EDCS_ETL databases, resulting in the largest machine-actionable Latin inscription dataset to date (Extended Data Table 1). These databases collect inscriptions from various Roman provinces and historical periods, enhancing the diversity and temporal scope of LED. All databases were available under a Creative Commons Attribution 4.0 license via Zenodo (the open repository for EU-funded research outputs). To ensure consistency across the LED dataset, we standardized all metadata relating to dates and historical periods, converting them to numerals within the range of 800 bce to 800 ce. Inscriptions outside this range were excluded. Province names obtained from EDR, EDH and EDCS_ETL were also standardized and merged.
To render the text machine-actionable, we applied a filtering ruleset to systematically process human annotations. Historians’ epigraphic annotations (the Leiden conventions) were either stripped or normalized to preserve the closest version of the original inscribed text. Latin abbreviations were left unresolved, whereas word forms showing alternative spellings for diachronic, diatopic or diastratic reasons (for example, bixit for vixit) were preserved to enable the model to learn their epigraphically, geographically or chronologically specific variations. Missing characters restored by editors (conventionally annotated within square brackets, and typically restored on the basis of grammatical and syntactical patterns and the reconstructed physical layout of an inscription) were retained. Missing characters that cannot be definitively restored by editors (conventionally represented using hyphens as placeholders, with each hyphen corresponding to one missing character) were also retained. When the exact number of missing characters was indeterminate, we used the hash (#) symbol as a placeholder to denote this uncertainty. Extra spaces were collapsed to ensure clean and concise outputs. Non-Latin characters were stripped using an accent removal function, leaving only Latin characters, predefined punctuation and placeholders. Duplicate inscriptions were excluded using their unique Trismegistos identifiers when available, supplemented by additional deduplication using fuzzy string matching and MinHash locality-sensitive hashing78: texts exceeding a 90% content similarity threshold were considered duplicates, resulting in the removal of one text from each identified pair. Inscriptions under 25 characters in length were filtered out to focus on substantial textual content, essential for the model’s learning and generalization capacities. For dataset partitioning, inscriptions whose numerical Trismegistos (or, alternatively, EDCS_ETL) identifiers ended in 3 or 4 were held out and allocated to the test and validation sets, respectively, following previous work15.
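To make the deduplication and partitioning rules concrete, the following sketch illustrates one way they might be implemented. It is not the pipeline used to build LED: the record fields (`id`, `text`), helper names and n-gram choice are hypothetical, and it assumes the datasketch library for MinHash locality-sensitive hashing.

```python
# Illustrative sketch (not the authors' code) of LED-style deduplication and
# train/validation/test partitioning, assuming the datasketch library.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature over character 3-grams of an inscription."""
    m = MinHash(num_perm=num_perm)
    for i in range(len(text) - 2):
        m.update(text[i:i + 3].encode("utf8"))
    return m

def deduplicate(records: list[dict]) -> list[dict]:
    """Keep one text from every near-duplicate pair (~90% estimated similarity)."""
    lsh = MinHashLSH(threshold=0.9, num_perm=128)
    kept = []
    for rec in records:
        if len(rec["text"]) < 25:          # discard very short inscriptions
            continue
        sig = minhash(rec["text"])
        if lsh.query(sig):                 # a near-duplicate is already kept
            continue
        lsh.insert(rec["id"], sig)
        kept.append(rec)
    return kept

def split(record: dict) -> str:
    """Partition by the last digit of the Trismegistos (or EDCS_ETL) identifier."""
    last = str(record["id"])[-1]
    return {"3": "test", "4": "validation"}.get(last, "train")
```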
Images were sourced exclusively from EDR and EDH. To maintain high data quality and ensure standardization across the dataset, we implemented an automated filtering process. This process removed drawings, squeezes and other non-photographic artefacts by applying thresholds to colour histograms, specifically targeting and eliminating images composed primarily of a single solid colour. Additionally, we used the variance of the Laplacian to identify and discard blurry images, leveraging the principle that blurry images exhibit lower variance in their Laplacian response (that is, fewer sharp intensity transitions). The cleaned images were then converted to greyscale, as this was the predominant format in the original dataset. For each inscription, only a single representative image was kept, excluding non-inscribed surfaces.
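A minimal sketch of this kind of image filtering is given below, using OpenCV and NumPy; the thresholds and function names are illustrative assumptions, not the published values.

```python
# Illustrative image filtering: blur detection via variance of the Laplacian
# and rejection of images dominated by a single intensity value.
import cv2
import numpy as np

def is_blurry(gray: np.ndarray, threshold: float = 100.0) -> bool:
    """Blurry images show low variance of the Laplacian response."""
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold

def is_mostly_single_colour(gray: np.ndarray, max_fraction: float = 0.8) -> bool:
    """Drawings and squeezes tend to be dominated by one solid intensity."""
    hist = cv2.calcHist([gray], [0], None, [32], [0, 256]).ravel()
    return hist.max() / hist.sum() > max_fraction

def preprocess(path: str):
    """Return a cleaned 224x224 greyscale image, or None if it should be discarded."""
    image = cv2.imread(path)
    if image is None:
        return None
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    if is_blurry(gray) or is_mostly_single_colour(gray):
        return None
    return cv2.resize(gray, (224, 224))
```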
Dataset limitations
Although LED is the largest machine-actionable corpus of Latin inscriptions compiled to date, its size (16 million characters) remains a significant limitation compared with the scale of datasets typically used in state-of-the-art natural language processing research. This relative scarcity of data inevitably constrains the model’s capacity to generalize and may limit its performance on rarer epigraphic phenomena or under-represented regions and periods. Crucially, the available corpus is also subject to inherent biases, most significantly inscription survival bias, potentially skewing the data towards certain materials, locations or historical contexts. This limitation is even more pronounced for the image modality, where only approximately 5% of the textual inscriptions have corresponding images. As a result, although saliency maps provided valuable insights for the textual modality in the geographical attribution task, the artefacts highlighted by the image saliency maps were often less interpretable by domain experts. Moreover, the task of chronological attribution could also potentially benefit from additional images, which might allow for better alignment with palaeographic arguments.
Extended Data Figs. 3–6 provide an extended performance analysis broken down by decade and province, revealing that performance often tends to be weaker where data is limited. We therefore emphasize that large, open, linked, standardized multimodal datasets are key for advancing the field, and hope that initiatives such as ours might demonstrate the impact of digital epigraphic publication and catalyse further efforts.
The question of data circularity
As was acknowledged in Assael et al.15 (see ‘Data circularity’), the dataset contains within it an element of circularity. Editors of inscriptions traditionally restore two elements: they expand symbols and abbreviations, identified using (), and they attempt to restore missing text, using []. Aeneas similarly offers hypothetical restorations for missing text. In preparing the dataset we removed expansions, notwithstanding that their expansion is normally almost certain, as these letters did not appear on the stone originally; however, we retained previous editors’ restorations of text originally carved but now lost. Restorations are based on parallels and contextual knowledge, and best practice is to offer such restorations only when they carry a high level of confidence (as stated in a recent manual, ‘one must not forget that the task is to restore the document, and not to remake it’ (our translation from ref. 79, page 67)). Nonetheless, it may be objected that by including such previous restorations in the training set there is a risk of confirmation bias, especially as not all scholars are consistently rigorous. As the available datasets do not provide information on editorial responsibility, and do not provide consistent or documented access to alternative editions, alternative approaches such as controlling for editorial quality, or even increasing the size of the dataset by including alternative editions, could not be adopted.
The primary motivation for inclusion of this material was the limited availability of data. In preparing the I.PHI dataset for Ithaca, we computed that excluding the text within square brackets would cost us 20% of the total texts available. Because deep learning models can greatly benefit from vast amounts of data, and our dataset is multiple orders of magnitude smaller than recent NLP datasets, we wanted to harness all available information to avoid overfitting and assist generalization. To assess the impact of this decision, we conducted additional experiments to evaluate the reliability of outputs when retaining the conjectured textual restorations in square brackets. Specifically, we trained a new model excluding previous hypothetical conjectures and evaluated both models’ performance on the test set without conjectures.
The differences between the models trained with and without conjectured restorations were less than 5%, with the model trained without conjectured restorations underperforming in all tasks compared with our original model in the given evaluation setup. We concluded that the benefit of improved performance outweighed the risk of bias, and the given evaluation setup (that is, using a model that included conjectures) was selected in this case because the same approach was adopted in Assael et al.15 and therefore allows us to compare with previous work and estimate the baseline performance. Epigraphers commonly refer to the phenomenon of ‘history from square brackets’80, which describes the reliance, for historical reconstruction, on the conjectural restoration of specific information in individual texts. This particular risk is arguably much lower here, as the model works as an information ‘compressor’, creating multiple levels of abstraction of the raw data and thereby vastly reducing the influence of any particular unwarranted and historically specific conjecture.
Nonetheless, the risk of a broader bias must be acknowledged, and future work might seek to address this as the quality and quantity of the available data improve. Using a model trained on data excluding conjectural restorations, one might test existing editorial restorations and so identify biases in previous editorial work, utilizing the model to identify outliers. Going a step further, such a model might even serve to identify more or less reliable editors among past epigraphers, and could play a substantial role in the ongoing work of revising existing epigraphic editions.
Aeneas’ architecture
Aeneas is trained to perform four primary tasks: the restoration of a set character length, the restoration of an unknown lacuna length, geographical attribution and chronological attribution.
The input provided to Aeneas’ architecture for each inscription consists of a character sequence (including spaces) and a corresponding greyscale image of size 224 × 224. The maximum sequence length is 768 characters. Two special symbols are included in the input to annotate missing information: ‘-’ for a single missing character and ‘#’ for a missing segment of unknown length. Additionally, the sequence is padded with a start-of-sentence token ‘<’. The textual inputs are processed through the model’s torso, which is based on a large-scale transformer architecture derived from the T5 (ref. 25) model and adapted to use rotary embeddings. The torso features an embedding dimension of 384, query-key-value dimensions of 32 and a multi-layer perceptron (MLP) size of 1,536. It consists of 16 layers, each with 8 attention heads. The torso outputs a sequence of embeddings with a length equal to the input sequence; each embedding is a 1,536-dimensional vector. These embeddings are passed to four task-specific heads: restoration, unknown-length restoration prediction, geographical attribution and chronological attribution. Each task head consists of a two-layer MLP followed by a softmax function. The model was trained for one week using 64 Tensor Processing Unit v5e chips on the Google Cloud platform, with a batch size of 1,024 text–image pairs, using the LAMB81 optimizer. The learning rate follows a schedule with a peak value of 3 × 10−3, a warm-up phase of 4,000 steps and a total of 1 million steps. Bayesian optimization is used to fine-tune the weighting of the loss (L) for each task, combining them as follows:
$$L=3L_{\mathrm{restoration}}+L_{\mathrm{unknown}}+2L_{\mathrm{region}}+1.25L_{\mathrm{date}}$$
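As an illustration of how these weights enter training, a minimal sketch of the weighted multi-task objective is given below; the per-task loss values are assumed to be scalars already computed by the corresponding task heads, and the names are ours rather than the published implementation.

```python
# Illustrative combination of the four task losses with the weights tuned by
# Bayesian optimization, as given in the equation above.
TASK_WEIGHTS = {
    "restoration": 3.0,
    "unknown": 1.0,
    "region": 2.0,
    "date": 1.25,
}

def total_loss(losses: dict[str, float]) -> float:
    """Weighted sum of per-task losses."""
    return sum(TASK_WEIGHTS[name] * value for name, value in losses.items())
```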
To mitigate overfitting, especially given the limited dataset size, several data augmentation techniques are applied during training. These techniques include up to 75% text masking, text clipping, word deletion, punctuation dropping, and image augmentations such as zooming, rotation, and adjustments to brightness and contrast. A dropout of 10% and label smoothing are also used, with smoothing rates of 5% for the restoration task and 10% for geographical attribution. This multi-task setup, combined with the training and augmentation strategies, allows Aeneas to achieve robust performance across all four epigraphic tasks.
Training Aeneas
To better understand the underlying processes during Aeneas’ training, this section provides a detailed overview of the inputs and outputs involved in the model’s restoration and attribution tasks.
For the restoration task, ground truths are obtained by artificially corrupting the inscription’s text, masking up to 75% of its characters. Some of these masks are deliberately grouped into contiguous segments to better simulate real-world damage. When the corruption length is known, Aeneas predicts the missing characters directly. For unknown-length restoration, an additional neural network head is incorporated, using binary cross-entropy to predict whether one or more characters are missing whenever the unknown-length symbol (#) is encountered. Furthermore, the model’s architecture maintains alignment between input characters and task outputs: Aeneas’ torso embeddings, corresponding to input text characters, are directly mapped to their positions in the sequence. For each missing character (each annotated with a ‘?’), the corresponding embedding is fed to the restoration task head, which predicts the missing character(s). For unknown-length restoration, the additional task head is activated whenever the ‘#’ symbol appears in the input sequence, determining whether a single character or multiple characters are missing. This design enables the model to handle the restoration and attribution tasks efficiently.
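The following is a simplified sketch of this corruption procedure; the mask symbol, span lengths and sampling choices are illustrative assumptions, not the published implementation.

```python
# Illustrative corruption of an inscription to create restoration targets:
# up to 75% of characters are masked, with masks grouped into contiguous
# spans to mimic physical damage.
import random

def corrupt(text: str, max_ratio: float = 0.75, mask: str = "-"):
    """Return (corrupted_text, targets), where targets maps position -> true character."""
    chars = list(text)
    n_to_mask = int(len(chars) * random.uniform(0.0, max_ratio))
    targets: dict[int, str] = {}
    attempts = 0
    # Keep sampling contiguous spans until enough characters are masked.
    while len(targets) < n_to_mask and attempts < 10 * max(1, len(chars)):
        attempts += 1
        start = random.randrange(len(chars))
        span = random.randint(1, 10)          # length of the damaged span
        for i in range(start, min(start + span, len(chars))):
            if len(targets) >= n_to_mask or chars[i] == mask:
                continue
            targets[i] = chars[i]
            chars[i] = mask
    return "".join(chars), targets
```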
To generate Aeneas’ textual restoration predictions, we use a beam search with a beam width of 100. Additionally, we implement a non-sequential beam search that incorporates the unknown-length prediction. Each beam starts with the restoration candidate with the highest confidence score and proceeds iteratively, restoring the characters with the highest certainty at each time step. If an unknown-length restoration character is found, a missing character is prepended, and two entries are appended to the beam: the first keeps the unknown-length symbol, while the other removes it. This approach accounts for both scenarios: whether more than one character needs to be restored or only a single character is missing. The geographical and chronological attribution tasks use the first output embedding of the torso (at t = 1), which is passed to their respective task heads. Geographical attribution predicts one of 62 Roman provinces using categorical cross-entropy with ground-truth labels, when available. Chronological attribution maps historical dates between 800 bce and 800 ce into 160 discrete decade bins. Kullback–Leibler divergence is used to match the predicted distributions with the ground-truth ranges provided by historians. The visual inputs are processed using a ResNet-8 (ref. 82) neural network. The resulting outputs are concatenated with the relevant textual embeddings and jointly processed by the geographical attribution head.
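A sketch of the chronological target construction follows. The uniform spreading of the ground-truth interval over its decade bins, the direction of the divergence and all names are assumptions for illustration only.

```python
# Illustrative decade binning (800 BCE - 800 CE -> 160 bins) and KL scoring
# of a predicted date distribution against a ground-truth range.
import numpy as np

N_BINS, YEAR_MIN = 160, -800          # negative years denote BCE

def decade_bin(year: int) -> int:
    """Map a year to its decade bin index in [0, N_BINS)."""
    return min(N_BINS - 1, max(0, (year - YEAR_MIN) // 10))

def date_target(gt_min: int, gt_max: int) -> np.ndarray:
    """Uniform distribution over the decades covered by the ground-truth range."""
    lo, hi = decade_bin(gt_min), decade_bin(gt_max)
    target = np.zeros(N_BINS)
    target[lo:hi + 1] = 1.0 / (hi - lo + 1)
    return target

def kl_divergence(target: np.ndarray, pred: np.ndarray, eps: float = 1e-9) -> float:
    """KL(target || pred) between ground-truth and predicted decade distributions."""
    return float(np.sum(target * np.log((target + eps) / (pred + eps))))
```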
Finally, the effectiveness of saliency maps remains an active topic of discussion83. However, the historians on our team have generally found them to be a valuable explainability tool, particularly for textual inputs, and for this reason we decided to include them among the outputs.
Aeneas’ contextualization mechanism
Aeneas’ contextualization mechanism can be framed as an embedding within a multidimensional space, where each inscription is positioned so that the closest neighbours correspond to the parallels a historian would use to ground their research. In the absence of ground-truth data for contextualization, we construct this embedding space using the epigraphic tasks as proxies. This approach aligns textual and contextually relevant parallel inscriptions by bringing them closer within the space.
We measure proximity in this embedding space using cosine similarity, which our interdisciplinary team identified as an effective metric during preliminary evaluations, to retrieve a list of parallel inscriptions. To construct the historically rich embedding space, we combine the output embeddings (emb) of Aeneas’ torso with the following formulation:
$$\mathrm{emb}_{\mathrm{context}}=\left(\mathrm{emb}_{\mathrm{torso}}^{t=1}+\frac{1}{N}\sum_{n=2}^{N}\mathrm{emb}_{\mathrm{torso}}^{t=n}\right)\div 2,$$
where \(\mathrm{emb}_{\mathrm{torso}}^{t=1}\) represents the torso’s first output (\(t=1\)), which aligns with the sentence prefix. This embedding is critical for the chronological and geographical attribution task heads. The subsequent outputs (\(t=2..N\), where \(N\) is the length of the input string, including the prefix symbol) align with the textual inputs and are used for the restoration task heads.
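A minimal NumPy sketch of this formulation and of cosine-similarity retrieval over precomputed embeddings follows; the array shapes and function names are assumptions.

```python
# Illustrative contextual embedding (matching the formula above) and
# cosine-similarity retrieval of parallel inscriptions.
import numpy as np

def context_embedding(torso_outputs: np.ndarray) -> np.ndarray:
    """torso_outputs: array of shape (N, d); row 0 is the prefix output (t = 1)."""
    n = torso_outputs.shape[0]
    prefix = torso_outputs[0]                     # emb_torso at t = 1
    rest = torso_outputs[1:].sum(axis=0) / n      # (1/N) * sum over t = 2..N
    return (prefix + rest) / 2.0

def retrieve_parallels(query: np.ndarray, corpus: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k most similar inscriptions by cosine similarity."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]
```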
To demonstrate the potential of Aeneas’ historically rich embeddings in the contextualization task, we compare their performance against textual embeddings derived from a multilingual T5 model that includes Latin in its training set. Specifically, we focus on the chronological and geographical attribution tasks. For chronological attribution, we use a colour scale transitioning from blue (earliest dates in the dataset) to red (latest dates). For geographical attribution, we apply a colour scale based on the geographical coordinates of 62 provinces, with yellow representing the north, red representing the west, green representing the east and blue representing the south.
Extended Data Fig. 1 presents a visualization of the embedding spaces using uniform manifold approximation and projection (UMAP) dimensionality reduction84. Although it is important to acknowledge the inherent limitations of directly interpreting UMAP projections, the embeddings derived from Aeneas appear to exhibit smoother distributions and greater alignment with chronological and geographical labels. These observations suggest that Aeneas’ embeddings may better capture the underlying structure of historical context, as evidenced by the clearer separation of clusters. By comparison, the embeddings generated by T5 display a greater overlap, thereby indicating potential challenges in distinguishing contextual attributes. This highlights the effectiveness of Aeneas’ embeddings in capturing historical information and suggesting relevant parallel texts from similar epigraphic contexts85,86,87,88.
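For reference, a 2D projection of this kind can be sketched with the umap-learn package as below; the parameters are defaults, not the settings used for Extended Data Fig. 1.

```python
# Illustrative UMAP projection of contextual embeddings for visualization.
import umap  # assumed dependency: umap-learn

def project_2d(embeddings):
    """Reduce (num_inscriptions, d) embeddings to two dimensions."""
    return umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)
```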
Our interdisciplinary team further evaluated various trained retrieval methods, including embedding the texts and their metadata or using them as raw inputs. However, owing to the limited dataset size, our preliminary evaluation revealed that similarity scoring with Aeneas’ embeddings yielded the most relevant inscriptions, and this intuition was supported by the evaluation of expert historians.
Evaluating Aeneas
Task metrics
We adopt the evaluation framework proposed by Assael et al.15 for the tasks of restoration, geographical attribution and chronological attribution, while further refining it to enhance consistency and interpretability.
For textual restoration, the difficulty increases with the number of characters to be reconstructed. As described above, our evaluation pipeline artificially corrupts arbitrary spans of text to produce targets for restoration. To ensure a fair comparison of this stochastic pipeline across different levels of difficulty, we calculate performance metrics based on sequence length. Specifically, we compute the CER for each sequence length (ranging from 1 to 20 characters) as follows:
$$\mathrm{CER}_{l}=\frac{1}{\sum_{i=1}^{N}I_{\mathrm{len}_{i}=l}}\sum_{i=1}^{N}I_{\mathrm{len}_{i}=l}\times \frac{\mathrm{edit}\,\mathrm{distance}(\mathrm{pred}_{i},\mathrm{target}_{i})}{l},$$
where \(I\) is the indicator function, \(\mathrm{len}_{i}\) denotes the length of the ith sample, \(N\) is the total number of samples, \(\mathrm{pred}_{i}\) represents the predicted sequence and \(\mathrm{target}_{i}\) corresponds to the ground truth. We then average the CER values across all sequence lengths:
$$\mathrm{CER}_{\mathrm{score}}=\frac{1}{L}\sum_{l=1}^{L}\mathrm{CER}_{l},$$
where \(L=20\) represents the maximum sequence length used in the evaluation. Additionally, we calculate the top-20 accuracy following the same stratified approach.
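A compact sketch of this length-stratified metric is given below; it assumes a Levenshtein edit-distance helper (for example, the python-Levenshtein package), and lengths with no samples are simply skipped in this simplified version.

```python
# Illustrative length-stratified character error rate (CER), as defined above.
from collections import defaultdict
import Levenshtein  # assumed dependency providing Levenshtein.distance

def cer_score(preds: list[str], targets: list[str], max_len: int = 20) -> float:
    """Average the per-length CERs over target lengths 1..max_len."""
    per_length = defaultdict(list)
    for pred, target in zip(preds, targets):
        l = len(target)
        if 1 <= l <= max_len:
            per_length[l].append(Levenshtein.distance(pred, target) / l)
    cers = [sum(v) / len(v) for v in per_length.values()]
    return sum(cers) / len(cers)
```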
For geographical attribution, we evaluate performance using standard top-1 and top-3 accuracy metrics. While top-1 accuracy measures the model’s ability to pinpoint the correct province out of 62, top-3 accuracy provides additional insights by assessing its capacity to offer plausible alternative suggestions, aiding historians in their analysis. Finally, for chronological attribution, the model generates a predictive distribution over possible dates. We use an interpretable metric to evaluate the temporal proximity between predictions and ground truth. The distance is computed based on the relationship between the predicted mean \(\mathrm{pred}_{\mathrm{avg}}\) and the ground-truth interval defined by its minimum (\(\mathrm{gt}_{\min}\)) and maximum (\(\mathrm{gt}_{\max}\)) boundaries:
$$\mathrm{Years}=\left\{\begin{array}{ll}0, & \text{if}\ \mathrm{gt}_{\max}\ge \mathrm{pred}_{\mathrm{avg}}\ge \mathrm{gt}_{\min},\\ |\mathrm{pred}_{\mathrm{avg}}-\mathrm{gt}_{\max}|, & \text{if}\ \mathrm{pred}_{\mathrm{avg}} > \mathrm{gt}_{\max},\\ |\mathrm{pred}_{\mathrm{avg}}-\mathrm{gt}_{\min}|, & \text{if}\ \mathrm{pred}_{\mathrm{avg}} < \mathrm{gt}_{\min}.\end{array}\right.$$
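For clarity, this piecewise distance can be transcribed directly as a small function (a sketch; the function name is ours):

```python
def years_distance(pred_avg: float, gt_min: float, gt_max: float) -> float:
    """Zero inside the ground-truth interval, otherwise distance to its nearest bound."""
    if gt_min <= pred_avg <= gt_max:
        return 0.0
    return abs(pred_avg - (gt_max if pred_avg > gt_max else gt_min))
```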
Onomastics baseline
Personal names provide valuable insights for epigraphers, often serving as key indicators in attribution predictions89. Building on their significance within the broader epigraphic workflow, we introduce an onomastics baseline that exclusively leverages metadata derived from these personal names. Unlike earlier studies15, which apply this method to a limited subset of data using human evaluators, our approach fully automates the process, enabling its application across the entire evaluation dataset and improving scalability. In the absence of a digital pre-compiled list of Roman onomastic components, we adapt the repository of proper names provided by the Classical Language Toolkit (https://cltk.org/). From this list, we manually removed 350 items that did not represent proper names, excluded shorter entries (one or two characters) due to their ambiguous usage, and eliminated those containing non-Latin characters, resulting in a curated list of approximately 38,000 proper names. The resulting list is available on our GitHub repository. To enhance the robustness of our method, we identify the most frequent word unigrams, bigrams and trigrams within the dataset (to capture tria nomina and other Roman onomastic features), retaining only those appearing more than five times. We further filter these n-grams to include only those composed entirely—or as a combination—of entries from the curated proper name list. For each identified n-gram, we compute the average chronological and geographical distributions across the training dataset, based on the ground truths of the texts in which they appear. Finally, when analysing a new inscription, we check which of these n-grams occur, aggregate their associated statistics, and use them to predict both the date and provenance of the inscription.
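A schematic sketch of this baseline follows. It simplifies the aggregation (a mean date and province counts rather than full distributions), and the record fields and function names are hypothetical rather than the published implementation.

```python
# Illustrative onomastics baseline: frequent word n-grams built entirely from
# a curated list of proper names are associated with the average date and the
# province counts of the training texts in which they occur.
from collections import Counter, defaultdict

def name_ngrams(text: str, names: set[str], max_n: int = 3):
    """Yield word 1- to 3-grams composed entirely of curated (lower-cased) proper names."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            if all(w in names for w in gram):
                yield gram

def build_statistics(corpus, names, min_count: int = 5):
    """Map each frequent onomastic n-gram to its mean date and province counts."""
    counts, date_sums, prov_counts = Counter(), defaultdict(float), defaultdict(Counter)
    for rec in corpus:                        # rec: {"text", "date", "province"}
        for gram in set(name_ngrams(rec["text"], names)):
            counts[gram] += 1
            date_sums[gram] += rec["date"]
            prov_counts[gram][rec["province"]] += 1
    return {g: (date_sums[g] / c, prov_counts[g])
            for g, c in counts.items() if c > min_count}

def predict(text: str, names: set[str], stats):
    """Aggregate statistics of all matching n-grams to estimate date and province."""
    matches = [stats[g] for g in set(name_ngrams(text, names)) if g in stats]
    if not matches:
        return None
    dates = [d for d, _ in matches]
    provinces = sum((p for _, p in matches), Counter())
    return sum(dates) / len(dates), provinces.most_common(1)[0][0]
```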
Historian–AI evaluation ethics protocol
One of the central components of this research was the historian–AI evaluation, the largest conducted to date. The goal was to assess the effectiveness of Aeneas’ contextualization mechanism as a foundational tool in historical research. Our specially developed ethics protocol received a favourable ethical opinion from the Faculty of Arts Research Ethics Committee of the University of Nottingham. The evaluation involved 23 epigraphers who responded to our call for participants. All responses were anonymized. Each participant was assigned five target inscriptions, presented as text transcriptions without metadata or images. The evaluation consisted of three consecutive stages per inscription, conducted via an online Google Form that was programmatically generated and populated for each participant.
In stage 1, experts performed the three epigraphic tasks (textual restoration, geographical and chronological attribution) independently, without AI assistance. In stage 2, they were provided with 10 parallels retrieved by Aeneas from the LED training set of 141,000 inscriptions, and repeated the same tasks on each inscription. In stage 3, experts also received Aeneas’ predictions and saliency maps to complete the same epigraphic tasks a final time. All experts completed stage 1, and for each inscription they were subsequently assigned to stage 2 or stage 3 in an alternating sequence. At the end of each stage, participants completed a brief survey to assess their confidence in their predictions for the three tasks and their subjective experience of using Aeneas’ contextualization aid. In this paired evaluation, during stage 3 two historians analysed the same inscription under different configurations (that is, one with parallels, the other with parallels and predictions). Thus, variations observed in the initial solo evaluation reflect the participants’ diverse backgrounds, which ranged from master’s students to professors, with a roughly equal split between early career and senior researchers. Participants also differed in their experience of working with inscriptions: while some regularly edit newly discovered texts for publication, others engage primarily in historical analysis of already-published material. This distinction between ‘primary’ and ‘secondary’ epigraphic work is important, as it highlights the broader relevance of Aeneas for scholars working with established corpora, where restorations, datings or provenances may be taken for granted but still warrant critical reassessment.
The evaluation had a maximum time limit of 2 h. To adhere as closely as possible to traditional epigraphic workflows (where scholars consult encyclopaedic resources to find relevant parallels), while acknowledging the artificial constraints of the experimental evaluation, participants were allowed to manually search for parallels using the provided ‘Parallel Searching Dataset’. This online spreadsheet, extrapolated from the LED training set, comprised 141,000 texts with associated metadata (place and date of writing), excluding the evaluated inscriptions. Participants were required to note the unique identifiers of all the manually retrieved parallels they used in a designated field within the evaluation form. To ensure impartiality and prevent inadvertent exposure to the evaluated inscriptions, participants were barred from accessing online epigraphic datasets (such as EDR, EDH and EDCS), print editions, search engines or generative AI tools during the evaluation.
Evaluating contextualization
To assess the effectiveness of Aeneas’ contextualization mechanism, we counted how many of its suggested parallel inscriptions historians independently incorporated into their manually retrieved list of parallels during stage 2. Historians incorporated an average of 1.5 parallel inscriptions suggested by Aeneas into their own list of parallels (values ranged from 0 to 6; median: 1; interquartile range: 0–2.5).
We further measure the historians’ confidence in their predictions across the three stages: it increases by an average of 23% when Aeneas’ parallels are provided (restoration from 60.4% to 68.7%, geographical attribution from 46.6% to 57.0%, chronological attribution from 43.7% to 57.5%). Historians’ confidence increases by an additional 21% when Aeneas’ predictions for the three tasks are also shared (restoration from 53.3% to 75.4%, geographical attribution from 48.7% to 67.0%, chronological attribution from 44.1% to 67.5%).
Finally, we solicited feedback from historians on whether they found that the parallel texts provided by Aeneas served as effective starting points for historical inquiry. When only Aeneas’ parallels were provided, 75% of historians agreed (38.3% to a great extent, 36.7% somewhat, 20% very little, 5% not at all). When Aeneas’ predictions for the three epigraphic tasks were also included, agreement increased to 90% (45% to a great extent, 45% somewhat, 6.7% very little, 3.3% not at all).
Historians’ qualitative feedback
As part of the historian–AI evaluation, we sought qualitative feedback from participants on their subjective experience of using Aeneas in their evaluation. Historians consistently emphasized the value of Aeneas’ contextualization mechanism in providing relevant textual and contextual parallels for carrying out the epigraphic tasks on the target inscriptions. A selection is included below:
- “The parallels retrieved by Aeneas completely changed my historical focus. […] it would have taken me a couple of days rather than 15 min [to find these texts]. Were I to base historical interpretations on these inscriptions’ readings, now I would have days to write and frame the research questions rather than finding parallels.”
- “The help of parallel inscriptions is great for understanding the type of inscription, […] whereas my own search became more narrow.”
- “The predictions are very good – as are the preponderance of [parallels for] freed person inscriptions that Aeneas produced. The Statilii Tauri being a prominent family would mean that rabbit holes may be easy to fall down.”
- “The help of more parallel inscriptions is great for understanding the type of inscription of fellow soldiers setting up inscriptions, whereas my own search became more narrow on training in on a set of inscriptions from Noricum. [Aeneas offers] a nice parallel tool.”
- “The parallels retrieved by Aeneas completely changed my perception of the inscription from stage 1. I did not notice details that made all the difference in both restoring and chronologically attributing the text.”
- “Each task was made qualitatively more doable thanks to Aeneas’ retrieved texts, some of which I had completely missed by solo searching.”
- “Aeneas retrieved a very useful parallel (a formula) that I had not found in the dataset.”
- “The top parallel [for this inscription] was found independently by both me and Aeneas.”
- “[Aeneas shows an] impressive capacity to broaden and, at the same time, refine my [parallel] search results.”
Three key themes emerged from the historians’ feedback. First, historians highlighted how Aeneas significantly reduced the time required to find relevant parallels, allowing them to focus on deeper historical interpretation and framing research questions. This efficiency also enabled them to explore broader and more refined sets of parallels that traditional historical methods might have missed. Second, they confirmed that Aeneas’ retrieved parallels provided valuable insights into the type and context of inscriptions, aiding them in the three epigraphic tasks. Finally, they emphasized Aeneas’ ability to broaden searches by identifying significant but previously unnoticed parallels and overlooked textual features, while simultaneously refining results to avoid overly narrow or irrelevant findings.
Some contributors noted challenges with the experimental conditions of the evaluations. First, the imposed time limit, although necessary, acted as a constraint, as historians typically have weeks or months to access materials in standard research settings. Second, the ‘Parallel Searching Dataset’ online spreadsheet was less easily searchable than specialized corpora (such as Roman Inscriptions of Britain90 and I.Sicily91, which offer refined filtering and cross-searching functionalities for identifying exact textual parallels, as well as a range of additional contextual data regarding form, iconography and archaeological setting). Such artificial limitations were, regrettably, unavoidable due to the constraints inherent in simulating real-world research workflows under experimental conditions. A further observation advanced by some contributors concerned Aeneas’ suitability for extremely short, fragmentary, or formulaic inscriptions—particularly those involving abbreviated names—where any guess, whether made by a human expert or an AI model, is inherently risky:
- “None of the parallels really help in this case. The gap precedes a fragmentary gentilicium in nominative, so once you restore the nomen, what remains is most likely an abbreviated praenomen. […] It is particularly difficult to use with personal names. Any option would still be very risky.”
- “This was an extremely short and vague funerary text, it’s impossible to restore with high certainty. It would seem […] that Aeneas retrieves texts which are thematically or stylistically similar to the target text (as one would hope!); however, in the case of funerary epigraphy, these parallels are of as little use to the epigrapher as manually retrieved parallels! One simply wouldn’t use Aeneas for such a text.”
On the other hand, Aeneas’ ability to retrieve parallels for these short, standardized texts was praised, as it went beyond basic string matching to identify salient formulaic features, even from the limited text available.
In sum, the evaluated historians’ qualitative feedback underscores Aeneas’ strengths as a research tool: its speed and the historically enriched depth of the parallels it retrieves enable it not only to accelerate research, but also to open new avenues of historical inquiry.
Aeneas’ limitations
Despite the overall positive feedback from historians, we acknowledge that Aeneas’ performance may vary across the entire geographical and chronological scope of the LED dataset. Although the model learns representative patterns for regions and periods, a number of additional factors beyond changes in language over time and space underlie this variability. To provide a quantitative assessment of Aeneas’ limitations and performance variations, we conduct an error analysis for geographical and chronological attribution across all provinces and decades using LED’s test set. Furthermore, to put these results into perspective, we plot the number of inscriptions available for each province and decade in LED’s training set. A detailed analysis of these metrics for individual provinces and periods can be found in Extended Data Figs. 3–6.
Explaining this observed variance is challenging, and would constitute a research project in itself. Within the scope of this work, two principal sources can be assumed. The first is the availability of data. On the one hand, rates of publication of inscriptions vary from region to region and also by period within regions (owing to the resources available for study, specific focuses of interest, and so on). On the other hand, even when a region is well published, it does not automatically follow that the data have been systematically incorporated into existing digital resources (the principal online databases such as EDR have specific geographical focuses, and not all regions are equally well covered). The second is the inherent variability in the cultural practice of inscribing texts in Latin across the Roman Empire in both time and space, meaning that even where a region has been well studied and documented, the quantity of material may still be very limited compared with other regions. A subsidiary consideration, which may be implied by the variability in performance, is the extent to which that cultural practice actually varies from one region or period to another; but to approach that question would require substantial further work. Assessment of the representativeness of the data remains somewhat impressionistic. Some high-level patterns can, however, be identified to illustrate the variation and possible contributing factors.
Perhaps most obviously, we see that Aeneas exhibits the highest performance in chronological attribution around 200 ce. This correlates directly with the period for which we have the most inscriptions; this peak in the Latin ‘epigraphic habit’ has been frequently observed. It can be tentatively argued, however, that this is also the period for which we have the highest number of closely dated inscriptions, meaning that it is not simply the period for which we have the most data, but also the period for which we have the best data. The increase in accuracy for the later third century bce, on the other hand, does not correlate so directly with the number of inscriptions. Arguably, this reflects the relatively rapid evolution of the written Latin language in this period, in combination with a relatively rapid increase in the practice of inscribing texts (almost entirely restricted to Italy at this date), such that this is a period to which texts can be dated with some accuracy. By contrast, the earlier texts are both very few in number and traditionally difficult to assign to a narrow window in time.
When considering geographical variation, although there is a positive correlation between high availability of texts and high accuracy of attribution (for example, Roma and Africa Proconsularis), two particular sets of variation can perhaps be highlighted. First, several regions of ancient Italy (such as Apulia et Calabria, Aemilia, Etruria and Samnium) offer large numbers of inscriptions, but poor accuracy. A likely explanation for this is the division of Italy into its ancient regions, in contrast to the division of the rest of the dataset into the larger provincial divisions of the Empire, loosely equivalent to modern countries. It is not unlikely that the level of linguistic and cultural variation within ancient Italy is insufficient to permit the model to differentiate so finely; were all the data from the Italian regions to be amalgamated, the accuracy of attribution to ‘Italia’ would probably be very high. However, the apparent distinctiveness of the city of Rome in comparison to the rest of Italia is notable. Second, and in direct contrast, several more remote parts of the Empire (such as Aegyptus, Cappadocia, Arabia and Cyrenaica), which produce fewer Latin inscriptions (both in terms of data recording and in terms of the original epigraphic production), nonetheless show a higher level of accuracy of attribution. This can be assumed to reflect greater regional linguistic and cultural distinctiveness in the content of the inscriptions. Finally, two contrasting examples illustrate the underlying problem of data representativeness. Sicily and Sardinia are traditionally associated with a rather weak epigraphic culture (that is, underproduction), but are also relatively poorly documented in the datasets: this is reflected both in relatively low numbers of texts and in particularly poor accuracy. By contrast, Roman Britain is also traditionally described as having a very weak epigraphic culture; however, it is one of the best documented epigraphic traditions in modern studies, and consequently shows a relatively high number of texts; it also shows a high level of accuracy in the model, suggesting significant regional variation.
Given the space constraints of this Article and the extensive scope of potential analysis, we have limited our discussion here to identifying illustrative high-level patterns of failure cases. Preliminary observations indicate a positive correlation between the number of available inscriptions from a given historical period or Roman province and the model’s accuracy in dating or attributing them. However, further investigation is required to disentangle the effects of data availability from other contributing factors, such as the linguistic or epigraphic distinctiveness of certain regions or periods. A more in-depth examination of these nuances and potential mitigation strategies will be the focus of future work.
Modelling epigraphic networks with Aeneas
Parallels, patterns and provincial cult
To showcase the effectiveness of Aeneas’ contextualization mechanism for the retrieval of relevant epigraphic parallels, we chose a representative inscription of a well-attested type as a case study. The target inscription (CIL XIII, 6665) is an inscribed limestone votive altar from the Roman province of Germania superior, found in the city of Mogontiacum (modern-day Mainz) in 1895 during excavations of a city centre road. The altar can be dated precisely thanks to internal dating cues: the Ides of July (15 July) in the year of the consulship of Gentianus and Bassus in Rome (211 ce) is explicitly mentioned as the date on which the altar was dedicated. The inscription records a dedication to the Deae Aufaniae (the Aufaniae goddesses) and Tutela loci (the local divine patron) by a beneficiarius consularis named Lucius Maiorius Cogitatus. Beneficiarii consulares were part of the Roman military staff (usually legionaries close to retirement) at the service of provincial governors across the Empire, and are well attested in the epigraphic evidence of the main cities, outposts, frontiers and major communication routes of the Western military provinces92. The beneficiarius Cogitatus will have been posted at Mogontiacum to assist the provincial governor in administrative, judicial and military duties. It was customary for beneficiarii to dedicate a votive altar, such as the one in question: more than 650 such inscriptions are known today, found especially in the provinces on the Rhine and Danube93,94,95. Some of these altars were dedicated to the Matronae Aufaniae (as they are more commonly referred to in the epigraphic evidence, the title Deae Aufaniae being quite rare), local goddesses whose cult was particularly well attested in the Rhineland under Roman occupation96,97.
Aeneas’ performance across the three epigraphic tasks for this inscription effectively demonstrates its receptiveness to distinctive geographical, chronological, linguistic and cultural features (Extended Data Fig. 2). Aeneas’ dating average for this altar is 214 ce, which is well within the 10-year range the model is trained on, and its top-3 geographical attributions are Germania superior (correct), Germania inferior and Pannonia superior. Looking at Aeneas’ attribution saliency maps, there is a clear focus on the historically specific personal names of the two consuls serving that particular year (Gentiano et Basso cosulibus) and the worshipped goddesses (Deab(us) Aufan(iabus)), whose cult is particularly well attested in the regions identified by Aeneas. The saliency map of the image of this inscription also shows interesting results, highlighting a focus on the altar’s shape, layout and architectural-iconographical elements. This is a sound choice: the beneficiarii altars tend to have standardized designs (this particular altar corresponds to type D described by Frenz98 in CSIR De II 4). Finally, wishing to test Aeneas’ capacity to restore arbitrary text lengths, we artificially damaged 8 characters (‘loci pro’): Aeneas’ top-5 predictions for this unknown-length restoration sequence are all contextually and linguistically accurate, with its first restoration hypothesis (pro) being the more commonly attested version of the formula ‘Tutelae pro salute’, while the second hypothesis captures the more uncommon version of the formula ‘Tutelae loci pro salute’—the correct restoration for this inscription.
But the story of Cogitatus’ altar is far from over. In 2007, 112 years after the discovery of this altar, 12 similar altars were discovered during excavations of the State Chancellery in Mainz, less than 100 m from where the target text was found. The first of these new altars (FM 07-055 No. 16 – EDCS-71100087) was published in 2017 (ref. 99), and is included in the LED training data. The 11 other altars were fully published only in 2023 (ref. 30), and they do not appear in LED given their recency. This second altar was also dedicated to the Deae Aufaniae by the beneficiarius Iulius Bellator on the Ides of July in the year of Lateranus and Rufinus’ consulship (197 ce). This was the year in which emperor Septimius Severus defeated the imperial pretender Clodius Albinus at Lugdunum (Lyon) in a bloody battle, and the phrasing pro salute et incolumitate sua suorumq(ue) omnium appearing in this inscription could even be related to Bellator’s gratefulness to the goddesses for having survived the battle unscathed30. The textual formula is extremely rare, and has only one known parallel in Aeneas’ training dataset: Cogitatus’ altar from 211 ce. This observation led Haensch99, in his first edition of the text, to note: “Die Ähnlichkeiten im Formular zur Stiftung des Bellator sind zu groß, um zufällig zu sein” (the similarities in the formula for Bellator’s donation are too great to be coincidental). Haensch believes that the two altars stood together in a ritual space, and that the beneficiarius Cogitatus who dedicated CIL XIII, 6665 might actually have copied the earlier text by Bellator (both altars belong to the D type figurative design).
Aeneas was also able to identify this crucial parallel for Cogitatus’ target text, precisely as an expert historian with knowledge of the archaeological and historical context would have: the first parallel retrieved by Aeneas’ contextualization mechanism is Bellator’s inscribed altar from 197 ce. Moreover, Aeneas retrieved additional parallels across Germania superior (HD024937, HD017045, HD072700, HD072701 and HD042511), Germania inferior (HD080071) and Pannonia (HD072564, HD033669 and HD051735)—not through exact text-string matches but by recognizing historical, linguistic and epigraphic affinities to the target text. This is a crucial distinction: although historians often use resources such as EDCS_ETL for speedy searches of exact text strings or formulae, such tools depend on the user formulating precise queries and anticipating what variations might exist. This limits the discovery of related formulae or similar onomastic patterns that fall outside expected templates. Aeneas, by contrast, navigates these constraints by identifying subtle and meaningful historical connections beyond literal matches, in ways that mirror expert-level reasoning, despite not having previous knowledge of the archaeological context or spatial relation between inscriptions—features the historian typically uses to guide their interpretation.
These findings underscore the reliability and power of Aeneas as a tool for reconstructing epigraphic networks. Its capacity to emulate and extend historical inquiry highlights its potential as a transformative tool for epigraphic scholarship, producing predictions and blockociations that consistently align with those a domain expert might draw, despite operating without access to archaeological or spatial context.
Compositional complexities of the RGDA
The predictions, parallels and saliency maps produced by Aeneas for the RGDA mirror the complexity of this inscription. While the details of Aeneas’ analysis have been addressed above (and are complemented by Extended Data Table 2, which summarizes the main historical features identified as chronologically specific through Aeneas’ saliency maps), the debates around the RGDA will now be expounded to add nuance and background to Aeneas’ analysis.
Whereas the version of the text inscribed at Ancyra (known as the Monumentum Ancyranum) was created in c.19 ce29, the text itself of the RGDA was first ‘published’ when read out to the Senate in 14 ce, shortly after Augustus’ death, and was then inscribed outside his Mausoleum in Rome. Scholars debate, however, when the RGDA was composed: whether it was drafted as an evolving document in various stages throughout Augustus’ reign (starting as early as 2 bce) or as a unified, retrospective account of his achievements composed near the end of his life (13–14 ce). In contrast to other inscriptions, therefore, the date of inscribing in this case is different from the date of composition.
At the end of the RGDA (35.2), Augustus concludes with the statement “When I wrote this I was in my seventy-sixth year” (cum scripsi haec, annum agebam septuagensumum sextum). This ‘seventy-sixth year’ should be understood as the last year of his life, that is, the period between the celebration of his birthday on 23 September 13 ce and his death on 19 August 14 ce. Despite this clear statement, most scholars have assumed that it is misleading, supporting the view that it refers only to a moment of final revision of the text rather than to its complete composition. Some have also argued100,101 that Tiberius carried out final revisions to the text after Augustus’ death, updating information relating to the years 13 ce and 14 ce, such as adding into chapter 4.4 the thirty-seventh grant of tribunician power to Augustus on 26 June 14 ce. Various proposals have also been put forward in support of the idea that it is possible to trace compositional layers in the text, with the most popular argument being that the RGDA was substantially completed by 2 bce, with other drafts and emendations perhaps occurring in 23 bce, 12 bce, 4 bce, 1 ce, 6 ce and 14 ce. This debate about how to identify compositional layers in the RGDA was summarized by Kornemann102 and further evaluated by Gagé103. The idea remained influential in the recent commentary by Scheid104, who also argues against the idea of Augustus as the author of the text in any meaningful sense. Both Ramage105 and Cooley29, however, have proposed a more straightforward approach that sees Augustus as composing the whole text during the last year of his life, and so taking at face value what he writes at the end of his account, cited above. In particular, it is suggested that the text was essentially put together during the summer of 14 ce, perhaps even between 26 June and his final departure from Rome on 24 July.
The question is not just of ‘academic interest’, as Augustus’ approach to composing his account of his life’s achievements has implications for his understanding of his contribution to Roman history. Did he feel, by 2 bce, that he had essentially achieved the pinnacle of his career, at the moment when he was hailed as ‘father of the fatherland’, and was this something that prompted him to compose his RGDA? Or was it only in 14 ce that he felt the need to compose a partisan account of his lifetime’s achievements, justifying his position in Roman society and wishing to influence the way in which he was to be remembered by future generations106?
These long-standing debates about the composition of the RGDA highlight the interpretive complexities that Aeneas was designed to engage with, as shown in the ‘Grounding the Res Gestae Divi Augusti’ section. This case study demonstrates how Aeneas can support historical workflows by testing existing hypotheses against linguistic patterns in the text, and by complementing expert-led interpretation with quantitative historical analysis.
Teaching with Aeneas in the classroom
To maximize Aeneas’ impact, we partnered with the teacher training programme at the University of Ghent and Sint-Lievenscollege Ghent to co-design a course for educators and high school students. Building on previous work on the Ithaca model for ancient Greek inscriptions107 (https://www.robbewulgaert.be/education/ithaca-teaching-history-journal, AI & Greek Ithaca—syllabus), which was recognized by the Teaching History Journal and twice honoured at the European AI for Education Awards, this new curriculum bridges AI and ancient history, with a pedagogical focus that centres Aeneas in the learning process (https://www.robbewulgaert.be/education/predicting-the-past-aeneas, AI & Latin Aeneas—syllabus). The course shifts the focus to Latin inscriptions and their contextualization, allowing students to engage directly with primary sources in classical studies while exploring novel AI methods. Now incorporated into the in-service teacher training programme at the University of Antwerp, the course showcases practical applications of AI in the humanities while promoting digital literacy.
This curriculum aligns with the European Union’s Digital Competence Framework for Citizens (DigComp) and UNESCO’s AI Competency Framework for Students, addressing key competencies such as critically evaluating AI-generated outputs, adopting a human-centred approach, and applying human oversight in interdisciplinary contexts.
Future directions
Aeneas demonstrates the transformative potential of AI in augmenting historical research, yet several avenues offer promising prospects for future development. One key direction involves integrating Aeneas’ capabilities into large-scale dialogue models. This could enable more natural and interactive research workflows, allowing historians to query the system, probe the model’s answers and receive better explanations. Addressing the inherent uncertainty in historical data, particularly concerning chronological attribution, remains a critical challenge. Future work could focus on developing better methods for representing and evaluating wide dating brackets, both within the model’s architecture and through refined evaluation metrics that better capture the nuance of historical dating practices beyond distance from estimated ranges. A further opportunity lies in conducting additional ablation studies to quantify the contribution of different components (such as the impact of visual inputs on different tasks), as well as exploring how contextual parallels change with different textual inputs and how sensitive the system is to variations in input formatting (and across different types of inscriptions). Improving the multimodal capabilities with larger, highly standardized datasets adhering to the FAIR principles (https://www.go-fair.org/fair-principles/), while broadening the scope beyond Latin inscriptions, is also a rewarding research direction. This would allow a deeper exploration of the visual modality’s potential beyond geographical attribution, potentially informing chronological dating through iconographic or otherwise archaeologically informed analysis. Finally, we believe that deepening interdisciplinary collaborations is paramount: we hope that future projects continue to build along the path of bridging the humanities and the sciences.
Ethics and inclusion statement
This study was developed through a collaborative, interdisciplinary approach, bringing together ancient historians, computer scientists and educational experts, ensuring diverse perspectives throughout the research process. Capacity-building was central to this effort, enabling effective communication between disciplines and the exploration of meaningful research questions, while leveraging state-of-the-art technology to advance our understanding of ancient history. Central to this project was the recognition that epigraphy serves as a key source of direct evidence for understanding a wide range of social groups in the ancient world, including not only emperors and élites, but also marginalized and subaltern groups such as enslaved individuals, women, and other voiceless communities. This focus on the diversity of ancient social identities highlights the crucial role of epigraphic data in challenging dominant narratives and fostering a more inclusive understanding of the Roman world.
Although our methods hold great promise in advancing historical research, we are mindful of the risks associated with the misuse of AI. The potential for misclassification or misrepresentation of historical data is a notable concern, particularly in the Roman world, where AI models could inadvertently reinforce biased or inaccurate readings of the past. It is essential that AI tools are deployed with human oversight, as blind reliance on the comprehensiveness of automated methods risks distorting historical interpretations. We also emphasize that AI should complement rather than substitute human expertise in the humanities. Our approach aims to alleviate the enormous effort and time-consuming nature of processing and analysing large datasets, allowing historians to focus on the critical interpretation and contextual analysis of ancient texts. This collaboration between AI and human scholarship is crucial for advancing responsible and ethical practices in digital humanities research. Finally, we are committed to promoting the responsible use of AI in humanities research, with an emphasis on ensuring that AI tools are used thoughtfully and transparently. By integrating interdisciplinary expertise, we aim to foster a more responsible and inclusive approach to the application of AI in the study of ancient history.
Ethics approval statement
Ethical approval for the historian–AI evaluation protocol was granted by the ethics board of the School of Humanities, University of Nottingham. The protocol adhered to rigorous ethical standards and received a favourable ethical opinion from the Faculty of Arts Research Ethics Committee at the University of Nottingham. In accordance with the approved protocol, all participants were provided with a participant information sheet, a participant consent form and a General Data Protection Regulation (GDPR) privacy notice. A comprehensive data management plan was developed, and an ‘awareness of ethical behaviour for data collection’ form was completed. Evaluation designers also completed two mandatory online courses—Research Integrity and Human Subjects Protections—offered by the University of Nottingham’s Researcher Academy, ensuring alignment with the highest standards of research ethics and integrity. To ensure inclusivity, the 23 expert researchers involved in the human evaluations reflected gender diversity (11 male and 12 female) and career-stage representation (early career researchers working alongside full professors).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.